Thyroid Disease Prediction

Prasad Chaskar

Model Overview

Introduction :

Thyroid disease is among the diseases that afflict the world's population, and the number of cases is increasing. Because medical reports show serious imbalances in thyroid disease, this study deals with classifying thyroid disease as hyperthyroidism or hypothyroidism.
The classification was done with machine learning algorithms.
Several algorithms gave good results, and the work was built in the form of two models.

Required Libraries :
1. pandas
2. numpy
3. matplotlib
4. sklearn (scikit-learn)
5. seaborn
6. imblearn (imbalanced-learn)

Problem Statement :

To predict whether a person has thyroid disease based on various biological and physical parameters of the body.
To build a model with high accuracy and precision that can predict the results with greater confidence.

Data Description:

The dataset contains 3772 training instances and 3 classes.
Dataset Link - https://archive.ics.uci.edu/ml/datasets/thyroid+disease

Features :

- age: The person's age in years.
- sex: The person's sex.
- TSH (thyroid-stimulating hormone): Blood test that measures this hormone.
- T3: Triiodothyronine, a thyroid hormone.
- thyroid surgery: Whether the person has had thyroid surgery.
- TT4: Total thyroxine (T4), the main form of thyroid hormone made by the thyroid gland.
- T4U: T4 uptake measurement.
- FTI: Free thyroxine index.

Import Libraries:

# pandas
import pandas as pd

# numpy
import numpy as np

# matplotlib
import matplotlib.pyplot as plt

# seaborn
import seaborn as sns

# sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

# imbalanced-learn
from imblearn.over_sampling import SMOTE

import pickle
import warnings
warnings.filterwarnings('ignore')

Read Data from csv :

thyroid_df = pd.read_csv('hypothyroid.csv')
thyroid_df.head()

Handling Missing Values :

thyroid_df.isnull().sum()

# fill missing values with mean of that column.
miss_cols = ['FTI','TSH','T3','TT4','T4U']
for i in miss_cols:
    thyroid_df[i] = thyroid_df[i].fillna(thyroid_df[i].mean())

# drop remaining null values
thyroid_df.dropna(inplace=True)

# change dtype of columns which are initially objects
thyroid_df.TT4 = thyroid_df.TT4.astype(int)
thyroid_df.FTI = thyroid_df.FTI.astype(int)
thyroid_df.age = thyroid_df.age.astype(int)
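
The mean-imputation step can be illustrated on a small made-up frame. This is a minimal sketch, not the notebook's code: the values and the name `toy_df` are hypothetical, and it assumes missing readings might arrive as the string '?', which `pd.to_numeric(errors="coerce")` converts to NaN before the mean is taken.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); '?' stands in for a missing reading
toy_df = pd.DataFrame({
    "TSH": ["1.0", "?", "3.0"],
    "T3": [2.0, np.nan, 4.0],
})

for col in ["TSH", "T3"]:
    # Coerce non-numeric placeholders such as '?' to NaN
    toy_df[col] = pd.to_numeric(toy_df[col], errors="coerce")
    # Replace NaN with the mean of the observed values in that column
    toy_df[col] = toy_df[col].fillna(toy_df[col].mean())

print(toy_df)
```

The mean is computed only from the observed values, so each gap is filled with the average of the other rows in the same column.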

EDA :
Count plot for Target Column :

sns.countplot(x='Label',data=thyroid_df)
plt.title("Countplot for Target variable");

Perform EDA for Positive class :

positive_df = thyroid_df[thyroid_df.Label=='P']

plt.figure(figsize=(9,6))
sns.histplot(x='age',data=positive_df,color='blue')
plt.title("Distribution of Positive Class Based on Age",{'fontsize':20});

Inference :

Most patients who suffer from thyroid disease belong to the 50-70 age group.

Pie chart for Sex feature :

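plt.figure(figsize=(10,8))
# value_counts() sorts by frequency, so the hard-coded labels
# assume Female is the larger group among positive cases
plt.pie(x=positive_df.sex.value_counts(),
        labels=['Female','Male'],
        startangle = 90,
        colors=['springgreen','orange'],
        autopct='%.2f'
       );
plt.legend();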

Inference :

More female patients than male patients have the disease.

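Pie chart for Sick feature :

plt.figure(figsize=(8,8))
# value_counts() sorts by frequency, so the hard-coded labels
# assume 'Sick' is the larger group among positive cases
plt.pie(x=positive_df.sick.value_counts(),
        labels=['Sick','Well'],
        startangle = 20,
        colors=['deepskyblue','red'],
        autopct='%.2f',
        explode=[0,0.2]
       );
plt.legend();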

Transform non-numerical labels to numerical labels :

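# Separate the features from the target.
# Assumption: 'Label' is the only non-feature column in thyroid_df.
X = thyroid_df.drop('Label', axis=1)

# One encoder per column, so each mapping can be reused or inverted later
s_encoder = LabelEncoder()
si_encoder = LabelEncoder()
preg_encoder = LabelEncoder()
th_encoder = LabelEncoder()
treat_encoder = LabelEncoder()
lith_encoder = LabelEncoder()
g_encoder= LabelEncoder()
tu_encoder = LabelEncoder()

X['sex'] = s_encoder.fit_transform(X.sex)
X['I131 treatment'] = treat_encoder.fit_transform(X['I131 treatment'])
X['sick'] = si_encoder.fit_transform(X.sick)
X['pregnant'] = preg_encoder.fit_transform(X.pregnant)
X['thyroid surgery'] = th_encoder.fit_transform(X['thyroid surgery'])
X['lithium'] = lith_encoder.fit_transform(X['lithium'])
X['goitre'] = g_encoder.fit_transform(X['goitre'])
X['tumor'] = tu_encoder.fit_transform(X['tumor'])

# Map the target label to 1 for the positive class ('P') and 0 otherwise
def func(label):
    if label == 'P':
        return 1
    else:
        return 0

# Assumption: func is applied to the 'Label' column to build the target
y = thyroid_df['Label'].apply(func)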
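
To make the encoding step concrete, here is a minimal sketch of what a single `LabelEncoder` does. The list `sex_values` is made up for illustration; the real column may use different codes.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column (made up); the real 'sex' column may use different codes
sex_values = ["F", "M", "F", "F"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sex_values)

# classes_ holds the original labels in sorted order;
# the position of a label in classes_ is the integer assigned to it
print(list(encoder.classes_))
print(list(encoded))

# The fitted encoder can map the integers back to the original labels
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Keeping a separate fitted encoder for each column means every mapping can be reapplied to new inputs at prediction time.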

Split Data :

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=11)
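
The split above does not pass `stratify`. As an optional variant (not the author's method), the sketch below shows how `stratify` keeps the class ratio the same in both parts, which can matter when the positive class is rare. The arrays `X_toy` and `y_toy` are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (made up): 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=11, stratify=y_toy
)

# With stratify, both parts keep the 10% positive rate of the full data
print("positives in train:", y_tr.sum(), "of", len(y_tr))
print("positives in test :", y_te.sum(), "of", len(y_te))
```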

Handle Imbalance Data :

smote = SMOTE(random_state=11)

x_smote, y_smote = smote.fit_resample(X_train, y_train)

print("Shape before the Oversampling : ",X_train.shape)
print("Shape after the Oversampling : ",x_smote.shape)

#Output
# Shape before the Oversampling :  (2896, 14)
# Shape after the Oversampling :  (5340, 14)
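
Assuming a binary target, the output above (2896 rows before, 5340 after) is consistent with both classes ending at 2670 rows each. The idea behind SMOTE can be sketched without imblearn: a synthetic minority sample is placed at a random position on the line between a minority point and one of its nearest minority neighbours. The sketch below is a simplification that always uses the single nearest neighbour, whereas imblearn's `SMOTE` picks among `k_neighbors` (5 by default). The points and the helper `synthesize` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy minority-class points (made up)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthesize(points, rng):
    """Create one synthetic point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    # Distance from the chosen point to every point in the set
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf  # exclude the chosen point itself
    neighbour = points[np.argmin(dists)]
    # Interpolate at a random position on the segment base -> neighbour
    lam = rng.random()
    return base + lam * (neighbour - base)

new_point = synthesize(minority, rng)
print(new_point)
```

Because each synthetic point is an interpolation between two real minority samples, it stays inside the region the minority class already occupies.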

Feature Scaling :

scalr = MinMaxScaler()
scale_cols = ['TT4', 'age', 'FTI']

# for training data: fit one scaler on all three columns together,
# so each column keeps its own min and max
x_smote[scale_cols] = scalr.fit_transform(x_smote[scale_cols])

# for testing data: reuse the statistics learned from the training data
X_test[scale_cols] = scalr.transform(X_test[scale_cols])
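
Min-max scaling maps each value to (x - min) / (max - min), with min and max taken from the training data only. A minimal sketch on made-up frames (`train` and `test` are hypothetical) shows the fit/transform split:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up)
train = pd.DataFrame({"age": [20, 40, 60], "TT4": [50, 100, 150]})
test = pd.DataFrame({"age": [30, 80], "TT4": [75, 100]})

scaler = MinMaxScaler()
# fit_transform learns the min and max of each column from the training data
train_scaled = scaler.fit_transform(train)
# transform reuses those training statistics; it does not refit
test_scaled = scaler.transform(test)

print(train_scaled)
print(test_scaled)
```

Test values outside the training range fall outside [0, 1]: here an age of 80 scales to 1.5 because the training maximum is 60.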

Build Models :

models = {
    LogisticRegression(max_iter=500):'Logistic Regression',
    SVC():"Support Vector Machine",
    RandomForestClassifier():'Random Forest'
}

for m in models.keys():
    m.fit(x_smote,y_smote)

for model,name in models.items():
    print(f"Accuracy Score for {name} is : ",model.score(X_test,y_test)*100,"%")

#Output
#Accuracy Score for Logistic Regression is :  98.20441988950276 %
#Accuracy Score for Support Vector Machine is :  98.20441988950276 %
#Accuracy Score for Random Forest is :  98.89502762430939 %

Classification Report :

for model,name in models.items():
    y_pred = model.predict(X_test)
    print(f"Classification Report for {name}")
    print("----------------------------------------------------------")
    print(classification_report(y_test,y_pred))
    print("----------------------------------------------------------")
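
The classification report lists precision, recall and F1 per class. These matter here because the test set is not oversampled, so accuracy alone can hide poor performance on the rarer class. As a minimal sketch with made-up labels (`y_true` and `y_hat` are hypothetical), the three metrics for the positive class can be computed directly:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels (made up): 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 0, 0, 0, 0]

# Here TP = 2, FP = 1, FN = 1, TN = 4
accuracy = accuracy_score(y_true, y_hat)     # (TP + TN) / total
precision = precision_score(y_true, y_hat)   # TP / (TP + FP)
recall = recall_score(y_true, y_hat)         # TP / (TP + FN)
f1 = f1_score(y_true, y_hat)                 # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

In this toy case accuracy is 0.75 while precision, recall and F1 for the positive class are each 2/3, which shows how the overall score can look better than the performance on the class of interest.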

The Random Forest algorithm gives higher accuracy than the others, so we choose it for prediction.

Feature Importance For Random Forest :

# Retrieve the fitted Random Forest from the models dictionary
rf = next(m for m, name in models.items() if name == 'Random Forest')

plt.figure(figsize=(9,7))
feature_imp1 = rf.feature_importances_
sns.barplot(x=feature_imp1, y=X.columns)

# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features For Random Forest ",{'fontsize':25})
plt.show();
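
Besides plotting, the importances can be ranked numerically to decide which features to keep. The sketch below uses synthetic data in which one column (`signal`) determines the label and the other (`noise`) does not. All names and values are made up; this is an illustration, not the notebook's analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(11)

# Toy data (made up): 'signal' determines the label, 'noise' does not
toy_X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
toy_y = (toy_X["signal"] > 0).astype(int)

toy_rf = RandomForestClassifier(n_estimators=50, random_state=11)
toy_rf.fit(toy_X, toy_y)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(toy_rf.feature_importances_, index=toy_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances of a fitted forest sum to 1, so each value can be read as that feature's share of the total.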

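# Drop some features
drop_cols = ['sick', 'pregnant', 'I131 treatment',
             'lithium', 'goitre', 'tumor']
x_smote.drop(drop_cols, axis=1, inplace=True)
X_test.drop(drop_cols, axis=1, inplace=True)

# Refit a Random Forest on the reduced feature set.
# Assumption: new_rf is a Random Forest trained on the remaining features;
# the notebook may configure it differently.
new_rf = RandomForestClassifier()
new_rf.fit(x_smote, y_smote)

Save Model :

with open('thyroid.pkl','wb') as f:
    pickle.dump(new_rf,f)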
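
A saved model can be restored with `pickle.load` and used for prediction. The following is a minimal round-trip sketch on a made-up model (`X_demo`, `y_demo` and `model` are hypothetical stand-ins for the saved Random Forest). It serializes to bytes rather than to a file, but a file opened with 'wb' and 'rb' works the same way.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy model (made up); stands in for the saved Random Forest
X_demo = np.array([[0.0], [0.1], [0.9], [1.0]])
y_demo = np.array([0, 0, 1, 1])
model = RandomForestClassifier(n_estimators=10, random_state=11).fit(X_demo, y_demo)

# Serialize to bytes and restore
restored = pickle.loads(pickle.dumps(model))

# The restored model should give the same predictions as the original
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

At prediction time, new inputs need the same encoding and scaling as the training data, so the fitted encoders and scaler would have to be saved alongside the model.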

Thank you for reading.

Other details and code are available in the notebook.
