The complete dataset consists of two CSV files: one for training the model and one for testing it.
Each CSV file has 133 columns. 132 of these columns are symptoms that a person experiences, and the last column is the prognosis.
These symptoms map to 41 diseases, which are the classes a given set of symptoms can be assigned to.
The model is trained on the training data and evaluated on the testing data. The symptom columns are:
'itching', 'skin_rash', 'nodal_skin_eruptions', 'continuous_sneezing', 'shivering', 'chills', 'joint_pain', 'stomach_pain', 'acidity', 'ulcers_on_tongue', 'muscle_wasting', 'vomiting', 'burning_micturition', 'spotting_ urination' ,'fatigue',
'weight_gain', 'anxiety' ,'cold_hands_and_feets' ,'mood_swings', 'weight_loss' ,'restlessness', 'lethargy',
'patches_in_throat', 'irregular_sugar_level', 'cough', 'high_fever', 'sunken_eyes','breathlessness', 'sweating',
'dehydration' ,'indigestion', 'headache', 'yellowish_skin', 'dark_urine' ,'nausea' ,'loss_of_appetite',
'pain_behind_the_eyes', 'back_pain','constipation', 'abdominal_pain', 'diarrhoea', 'mild_fever', 'yellow_urine',
'yellowing_of_eyes', 'acute_liver_failure' ,'fluid_overload', 'swelling_of_stomach', 'swelled_lymph_nodes',
'malaise', 'blurred_and_distorted_vision', 'phlegm' ,'throat_irritation', 'redness_of_eyes', 'sinus_pressure',
'runny_nose', 'congestion', 'chest_pain', 'weakness_in_limbs', 'fast_heart_rate', 'pain_during_bowel_movements',
'pain_in_anal_region', 'bloody_stool', 'irritation_in_anus', 'neck_pain', 'dizziness', 'cramps', 'bruising',
'obesity', 'swollen_legs', 'swollen_blood_vessels', 'puffy_face_and_eyes', 'enlarged_thyroid', 'brittle_nails',
'swollen_extremeties', 'excessive_hunger', 'extra_marital_contacts' ,'drying_and_tingling_lips', 'slurred_speech',
'knee_pain', 'hip_joint_pain', 'muscle_weakness' ,'stiff_neck', 'swelling_joints', 'movement_stiffness', 'spinning_movements',
'loss_of_balance', 'unsteadiness', 'weakness_of_one_body_side', 'loss_of_smell', 'bladder_discomfort',
'foul_smell_of urine', 'continuous_feel_of_urine', 'passage_of_gases', 'internal_itching', 'toxic_look_(typhos)',
'depression', 'irritability', 'muscle_pain', 'altered_sensorium', 'red_spots_over_body', 'belly_pain',
'abnormal_menstruation', 'dischromic _patches', 'watering_from_eyes', 'increased_appetite', 'polyuria', 'family_history',
'mucoid_sputum', 'rusty_sputum', 'lack_of_concentration', 'visual_disturbances', 'receiving_blood_transfusion',
'receiving_unsterile_injections', 'coma', 'stomach_bleeding', 'distention_of_abdomen', 'history_of_alcohol_consumption',
'fluid_overload.1', 'blood_in_sputum', 'prominent_veins_on_calf', 'palpitations', 'painful_walking', 'pus_filled_pimples', 'blackheads', 'scurring', 'skin_peeling',
'silver_like_dusting', 'small_dents_in_nails', 'inflammatory_nails', 'blister', 'red_sore_around_nose', 'yellow_crust_ooze'
The prognosis column takes one of 41 disease labels:
'(vertigo) Paroymsal Positional Vertigo', 'AIDS', 'Acne', 'Alcoholic hepatitis', 'Allergy', 'Arthritis', 'Bronchial Asthma', 'Cervical spondylosis', 'Chicken pox', 'Chronic cholestasis', 'Common Cold', 'Dengue', 'Diabetes ', 'Dimorphic hemmorhoids(piles)', 'Drug Reaction', 'Fungal infection', 'GERD', 'Gastroenteritis', 'Heart attack', 'Hepatitis B', 'Hepatitis C', 'Hepatitis D', 'Hepatitis E', 'Hypertension ', 'Hyperthyroidism', 'Hypoglycemia', 'Hypothyroidism', 'Impetigo', 'Jaundice', 'Malaria', 'Migraine', 'Osteoarthristis', 'Paralysis (brain hemorrhage)', 'Peptic ulcer diseae', 'Pneumonia', 'Psoriasis', 'Tuberculosis', 'Typhoid', 'Urinary tract infection', 'Varicose veins','hepatitis A'
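The layout described above (symptom columns followed by a final prognosis column) can be checked on a file's header before any modelling. The following is a minimal sketch using only Python's standard csv module on a tiny in-memory sample; the sample rows and the helper name `split_header` are hypothetical and merely stand in for the real files.

```python
import csv
import io

# Hypothetical three-symptom sample standing in for the real 132-symptom file.
SAMPLE = """itching,skin_rash,cough,prognosis
1,1,0,Fungal infection
0,0,1,Common Cold
1,0,0,Fungal infection
"""

def split_header(header, target="prognosis"):
    """Return (symptom_columns, target); fail loudly if the target is not the last column."""
    if header[-1] != target:
        raise ValueError(f"expected last column {target!r}, got {header[-1]!r}")
    return header[:-1], header[-1]

rows = list(csv.reader(io.StringIO(SAMPLE)))
symptoms, target = split_header(rows[0])

# Distinct labels found in the last column of the data rows.
classes = sorted({row[-1] for row in rows[1:]})

print(len(symptoms), "symptom columns; target:", target)
print(len(classes), "distinct classes:", classes)
```

On the real training file, the same two counts would be expected to come out as 132 and 41 if the description above holds.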
# importing the library
import pandas as pd
Reading Training Dataset
# Reading Training Dataset
df = pd.read_csv("training_data.csv")
# Checking shape of Dataset
df.shape
(4920, 134)
# Storing the prognosis (target) column in y_train, a pandas Series
y_train = df["prognosis"]
y_train.head(50)
# Deleting the prediction column from df, as it is already stored in y_train
del df["prognosis"]
# Deleting the "Unnamed: 133" column, as it is of no use to us
del df["Unnamed: 133"]
df.head()
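The source does not say where the `Unnamed: 133` column comes from. One common cause, offered here only as an assumption, is a trailing comma at the end of each CSV row, which parsers read as one extra empty column; pandas labels such header-less columns `Unnamed: <position>`. A small sketch with the standard csv module and made-up rows shows the effect:

```python
import csv
import io

# Made-up rows; note the trailing comma on every line.
RAW = "itching,cough,prognosis,\n1,0,Allergy,\n0,1,Common Cold,\n"

rows = list(csv.reader(io.StringIO(RAW)))
header = rows[0]
print(header)  # the last header field is an empty string

# Keep only the columns that actually have a name.
keep = [i for i, name in enumerate(header) if name.strip()]
cleaned = [[row[i] for i in keep] for row in rows]
print(cleaned[0])
```

If that is what happened here, it would also explain why the shape reports 134 columns rather than the 133 described at the top.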
# Checking the NULL Values
df.isnull().sum()
# Storing training dataset in X
X = df
# Storing prediction column in Y
Y = y_train
Splitting the Data Set
# importing train_test_split from sklearn to split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, stratify=Y, random_state=2)
# Checking the shapes of the full, train, and test feature sets
print(X.shape, X_train.shape, X_test.shape)
(4920, 132) (4428, 132) (492, 132)
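The `stratify=Y` argument asks for a split in which each disease keeps roughly the same share of rows in both parts. The sketch below illustrates the idea in plain Python on toy labels; it is not scikit-learn's actual algorithm, and the function name `stratified_split` and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=2):
    """Return (train_idx, test_idx) with roughly test_size of each class in the test set."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 3 classes with 20 rows each (hypothetical, not the real data).
labels = ["A"] * 20 + ["B"] * 20 + ["C"] * 20
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print({c: sum(labels[i] == c for i in test_idx) for c in "ABC"})
```

With a 10% test share, every toy class contributes the same number of rows to the test set, which is the property the stratified split above relies on.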
Making Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 200, criterion = 'entropy', random_state = 0)
classifier.fit(X_train,Y_train)
prediction_rfc = classifier.predict(X_test)
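The `criterion = 'entropy'` setting means each tree chooses its splits by how much they reduce the Shannon entropy of the class labels. As a rough illustration, the sketch below computes that reduction for a single split in plain Python. It assumes the symptom columns hold 0/1 flags, which the source does not state, and the toy data and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction obtained by splitting the labels on a 0/1 feature."""
    total = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for x, y in zip(feature, labels) if x == value]
        if subset:
            gain -= len(subset) / total * entropy(subset)
    return gain

# Hypothetical toy data: one binary symptom and two diseases.
itching = [1, 1, 1, 0, 0, 0]
disease = ["Fungal infection"] * 3 + ["Common Cold"] * 3

print(entropy(disease))
print(information_gain(itching, disease))
```

In this toy case the symptom separates the two diseases perfectly, so the gain equals the full entropy of the labels.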
import sklearn.metrics as metrics
print('Confusion Matrix: Random Forest Classifier')
print(metrics.confusion_matrix(Y_test, prediction_rfc))
print('\nClassification Report:')
print(metrics.classification_report(Y_test, prediction_rfc))
print('Accuracy:', metrics.accuracy_score(Y_test, prediction_rfc))
Accuracy: 1.0
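To make the connection between the confusion matrix and the reported score concrete, here is a plain-Python sketch that counts (true, predicted) pairs and derives per-class recall from them. The labels are made up and the helper names are hypothetical; it is not how scikit-learn implements its metrics.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    counts = confusion_counts(y_true, y_pred)
    support = Counter(y_true)
    return {label: counts[(label, label)] / n for label, n in support.items()}

# Made-up labels for illustration only.
y_true = ["Acne", "Acne", "Malaria", "Malaria", "Typhoid"]
y_pred = ["Acne", "Acne", "Malaria", "Typhoid", "Typhoid"]

print(confusion_counts(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```

An accuracy of 1.0 corresponds to the case where every pair has matching labels, so all off-diagonal entries of the confusion matrix are zero.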
# Reading the Testing Dataset
dft = pd.read_csv("test_data.csv")
# Viewing Dataset
dft.head()
# Storing prediction column of testing dataset in y_test
y_test = dft["prognosis"]
# Checking the first few values of y_test
y_test.head(3)
# Deleting the prognosis column from testing dataset
del dft["prognosis"]
dft.head(3)
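The fitted model expects the test features to have the same columns as the training features. Whether test_data.csv carries a stray unnamed column like the training file did is not stated, so a quick comparison before predicting may be worthwhile. A minimal sketch on hypothetical column lists (with pandas these would be `list(df.columns)` and `list(dft.columns)`):

```python
def compare_columns(train_columns, test_columns):
    """Report columns missing from, or unexpected in, the test set, and whether the order matches."""
    missing = [c for c in train_columns if c not in test_columns]
    extra = [c for c in test_columns if c not in train_columns]
    same_order = list(train_columns) == list(test_columns)
    return missing, extra, same_order

# Hypothetical column lists for illustration only.
train_columns = ["itching", "skin_rash", "cough"]
test_columns = ["itching", "cough", "skin_rash", "Unnamed: 3"]

missing, extra, same_order = compare_columns(train_columns, test_columns)
print("missing:", missing)
print("extra:", extra)
print("same order:", same_order)
```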
# Predicting on the testing dataset with the classifier fitted above
prediction = classifier.predict(dft)
# Printing values of prediction
print(prediction)
# Checking the accuracy of the predictions (true labels first, then predictions)
print("Accuracy:", metrics.accuracy_score(y_test, prediction))
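Accuracy is simply the share of rows whose predicted label equals the true label. The sketch below computes it by hand and also lists the rows that were predicted wrongly, which can be more informative than the single number. The labels are made up and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mismatches(y_true, y_pred):
    """Return (row index, true label, predicted label) for every wrong prediction."""
    return [(i, t, p) for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

# Made-up labels for illustration only.
y_true = ["AIDS", "Acne", "GERD", "Dengue"]
y_pred = ["AIDS", "Acne", "GERD", "Malaria"]

print(accuracy(y_true, y_pred))
print(mismatches(y_true, y_pred))
```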