Problem statement:
Heart disease is a fatal human disease whose incidence is rapidly increasing in both developed and developing countries, and it is a leading cause of death. In this use case, a machine learning model predicts whether or not a person will be affected by cardiovascular disease. This gives a person insight into their health condition, so that they can take precautions.

Usage domains and advantages:
● The medical industry can use this model to identify potential patients with cardiovascular disease.
● Individuals can use this model to check their condition.
● The model helps a user take precautions regarding their cardiovascular status.
● Early forecasting of cardiovascular disease with this model results in a reduction of risk.
● Cardiovascular disease detection using machine learning belongs to the class of intelligent computational predictive systems, which have proven very effective in many areas of the medical industry.

Model solution:
The problem is to detect whether or not a person has cardiovascular disease, which is a binary classification task. A machine learning classification algorithm is therefore considered the most suitable approach; the model is discussed further below.

Dataset/Data Source:
The dataset is from the Kaggle healthcare and medical datasets:
https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
The dataset has individual variables that describe a person's details; these variables are used to identify the person's condition.
import pandas as pd

Cardio = pd.read_csv('cardio_train.csv', sep=';')  # values are semicolon-separated
Features/Variables:
● Age | Objective Feature | age | int (days)
● Height | Objective Feature | height | int (cm)
● Weight | Objective Feature | weight | float (kg)
● Gender | Objective Feature | gender | categorical code
● Systolic blood pressure | Examination Feature | ap_hi | int
● Diastolic blood pressure | Examination Feature | ap_lo | int
● Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal
● Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal
● Smoking | Subjective Feature | smoke | binary
● Alcohol intake | Subjective Feature | alco | binary
● Physical activity | Subjective Feature | active | binary
● Presence or absence of cardiovascular disease | Target Variable | cardio | binary

The third entry in each row is the column name/id in the dataset. The variable cardio is used as the target variable, determining the person's condition:
0 represents a person who does not have cardiovascular disease
1 represents a person who has cardiovascular disease

0    35021
1    34979
Name: cardio, dtype: int64

Dataset shape: 70000 x 12 - 70000 rows, 12 columns.

Preprocessing:
● Dropped 24 duplicate rows.
● Converted age from days to years.
● Renamed columns: ap_hi is renamed Systolic_bp and ap_lo is renamed Diastolic_bp, which are more informative than the old column names.
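The preprocessing steps above can be sketched as follows. Since the Kaggle file is not reproduced here, a tiny hand-made frame with the same column names stands in for cardio_train.csv; the duplicated row and the age values are illustrative only.

```python
import pandas as pd

# Stand-in for the Kaggle frame: first two rows are duplicates, age is in days.
cardio = pd.DataFrame({
    'age':    [18250, 18250, 21900],
    'ap_hi':  [120, 120, 140],
    'ap_lo':  [80, 80, 90],
    'cardio': [0, 0, 1],
})

cardio = cardio.drop_duplicates()                    # the full dataset has 24 duplicates
cardio['age'] = (cardio['age'] / 365).astype(int)    # days -> years
cardio = cardio.rename(columns={'ap_hi': 'Systolic_bp',
                                'ap_lo': 'Diastolic_bp'})
print(cardio)
```

On the real data the same three calls are applied to the frame loaded with read_csv.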
Cardio.rename(columns={'ap_hi': 'Systolic_bp', 'ap_lo': 'Diastolic_bp'}, inplace=True)
Split the dataframe into Independent variable X and target variable y
X=Cardio.drop(['cardio'],axis=1)
y=Cardio['cardio'].copy()
Model Used:
The problem is a classification task: determining whether or not a person has cardiovascular disease.
Decision Tree Classifier: decision trees are supervised machine learning models in which the data is repeatedly split according to the value of a specific feature. The tree representation is used to predict the value of the target variable: each internal node tests an attribute, and each leaf node carries a class label.
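To make the "internal nodes test attributes, leaves carry class labels" idea concrete, here is a toy illustration (not the Kaggle data): a tree trained on a single synthetic blood-pressure feature, with the learned rules printed as text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: "high risk" is defined by a simple systolic-pressure threshold,
# so the split the tree learns is easy to read off.
rng = np.random.default_rng(0)
systolic = rng.integers(100, 180, size=200).reshape(-1, 1)
risk = (systolic.ravel() >= 140).astype(int)

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(systolic, risk)

# Internal nodes test Systolic_bp against a threshold; leaves hold class 0 or 1.
print(export_text(tree, feature_names=['Systolic_bp']))
```

Because the toy labels follow a single threshold, the printed tree recovers a split near 140.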
from sklearn.tree import DecisionTreeClassifier
D_tree=DecisionTreeClassifier(max_depth=7)
D_tree.fit(X,y)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=7, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort='deprecated', random_state=None, splitter='best')
I chose a Decision Tree with the following hyperparameters in mind:
max_depth - the maximum depth of the tree (the longest path from the root node to a leaf).
criterion - the function that measures the quality of a split; supported values are 'gini' (Gini impurity, the default) and 'entropy' (information gain).
min_samples_split - the minimum number of samples required to split an internal node; it takes an int value.
presort - was used to speed up finding the best split; it is deprecated in recent scikit-learn versions.
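A small sketch of how these hyperparameters are passed, using synthetic stand-in data (make_classification with 11 features mirrors the shape of the cardio frame, but the numbers are not the real dataset's):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)

shallow = DecisionTreeClassifier(max_depth=2)          # short root-to-leaf paths
deep = DecisionTreeClassifier(max_depth=None)          # grow until leaves are pure
entropy = DecisionTreeClassifier(criterion='entropy',  # information-gain criterion
                                 min_samples_split=10) # need >= 10 samples to split

for model in (shallow, deep, entropy):
    model.fit(X_demo, y_demo)

# An unrestricted tree can memorise the training set; a depth cap regularises it.
print(shallow.score(X_demo, y_demo), deep.score(X_demo, y_demo))
```

This is why the model above caps max_depth at 7 rather than letting the tree grow fully.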
The cross_val_score function from the scikit-learn library uses the stratified k-fold cross-validation technique by default when the estimator is a classifier.
from sklearn.model_selection import cross_val_score

scores1 = cross_val_score(D_tree, X, y, cv=1000)
In the above code, D_tree is the model (the decision tree classifier),
X is the dataframe of independent variables used for classification,
y holds the target variable, and
cv is the cross-validation strategy (the number of folds); here the 1000-fold stratified k-fold cross-validation technique is used.
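The array returned by cross_val_score is usually summarised by its mean and standard deviation. A sketch of that step, with synthetic stand-in data and cv=10 in place of the cardio frame and the 1000 folds used above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)
d_tree = DecisionTreeClassifier(max_depth=7)

# One accuracy per fold; StratifiedKFold is used automatically for classifiers.
scores = cross_val_score(d_tree, X_demo, y_demo, cv=10)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```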
Results:
Below are the results yielded by the model.
Model                    | Accuracy | Precision | Recall | F1-Score
Decision Tree Classifier | 88.57    | 74.18     | 73.64  | 73.52
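One way such a table can be produced from a cross-validated model (the report does not show this step, so this is a hedged sketch): collect out-of-fold predictions with cross_val_predict, then compute the four metrics as percentages. Synthetic data again stands in for the cardio frame.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)
d_tree = DecisionTreeClassifier(max_depth=7)

# Each sample is predicted by a model that never saw it during fitting.
y_pred = cross_val_predict(d_tree, X_demo, y_demo, cv=5)

for name, metric in [('Accuracy', accuracy_score), ('Precision', precision_score),
                     ('Recall', recall_score), ('F1-Score', f1_score)]:
    print(f"{name}: {100 * metric(y_demo, y_pred):.2f}")
```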