Reduction of child mortality is reflected in several of the United Nations' Sustainable Development Goals and is a key indicator of human progress.
The UN expects that by 2030, countries end preventable deaths of newborns and children under 5 years of age, with all countries aiming to reduce under‑5 mortality to at least as low as 25 per 1,000 live births.
Parallel to the notion of child mortality is maternal mortality, which accounted for 295,000 deaths during and following pregnancy and childbirth in 2017. The vast majority of these deaths (94%) occurred in low-resource settings, and most could have been prevented.
In light of this, cardiotocograms (CTGs) are a simple and affordable way to assess fetal health, allowing healthcare professionals to take action to prevent child and maternal mortality. The equipment works by sending ultrasound pulses and reading their response, thus shedding light on fetal heart rate (FHR), fetal movements, uterine contractions and more.
Dataset Information
2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians, and a consensus classification label was assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C, ...) and to a fetal state (N, S, P). Therefore the dataset can be used for either 10-class or 3-class experiments.
- FHR baseline (beats per minute);
- number of accelerations per second;
- number of fetal movements per second;
- number of uterine contractions per second;
- number of light decelerations per second;
- number of severe decelerations per second;
- number of prolonged decelerations per second;
- percentage of time with abnormal short term variability;
- mean value of short term variability;
- percentage of time with abnormal long term variability;
- mean value of long term variability;
- width of FHR histogram;
- minimum of FHR histogram;
- maximum of FHR histogram;
- number of histogram peaks;
- number of histogram zeros;
- histogram mode;
- histogram mean;
- histogram median;
- histogram variance; and
- histogram tendency.
This notebook uses the fetal state as the target variable. As mentioned above, the fetal state falls into one of 3 classes (N: Normal, S: Suspect, P: Pathologic), encoded in the dataset as:
- 1: Normal;
- 2: Suspect; and
- 3: Pathologic.
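The numeric coding above can be turned into readable class names for reports and plots. The following is a minimal sketch: the mapping name `FETAL_HEALTH_LABELS` and the toy series are illustrative assumptions, not part of the original notebook, and the cast to `int` only matters if the target column is stored as floats.

```python
import pandas as pd

# Hypothetical helper mapping; the names follow the N/S/P coding described above.
FETAL_HEALTH_LABELS = {1: "Normal", 2: "Suspect", 3: "Pathologic"}

# Toy target column standing in for df["fetal_health"] (assumed to hold 1.0, 2.0, 3.0).
toy_target = pd.Series([1.0, 2.0, 3.0, 1.0], name="fetal_health")

# Cast to int before mapping so the lookup does not depend on float keys.
toy_named = toy_target.astype(int).map(FETAL_HEALTH_LABELS)
print(toy_named.tolist())
```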
The dataset is available here:
https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification
1. IMPORT NECESSARY PYTHON LIBRARIES
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import explained_variance_score, r2_score, classification_report
from sklearn.preprocessing import StandardScaler # Normalize the data
from sklearn.model_selection import train_test_split # Split the data
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from time import time
# Measure the efficiency of the model
from sklearn.metrics import mean_absolute_error
2. READ DATA
Now we can load the data with the Pandas function read_csv(), since the data is in .csv format. The function sample(5) returns 5 random records from the dataset; here it is used just to check that the data was read correctly.
df = pd.read_csv("fetal_health.csv")
print(df.sample(5))
As we can see, there are 22 columns (21 columns are input features and the last column is the prediction target). The dataset has 2126 rows: 2126 measurements extracted from cardiotocograms and classified by expert obstetricians into 3 categories (Normal, Suspect and Pathologic).
After that, we can print all our dataset columns for future use:
cols = df.columns
print(cols)
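The row and column counts quoted above can be checked programmatically, together with two basic quality checks (missing values and duplicated rows). This is a small sketch on a toy frame: `toy_df` and its three columns are stand-ins chosen for illustration, and the same calls would apply to `df`.

```python
import pandas as pd

# Toy frame standing in for the real df; the columns are a small assumed subset.
toy_df = pd.DataFrame({
    "baseline value": [120.0, 132.0, 133.0],
    "accelerations": [0.0, 0.006, 0.003],
    "fetal_health": [2.0, 1.0, 1.0],
})

n_rows, n_cols = toy_df.shape                   # (rows, columns)
n_missing = int(toy_df.isna().sum().sum())      # total number of missing cells
n_duplicates = int(toy_df.duplicated().sum())   # number of fully duplicated rows
print(n_rows, n_cols, n_missing, n_duplicates)
```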
3. ANALYZE DATA
One of the most important things is to understand the data you work with. Here we use some well-known methods to make the data easier to understand.
For all dataset columns we will find some statistical information, such as mean, median, mode (a.k.a. the Three M's of statistics), standard deviation and correlation, using Pandas functions.
3.1. mean
In the code below we use the Pandas mean() function with the axis=0 parameter. The axis parameter selects the axis to reduce along. With axis=0 the statistic is computed from top to bottom, across all rows, so you get one value per column (the result is indexed by the column names). With axis=1 the statistic is computed from left to right, across all columns, so you get one value per row.
mean = df.mean(axis=0)
print(mean)
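The difference between the two axis values is easiest to see on a tiny frame. The sketch below uses a made-up two-column DataFrame (`toy`), not the fetal health data.

```python
import pandas as pd

# Toy frame: two rows, two columns.
toy = pd.DataFrame({"a": [1.0, 3.0], "b": [10.0, 30.0]})

# axis=0: reduce over rows, one mean per column (indexed by column name).
col_means = toy.mean(axis=0)

# axis=1: reduce over columns, one mean per row (indexed by row label).
row_means = toy.mean(axis=1)

print(col_means)
print(row_means)
```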
3.2. median
Here we use the Pandas median() function.
median = df.median(axis=0)
print(median)
3.3. mode
Here we use the Pandas mode() function.
mode = df.mode(axis=0)
print(mode)
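Standard deviation is listed among the statistics of interest but is not computed in the cells above. A minimal sketch of how it could be obtained with the Pandas std() function follows; the toy frame is an assumption for illustration, and the same call would work on `df`. Note that Pandas uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Toy frame standing in for df.
toy = pd.DataFrame({
    "baseline value": [120.0, 130.0, 140.0],
    "fetal_health": [1.0, 1.0, 2.0],
})

# Sample standard deviation per column (ddof=1 is the Pandas default).
std = toy.std(axis=0)
print(std)
```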
3.4. correlation
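The Pandas corr() function computes the pairwise correlation between all columns of a DataFrame. Below, the result is shown as a heatmap, followed by a bar chart of the absolute correlation of each feature with the target.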
correlation = df.corr().round(2)
plt.figure(figsize=(14, 7))
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.show()
sns.set_style('white')
sns.set_palette('coolwarm')
plt.figure(figsize=(13, 6))
plt.title('Distribution of correlation of features')
abs(correlation['fetal_health']).sort_values()[:-1].plot.barh()
plt.show()
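When only the correlation of each feature with the target is needed, the Pandas corrwith() function offers a shortcut that avoids computing the full correlation matrix. The sketch below is an illustration on invented columns (`feat_up`, `feat_down`, `feat_noise`), not a result from the fetal health data.

```python
import pandas as pd

# Toy data: one feature rising with the target, one falling, one weakly related.
toy = pd.DataFrame({
    "feat_up": [1.0, 2.0, 3.0, 4.0],
    "feat_down": [4.0, 3.0, 2.0, 1.0],
    "feat_noise": [1.0, -1.0, -1.0, 1.0],
    "fetal_health": [1.0, 1.0, 2.0, 3.0],
})

features = toy.drop(columns=["fetal_health"])

# Pearson correlation of every feature column with the target column.
target_corr = features.corrwith(toy["fetal_health"])

# Rank features by the strength (absolute value) of that correlation.
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```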
4. VISUALIZE DATA
To understand the data more easily we can build some plots. In this example, we check how many records there are in each class. For this task we can use the plotting methods built into the Pandas library.
#STEP-1
plt.figure(figsize=(18,5))
plt.title('FETAL HEALTH CLASSES')
plt.xlabel('Fetal health class')
plt.ylabel('count')
#STEP-2
value_counts = df["fetal_health"].value_counts()
print(value_counts)
#STEP-3
value_counts.plot.bar()
#STEP-4
plt.grid()
plt.show()
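Class counts are often easier to compare as proportions, which makes any class imbalance explicit. The following sketch uses a made-up target series with a 7:2:1 split purely for illustration; it does not reproduce the real class distribution.

```python
import pandas as pd

# Toy target with an invented 7:2:1 class split (not the real distribution).
toy_target = pd.Series([1.0] * 7 + [2.0] * 2 + [3.0] * 1, name="fetal_health")

# normalize=True returns fractions instead of raw counts.
proportions = toy_target.value_counts(normalize=True).sort_index()
print(proportions)
```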
fig, ax = plt.subplots(figsize=(14, 6))
sns.kdeplot(x=df["baseline value"], alpha=0.5, fill=True, ax=ax, hue=df['fetal_health'], palette="coolwarm")  # fill replaces the deprecated shade argument
plt.title('Average Heart Rate Distribution', fontsize=18)
ax.set_xlabel("FHR")
ax.set_ylabel("Frequency")
ax.legend(['Pathological', 'Suspect', 'Normal'])
plt.show()
fig, ax = plt.subplots(figsize=(14, 6))
sns.kdeplot(x=df["accelerations"], alpha=0.5, fill=True, ax=ax, hue=df['fetal_health'], palette="coolwarm")  # fill replaces the deprecated shade argument
plt.title('The Relationship of Acceleration With the Health of the Fetus', fontsize=18)
ax.set_xlabel("Accelerations")
ax.set_ylabel("Frequency")
ax.legend(['Pathological', 'Suspect', 'Normal'])
plt.show()
fig, ax = plt.subplots(figsize=(14, 6))
sns.kdeplot(x=df["uterine_contractions"], alpha=0.5, fill=True, ax=ax, hue=df['fetal_health'], palette="coolwarm")  # fill replaces the deprecated shade argument
plt.title('The Relationship of Uterine Contractions With the Health of the Fetus', fontsize=18)
ax.set_xlabel("Uterine Contractions")
ax.set_ylabel("Frequency")
ax.legend(['Pathological', 'Suspect', 'Normal'])
plt.show()
5. SPLIT DATASET INTO TRAIN AND TEST DATA
Here we divide the dataset into two parts, one for model training and one for validation, using an 80%:20% ratio. You can also try other ratios, such as 60%:40%, 70%:30% or 90%:10%. The features are then standardized with a scaler fitted on the training data only.
# Select Features
X = df.drop(columns=['fetal_health'])
# Select Target
y = df['fetal_health']
# Set Training and Testing Data
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=.2, random_state=44)
print('Shape of training feature:', X_train.shape)
print('Shape of testing feature:', X_test.shape)
print('Shape of training label:', y_train.shape)
print('Shape of testing label:', y_test.shape)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Save the fitted scaler; the with-block closes the file afterwards
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
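Because the three classes are not equally frequent, a plain random split can leave the rarer classes slightly over- or under-represented in the test set. One option, shown here only as a sketch and not as the method used in this notebook, is to pass `stratify` to train_test_split so that each split keeps the class proportions. The toy arrays and the 70/20/10 class split are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 3 features, invented 70/20/10 class split.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 70 + [2] * 20 + [3] * 10)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=44, stratify=y_toy
)

# Count how many samples of each class ended up in the test set.
labels, counts = np.unique(y_te, return_counts=True)
test_counts = {int(k): int(v) for k, v in zip(labels, counts)}
print(test_counts)
```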
6. COMPARE DIFFERENT MODELS TO FIND THE BEST ONE
def evaluate_model(model, x_test, y_test):
    from sklearn import metrics
    # Predict Test Data
    y_pred = model.predict(x_test)
    # Calculate accuracy and macro-averaged precision, recall and f1-score
    acc = metrics.accuracy_score(y_test, y_pred)
    prec = metrics.precision_score(y_test, y_pred, average='macro')
    rec = metrics.recall_score(y_test, y_pred, average='macro')
    f1 = metrics.f1_score(y_test, y_pred, average='macro')
    # Compute the confusion matrix
    cm = metrics.confusion_matrix(y_test, y_pred)
    return {'acc': acc, 'prec': prec, 'rec': rec, 'f1': f1, 'cm': cm}
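The `average='macro'` setting used in evaluate_model gives every class the same weight, regardless of how many samples it has, so a missed rare class lowers the score noticeably. The sketch below contrasts macro and weighted recall on a tiny hand-made example; the label lists are invented for illustration.

```python
from sklearn.metrics import recall_score

# Invented labels: class 1 is frequent, classes 2 and 3 are rare.
y_true = [1, 1, 1, 1, 2, 3]
y_pred = [1, 1, 1, 1, 1, 3]  # the single class-2 sample is missed

# Recall per class, in label order 1, 2, 3.
per_class = recall_score(y_true, y_pred, average=None)

# Macro: unweighted mean of the per-class recalls.
macro_recall = recall_score(y_true, y_pred, average="macro")

# Weighted: per-class recalls weighted by class frequency.
weighted_recall = recall_score(y_true, y_pred, average="weighted")

print(per_class, macro_recall, weighted_recall)
```

Here the macro recall is 2/3 while the weighted recall is 5/6, because the missed class counts as a full third of the macro score.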
# Candidate classifiers to compare
classifiers = [
    LogisticRegression(),
    LinearDiscriminantAnalysis(),
    KNeighborsClassifier(),
    GaussianNB(),
    DecisionTreeClassifier(),
    SVC(),
]
for model in classifiers:
    # Time the training step
    start = time()
    model.fit(X_train, y_train)
    train_time = time() - start
    # Time the prediction step
    start = time()
    y_pred = model.predict(X_test)
    predict_time = time() - start
    print(model)
    print("\tTraining time: %0.3fs" % train_time)
    print("\tPrediction time: %0.3fs" % predict_time)
    # Mean accuracy on the test set
    print("\tAccuracy:", model.score(X_test, y_test))
    # The metrics below are regression metrics; here they treat the labels 1-3 as ordinal values
    print("\tExplained variance:", explained_variance_score(y_test, y_pred))
    print("\tMean absolute error:", mean_absolute_error(y_test, y_pred))
    print("\tR2 score:", r2_score(y_test, y_pred))
    print()
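The comparison above relies on a single train/test split, so the ranking of the models can change with the random seed. A possible complement, sketched here on synthetic data and not taken from the original notebook, is k-fold cross-validation with the scaler placed inside a pipeline so that it is refitted on each training fold. The synthetic dataset, the choice of two models and the `f1_macro` scoring are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the fetal health features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_results = {}
for name, clf in {"SVC": SVC(), "KNN": KNeighborsClassifier()}.items():
    # Scaling inside the pipeline avoids leaking test-fold statistics.
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring="f1_macro")
    cv_results[name] = scores.mean()

print(cv_results)
```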
# Fit an SVC with default settings and evaluate it on the test set
svc = SVC()
svc.fit(X_train, y_train)
svc_evaluate = evaluate_model(svc, X_test, y_test)
container = pd.DataFrame(pd.Series(
    {'Accuracy': svc_evaluate['acc'], 'Precision': svc_evaluate['prec'], 'Recall': svc_evaluate['rec'],
     'F1 Score': svc_evaluate['f1']}, name='Result'))
print(container)
# Rows are true classes, columns are predicted classes (label order 1, 2, 3)
class_names = ['Normal', 'Suspect', 'Pathologic']
sns.heatmap(svc_evaluate['cm'], annot=True, fmt='d', cmap='coolwarm', cbar=False, linewidths=3, linecolor='w',
            xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix', fontsize=16)
plt.show()
From the results above we see that the support vector machine performs better than the other models, so we use the support vector machine with some hyperparameter tuning.
from sklearn.model_selection import GridSearchCV
# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
# fitting the model for grid search
grid.fit(X_train, y_train)
# print best parameter after tuning
print(grid.best_params_)
# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)
grid_predictions = grid.predict(X_test)
# print classification report
print(classification_report(y_test, grid_predictions))
# Save the fitted grid search (including the refitted best model)
with open('grid.pkl', 'wb') as f:
    pickle.dump(grid, f)
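Once the scaler and the model have been pickled, they can be loaded back and applied to new measurements, as long as the new data goes through the same scaler before prediction. The sketch below demonstrates the round trip on toy data in a temporary directory; the toy clusters, the file names inside the temporary directory and the plain SVC (instead of the tuned grid search) are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data: two well-separated clusters labelled 1 and 3.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(30, 4)), rng.normal(5, 1, size=(30, 4))])
y_toy = np.array([1] * 30 + [3] * 30)

toy_scaler = StandardScaler().fit(X_toy)
toy_model = SVC().fit(toy_scaler.transform(X_toy), y_toy)

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "scaler.pkl")
    model_path = os.path.join(tmp, "model.pkl")
    # Save both objects.
    with open(scaler_path, "wb") as f:
        pickle.dump(toy_scaler, f)
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    # Load them back, as a separate script would do.
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# A new sample must be scaled with the loaded scaler before prediction.
new_sample = np.full((1, 4), 5.0)
prediction = loaded_model.predict(loaded_scaler.transform(new_sample))
print(prediction)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.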
We finally used a support vector machine, which gives us an accuracy of 95% and an F1 score of 97%.