By Hashwanth Gogineni


Life Expectancy Prediction Using Gradient Boosting Algorithm


Model Overview

Life Expectancy


Life expectancy is an estimate of how many more years a person of a given age can expect to live. Life expectancy at birth is the most common measure. However, life expectancy is a hypothetical figure: it assumes that the age-specific death rates for the year in question will apply to those born in that year for the rest of their lives. In effect, the estimate extrapolates age-specific mortality (death) rates at a certain period over the full lifetime of the population born (or living) at that time. Sex, age, race, and geographic area all have a significant impact on the figure.


As a result, life expectancy is frequently stated for certain groups rather than for the entire population. For example, white girls born in 2003 in the United States have an average life expectancy of 80.4 years.


Local circumstances influence life expectancy. Life expectancy at birth is significantly lower in less developed countries than in more developed ones. Because of high infant mortality rates in some less-developed nations, life expectancy at birth may be lower than life expectancy at age one (very commonly due to infectious disease or lack of access to a clean water supply).


The life table is used to determine life expectancy. A life table contains information on the population's age-specific mortality rates, which necessitates enumeration data on the number of persons and deaths at each age. Those figures are usually taken from the national census and vital statistics data, and they may be used to compute average life expectancy for each of the population's age groups.
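The life-table arithmetic described above can be sketched in a few lines. The mortality schedule below is invented purely for illustration (a flat 1% annual death risk), not real data, and the mid-year-deaths assumption (`ax = 0.5`) is the standard simplification:

```python
# Hypothetical sketch: period life expectancy at birth from age-specific
# mortality rates (qx), using a simplified single-year life table.

def life_expectancy_at_birth(qx):
    """qx[i] = probability of dying between age i and i+1."""
    survivors = 1.0          # l0: radix of the life table (proportion alive)
    person_years = 0.0       # total person-years lived by the cohort
    for q in qx:
        deaths = survivors * q
        # Assume deaths occur on average halfway through the year (ax = 0.5).
        person_years += survivors - deaths / 2
        survivors -= deaths
    return person_years      # e0 = total person-years / l0 (here l0 = 1)

# Toy mortality schedule: flat 1% annual death risk for 100 ages.
qx = [0.01] * 100
e0 = life_expectancy_at_birth(qx)
print(round(e0, 1))          # about 63.1 for this toy schedule
```

Real life tables vary qx sharply with age (high in infancy and old age), which is why enumeration data at each age is needed.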




Dataset


Data on life expectancy and health variables for 193 nations were gathered from the WHO data repository, while economic data were gathered from the United Nations website. Only the most representative critical factors were picked from all categories of health-related variables. Over the previous 30 years there has been great growth in the health sector, resulting in an improvement in human mortality rates, particularly in developing countries. For this study, we therefore used data from 193 nations for the years 2000 to 2015. The individual data files were combined into a single dataset.




Why Life Expectancy Prediction?


The project can help government organizations estimate the life expectancy of a specific region and take the required measures to help people live long, healthy lives.


 


Gradient Boosting


One of the most powerful machine learning algorithms is the 'Gradient Boosting' technique. Errors made by machine learning models can be divided into two categories: bias error and variance error. Gradient boosting is a boosting strategy used chiefly to reduce a model's bias error.


Unlike AdaBoost, the gradient boosting technique does not let us choose an arbitrary base estimator: its base estimator is fixed to a shallow decision tree. As with AdaBoost, we can tune the n_estimators parameter of the gradient boosting method; if it is not specified, n_estimators defaults to 100.
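The default mentioned here is easy to confirm against scikit-learn's API:

```python
from sklearn.ensemble import GradientBoostingRegressor

# The n_estimators parameter defaults to 100 when not specified.
print(GradientBoostingRegressor().n_estimators)  # 100
```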


The gradient boosting approach may be used to forecast both continuous target variables (as a regressor) and categorical target variables (as a classifier). Mean squared error (MSE) is the cost function when used as a regressor, while log loss is the cost function when used as a classifier.
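A minimal sketch of the two modes on synthetic data (the arrays below are illustrative, not the project's dataset; scikit-learn's default losses are used):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)

# Continuous target -> regressor (squared-error loss by default).
y_reg = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
reg = GradientBoostingRegressor(random_state=0).fit(X, y_reg)

# Categorical target -> classifier (log loss by default).
y_clf = (y_reg > y_reg.mean()).astype(int)
clf = GradientBoostingClassifier(random_state=0).fit(X, y_clf)

print(reg.score(X, y_reg), clf.score(X, y_clf))
```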




Understanding Code


First, let us import the required libraries for the project.


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score
import joblib
import pickle


Now, let us load the dataset into the system.


df=pd.read_csv("Life Expectancy Data.csv")


Before we start preprocessing our data let us dive deep into the data using a few visualizations.


plt.figure(figsize= [10,7])
order= df.groupby("Country").Life_expectancy.mean().nlargest(20).index
sns.barplot(y= "Country", x= "Life_expectancy", data= df, order= order, palette= "YlGnBu_r")
plt.title("Top 20 Countries with Highest Life Expectancy", fontsize= 18,fontweight="bold", fontstyle="italic")
plt.xticks( fontsize= 12)
plt.yticks(fontsize=12)
plt.ylabel("Country", fontsize= 14, fontweight="bold")
plt.xlabel("Life Expectancy", fontsize=14, fontweight="bold")




plt.figure(figsize= [10,7])
order=df.groupby("Country").Life_expectancy.mean().sort_values(ascending= True)[:20].index
sns.barplot(y= "Country", x= "Life_expectancy", data= df, order= order, palette= "RdYlBu")
plt.title("Top 20 Countries with Lowest Life Expectancy", fontsize= 18,fontweight="bold", fontstyle="italic")
plt.xticks(fontsize= 12)
plt.yticks(fontsize=12)
plt.ylabel("Country", fontsize= 14, fontweight="bold")
plt.xlabel("Life Expectancy", fontsize=14, fontweight="bold")


I used 'seaborn' (with 'matplotlib.pyplot' for figure styling) to generate bar plots of our data's "Country" feature against life expectancy.

Now, let us explore the data preprocessing part of the project starting with handling missing values.


df.isnull().sum()


As you can see there are missing values in our data.

So let us impute the missing values with each column's "mean".


df.fillna(df.mean(numeric_only=True), inplace=True)
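As a small self-contained illustration of what this step does (a toy frame, not the project data): the mean is computed per numeric column, and on recent pandas versions `numeric_only=True` is needed so string columns such as Country are skipped rather than raising an error.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Country": ["A", "B", "C"],
                    "GDP": [1.0, np.nan, 3.0]})

# numeric_only=True skips string columns (required on pandas >= 2.0);
# the NaN is replaced with GDP's own column mean.
toy = toy.fillna(toy.mean(numeric_only=True))
print(toy["GDP"].tolist())  # [1.0, 2.0, 3.0]
```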


Now, let us encode the data before feeding our data into a model.


Country_encoder=LabelEncoder()
df["Country"]=Country_encoder.fit_transform(df['Country'])
pickle.dump(Country_encoder, open('Country_encoder.pkl','wb'))

Status_encoder=LabelEncoder()
df["Status"]=Status_encoder.fit_transform(df['Status'])
pickle.dump(Status_encoder, open('Status_encoder.pkl','wb'))

As you can see I used the "LabelEncoder" function to encode our data.
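The pickled encoder files matter at inference time: a deployed model must map raw country and status strings to the same integer codes used during training. A minimal, self-contained sketch of that save-and-reload round trip (the country list here is illustrative):

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder

# Fit on an illustrative country list; LabelEncoder assigns codes alphabetically.
enc = LabelEncoder().fit(["Afghanistan", "India", "Zimbabwe"])

path = os.path.join(tempfile.mkdtemp(), "Country_encoder.pkl")
with open(path, "wb") as f:
    pickle.dump(enc, f)

with open(path, "rb") as f:
    reloaded = pickle.load(f)

# The reloaded encoder reproduces the training-time codes exactly.
print(int(reloaded.transform(["India"])[0]))   # 1 (alphabetical order)
print(reloaded.inverse_transform([0])[0])      # Afghanistan
```

Note that `transform` raises a ValueError for a label unseen during training, which is worth handling explicitly in a serving pipeline.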

Let us split the data using the "train_test_split" function into training and testing sets.


Y = df['Life_expectancy']
X = df.drop('Life_expectancy', axis = 1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 9)


Finally, we need to scale our data before feeding our data into a model.


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)

X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns)

I used the "StandardScaler" function to scale the data.
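One step worth adding here (my suggestion, not shown in the original code): the fitted scaler should be persisted alongside the model, otherwise the deployed model would receive inputs scaled with different statistics. A self-contained sketch using joblib, with a toy array standing in for X_train and an illustrative file name:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # stand-in for X_train
scaler = StandardScaler().fit(X_demo)

path = os.path.join(tempfile.mkdtemp(), "scaler.pkl")  # illustrative file name
joblib.dump(scaler, path)

reloaded = joblib.load(path)
# The reloaded scaler applies the training-time mean/std, not new statistics.
assert np.allclose(reloaded.transform(X_demo), scaler.transform(X_demo))
print(reloaded.mean_)  # per-column means learned at fit time
```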

Now, let us dive into the modelling section of the project.


gd= GradientBoostingRegressor()
gd.fit(X_train, y_train)
y_pred = gd.predict(X_test)
gd.score(X_train, y_train)

As you can see, I used the "GradientBoostingRegressor" class to apply the "Gradient Boosting" algorithm to the problem.

Now let us have a look at the model's performance report.


print('r2 score', r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


Considering the report, the model performs well on the data.
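One caveat: `gd.score(X_train, y_train)` earlier is a training-set score, which tends to be optimistic. A cross-validated R² is a more robust check; the sketch below uses synthetic stand-in data rather than the life-expectancy table:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(9)
X = rng.rand(300, 5)
y = X @ rng.rand(5) + rng.normal(scale=0.05, size=300)  # noisy linear signal

# Five-fold cross-validated R²: each fold is scored on data the model never saw.
scores = cross_val_score(GradientBoostingRegressor(random_state=9), X, y,
                         cv=5, scoring="r2")
print(round(scores.mean(), 3))
```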



Thank you for your time.

