Hashwanth Gogineni's other Models Reports


Flight Fare Price Prediction


Model Overview

Air Travel:


Air travel is a form of travel in vehicles such as aeroplanes, jet aircraft, helicopters, hot air balloons, blimps, gliders, hang gliders, parachutes, or anything else that can sustain flight. The use of air travel has increased greatly in recent decades; worldwide, it doubled between the 1980s and the 2000s. Modern air travel is considered much safer than road travel.




Flight Fares:


For a traveller, it is important to know the fare of a trip. Because flight ticket prices change abruptly, checking different websites and comparing deals quickly becomes tedious. A flight fare model can help travellers choose the best time to buy their tickets and understand pricing trends in the airline industry.


Flight ticket prices can be difficult to guess: the price we see today may be different when we check the same flight tomorrow. Travellers often say that flight ticket prices are very unpredictable. As data scientists, however, we will show that, given the right data, ticket prices can be predicted.


Project Implementation: 

Companies in the airline industry can use the project to predict flight ticket prices and to learn more about travellers' behaviour and choices.


Dataset:


The dataset contains flight ticket prices for various airlines and city pairs, for journeys between March and June 2019.


Size of training set: 10,683 records


Features:


Airline: Name of the Airline company.


Date_of_Journey: Actual date of the journey


Source: The city from which the flight departs.


Destination: The city where the journey ends.


Route: The route taken by the aeroplane to reach the destination.


Dep_Time: The time when the journey starts.


Arrival_Time: Time of arrival.


Duration: Total duration of the flight.


Total_Stops: Total stops between the source and destination.


Additional_Info: Additional information about the flight


Price: The price of the ticket




XGBoost:


'XGBoost' stands for eXtreme Gradient Boosting. It has become popular in recent years and dominates applied machine learning and Kaggle competitions on structured (tabular) data, largely because of the algorithm's scalability.


'XGBoost' is an extension of gradient boosting machines (GBM) built on decision trees, designed specifically to improve speed and performance.


 



XGBoost Features:


1) Regularized Learning: The regularization term helps smooth the final learned weights to avoid over-fitting the data.


2) Gradient Tree Boosting Technique: The tree ensemble model cannot be optimized using traditional optimization methods in Euclidean space. Instead, the model is trained in an additive manner: at each step, the tree that most improves the objective is added to the ensemble.


3) Shrinkage and Column Subsampling: Besides the regularized objective, two additional techniques are used to further prevent overfitting. The first technique is shrinkage, introduced by Friedman. Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each tree and leaves space for future trees to improve the model. The second technique is column (feature) subsampling, in which each tree is built from a random subset of the features, as in random forests; this also speeds up computation.
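
The additive training and shrinkage described above can be illustrated with a minimal sketch in plain Python. This is not XGBoost's implementation: it fits one-split regression stumps to the residuals of a squared-error loss and leaves out the regularization term and column subsampling. The names `fit_stump`, `boost` and `eta`, as well as the toy data, are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must put at least one point on each side
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, eta):
    """Additive training: each round fits a stump to the current residuals."""
    predictions = [0.0] * len(xs)
    losses = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        # Shrinkage: only a fraction eta of the new tree's output is added.
        predictions = [p + eta * stump(x) for p, x in zip(predictions, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return losses


# Toy data, made up for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]

losses_full = boost(xs, ys, n_rounds=20, eta=1.0)
losses_shrunk = boost(xs, ys, n_rounds=20, eta=0.1)
print("first-round MSE, eta=1.0:", losses_full[0])
print("first-round MSE, eta=0.1:", losses_shrunk[0])
```

With a smaller `eta`, each tree moves the predictions less, so more rounds are needed to reach the same training error; that slack is what leaves room for later trees to correct earlier ones.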


Understanding Code:

First, let us import the necessary libraries for the project.


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from sklearn.metrics import r2_score
import joblib
import pickle


Now, let us load the required data into the system.


df = pd.read_excel('Train Data.xlsx',engine='openpyxl')


Before we start preprocessing our data, let us explore it using a few visualizations.


# Source vs Price

sns.catplot(y = "Price", x = "Source", data = df.sort_values("Price", ascending = False), kind="boxen", height = 4, aspect = 3)
plt.show()


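As you can see, I used seaborn's 'catplot' function with kind="boxen" to plot the distribution of "Price" for each "Source".


plt.figure(figsize=(15,10))
plt.title('Airline Count plot')
sns.countplot(x='Airline', data=df)  # pass the column by name; recent seaborn versions expect keyword arguments
plt.xticks(rotation=90)  # rotate the airline names after plotting so the rotation is kept
plt.show()


I also used seaborn's 'countplot' function, together with 'matplotlib.pyplot' for the figure size, title and tick rotation, to plot the number of records for each value of the "Airline" feature.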

Now, let us dive into the data preprocessing part of the project and search for missing data in our dataframe.


df.isnull().sum()


There are two missing values in the "Route" and "Total_Stops" features, as you can see.

So let us drop the rows which include missing values.


df.dropna(inplace = True)


Now let us preprocess the features that contain date and time values.


df["Journey_day"] = pd.to_datetime(df.Date_of_Journey, format="%d/%m/%Y").dt.day
df["Journey_month"] = pd.to_datetime(df["Date_of_Journey"], format = "%d/%m/%Y").dt.month
df.drop(["Date_of_Journey"], axis = 1, inplace = True)

df["Dep_hour"] = pd.to_datetime(df["Dep_Time"]).dt.hour
df["Dep_min"] = pd.to_datetime(df["Dep_Time"]).dt.minute
df.drop(["Dep_Time"], axis = 1, inplace = True)

df["Arrival_hour"] = pd.to_datetime(df.Arrival_Time).dt.hour
df["Arrival_min"] = pd.to_datetime(df.Arrival_Time).dt.minute
df.drop(["Arrival_Time"], axis = 1, inplace = True)

As you can see, I converted the "Date_of_Journey", "Dep_Time" and "Arrival_Time" features into numeric "day", "month", "hour" and "minute" columns.
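
To make the conversion concrete, here is a minimal sketch for a single record using only the standard library. The sample values are assumptions: the date format matches the "%d/%m/%Y" format used above, while the "HH:MM" time format and the date suffix on the arrival time are guesses about how the raw columns look. The helper `hour_and_minute` is hypothetical.

```python
from datetime import datetime

# Assumed sample values, for illustration only.
date_of_journey = "24/03/2019"
dep_time = "22:20"
arrival_time = "01:10 22 Mar"  # assumption: some arrival times carry a date suffix

journey = datetime.strptime(date_of_journey, "%d/%m/%Y")
journey_day, journey_month = journey.day, journey.month


def hour_and_minute(value):
    """Parse the leading HH:MM token and ignore anything after it."""
    clock = datetime.strptime(value.split()[0], "%H:%M")
    return clock.hour, clock.minute


dep_hour, dep_min = hour_and_minute(dep_time)
arrival_hour, arrival_min = hour_and_minute(arrival_time)
print(journey_day, journey_month, dep_hour, dep_min, arrival_hour, arrival_min)
```

Stating the expected format explicitly makes a malformed value fail loudly instead of being parsed in an unexpected way.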


# Duration
duration = list(df["Duration"])

for i in range(len(duration)):
    if len(duration[i].split()) != 2:  # Check if duration contains only hours or only minutes
        if "h" in duration[i]:
            duration[i] = duration[i].strip() + " 0m"  # Adds 0 minutes
        else:
            duration[i] = "0h " + duration[i]  # Adds 0 hours

duration_hours = []
duration_mins = []
for i in range(len(duration)):
    duration_hours.append(int(duration[i].split(sep = "h")[0]))  # Extracts hours from duration
    duration_mins.append(int(duration[i].split(sep = "m")[0].split()[-1]))  # Extracts only minutes from duration

# Adding the duration columns to the dataframe
df["Duration_hours"] = duration_hours
df["Duration_mins"] = duration_mins
df.drop(["Duration"], axis = 1, inplace = True)

I also converted the "Duration" feature into "hour" and "minute" formats using the above code.
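
The same parsing can be written more compactly with a regular expression. The sketch below is an alternative to the loop above, not the code used in this project; `parse_duration` is a hypothetical helper, and the sample strings assume durations of the form "2h 50m", "19h" or "5m".

```python
import re


def parse_duration(text):
    """Return (hours, minutes) from strings such as '2h 50m', '19h' or '5m'."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    return (int(hours.group(1)) if hours else 0,
            int(minutes.group(1)) if minutes else 0)


# Assumed sample values, for illustration only.
for sample in ["2h 50m", "19h", "5m"]:
    print(sample, "->", parse_duration(sample))
```

With pandas, such a helper could be applied to the whole column with `df["Duration"].apply(parse_duration)`.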

Now, let us encode the categorical features before feeding the data into a model.


Destination_encoder = LabelEncoder()
df['Destination'] = Destination_encoder.fit_transform(df['Destination'])
with open('Destination_encoder.pkl', 'wb') as f:  # the file is closed automatically
    pickle.dump(Destination_encoder, f)

Source_encoder = LabelEncoder()
df['Source'] = Source_encoder.fit_transform(df['Source'])
with open('Source_encoder.pkl', 'wb') as f:
    pickle.dump(Source_encoder, f)

Airline_encoder = LabelEncoder()
df['Airline'] = Airline_encoder.fit_transform(df['Airline'])
with open('Airline_encoder.pkl', 'wb') as f:
    pickle.dump(Airline_encoder, f)

I used the "LabelEncoder" class to encode the categorical features as integers and saved each fitted encoder to disk with "pickle".
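
Conceptually, label encoding assigns an integer code to each distinct label in sorted order. The plain-Python sketch below illustrates that idea and why the fitted mapping is worth saving; it is not scikit-learn's implementation. `fit_label_mapping` is a hypothetical helper, and the city names are assumed sample values.

```python
import pickle


def fit_label_mapping(values):
    """Map each distinct label to an integer, in sorted order of the labels."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Assumed sample city names, for illustration only.
train_sources = ["Delhi", "Kolkata", "Banglore", "Delhi", "Mumbai"]
mapping = fit_label_mapping(train_sources)
encoded = [mapping[city] for city in train_sources]
print(mapping)
print(encoded)

# Round-trip through pickle, as one would when reloading a saved encoder.
restored = pickle.loads(pickle.dumps(mapping))

# A label that was not present at fit time has no code; handle it explicitly.
new_city = "Chennai"
code = restored.get(new_city)  # None signals an unseen label
print(new_city, "->", code)
```

The same caveat applies to the real encoders: "LabelEncoder.transform" raises an error for a label it did not see during fitting, so new data has to be checked against the saved classes.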

Also, let us preprocess the "Route" feature by splitting and handling missing values.


route_parts = df['Route'].str.split('→ ')
for i in range(5):
    # Rows with fewer than i + 1 stops yield NaN here, which is replaced by "None".
    # Assigning the result back avoids calling fillna(inplace=True) on a column
    # selection, which does not reliably update the dataframe in recent pandas versions.
    df[f'Route_{i + 1}'] = route_parts.str[i].fillna("None")
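
For a single route string, the split-and-pad step can be sketched in plain Python as follows. `split_route` is a hypothetical helper and the route is an assumed sample value. Unlike the pandas code above, this sketch also strips the whitespace around each stop and raises an error if a route has more stops than expected.

```python
def split_route(route, max_legs=5, filler="None"):
    """Split 'A → B → C' into exactly max_legs fields, padding with filler."""
    stops = [stop.strip() for stop in route.split("→")]
    if len(stops) > max_legs:
        raise ValueError(f"route has more than {max_legs} stops: {route!r}")
    return stops + [filler] * (max_legs - len(stops))


# Assumed sample route string, for illustration only.
print(split_route("BLR → NAG → DEL"))
```

Stripping the whitespace keeps "BLR" and "BLR " from being treated as two different categories when the columns are encoded.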


Let us also encode the route columns using the "LabelEncoder" class.


# Fit one encoder per route column and save each one under the same file name as before
for col in ["Route_1", "Route_2", "Route_3", "Route_4", "Route_5"]:
    route_encoder = LabelEncoder()
    df[col] = route_encoder.fit_transform(df[col])
    with open(f'{col}_encoder.pkl', 'wb') as f:
        pickle.dump(route_encoder, f)


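Before splitting the data, I dropped the "Route" and "Additional_Info" features ("Route" has already been split into separate columns, and almost 80% of the "Additional_Info" values are "no_info") and replaced the categorical values in the "Total_Stops" feature with integers.


df.drop(["Route", "Additional_Info"], axis = 1, inplace = True)

# Replacing Total_Stops categories with the number of stops (only this column is touched;
# astype(int) raises an error if an unexpected category is present)
df["Total_Stops"] = df["Total_Stops"].map({"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}).astype(int)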


Let us split the data into training and testing sets.


Y = df['Price']
X = df.drop('Price', axis = 1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)

As you can see, I used the "train_test_split" function to split our dataframe into training and testing sets.

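Next, let us scale the features before feeding them into a model. Tree-based models such as XGBoost are largely insensitive to feature scaling, so this step is optional here.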


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)

X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns)

pickle.dump(scaler, open('scaler.pkl','wb'))

I used the "StandardScaler" class to scale the data. The scaler is fitted on the training set only and then applied to the test set.
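
What standardization does to a single feature can be sketched in plain Python. This is an illustration of the idea, not scikit-learn's code; `fit_scaler` and `apply_scaler` are hypothetical helpers and the values are made up. Like "StandardScaler", the sketch uses the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)


def apply_scaler(column, mean, std):
    """Standardize values with statistics learned elsewhere (std must be non-zero)."""
    return [(value - mean) / std for value in column]


# Assumed toy values for a single feature such as Duration_hours.
train_hours = [2, 4, 4, 4, 5, 5, 7, 9]
test_hours = [3, 10]

mean, std = fit_scaler(train_hours)  # statistics come from the training data only
train_scaled = apply_scaler(train_hours, mean, std)
test_scaled = apply_scaler(test_hours, mean, std)
print(mean, std)
print(test_scaled)
```

Reusing the training statistics for the test set keeps information about the test data from leaking into preprocessing.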

Now, let us dive into the modelling part of the project.


from xgboost import XGBRegressor

xgb = XGBRegressor()
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
xgb.score(X_train, y_train)


As you can see, I used the "XGBRegressor" class to fit the "XGBoost" algorithm to our training data and to predict prices for the test set.
The last line reports the R² score on the training data.

Finally, let us check the model's performance using a few metrics. 


print('r2 score', r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


As you can see from the metrics above, the model performed well on the held-out test data and is ready to be used in production.
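
For reference, the four metrics printed above can be computed directly from their definitions. The sketch below uses plain Python; `regression_metrics` is a hypothetical helper, and the prices are made up for illustration and are not results from the model.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - sum(e * e for e in errors) / total_ss
    return {"MAE": mae, "MSE": mse, "RMSE": sqrt(mse), "R2": r2}


# Made-up prices, for illustration only; not results from the model.
y_true = [3000, 5000, 7000, 9000]
y_pred = [3500, 4500, 7000, 10000]
result = regression_metrics(y_true, y_pred)
print(result)
```

MAE and RMSE are expressed in the same unit as the ticket price, which makes them easier to interpret than MSE; R² measures the share of the price variance that the predictions explain.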



Thank you for your time.

