Flight Fare Price Prediction

Hashwanth Gogineni


Air Travel:

Air travel is a form of travel in vehicles that can sustain flight, such as aeroplanes, jet aircraft, helicopters, hot air balloons, blimps, gliders, hang gliders and parachutes. The use of air travel has greatly increased in recent times; worldwide, it doubled between the 1980s and the 2000s. Modern air travel is considered much safer than road travel.

Flight Fares:

For a traveller, it is important to know the fare of a trip. Because flight ticket prices change abruptly, checking different websites and comparing deals becomes tedious. A flight fare model can help travellers find the best time to buy their tickets and understand trends in the airline industry.

Flight ticket prices can be difficult to guess: the price we see today may be different when we check the same flight tomorrow. Travellers often say that flight ticket prices are very unpredictable. But, as data scientists, we will show that, given the right data, ticket prices can be predicted.

Project Implementation:

Companies in the airline industry can use the project to predict flight ticket prices and to learn more about travellers' behaviours and choices.

Dataset:

The dataset includes flight ticket prices for various airlines and city pairs between March and June 2019.

Size of training set: 10,683 records

Features:

Airline: Name of the Airline company.

Date_of_Journey: The actual date of the journey.

Source: The city from which the flight departs.

Destination: The city where the journey ends.

Route: The route taken by the aeroplane to reach the destination.

Dep_Time: The time when the journey starts.

Arrival_Time: Time of arrival.

Duration: Total duration of the flight.

Total_Stops: Total stops between the source and destination.

Additional_Info: Additional information about the flight.

Price: The price of the ticket.

XGBoost:

'XGBoost' stands for eXtreme Gradient Boosting. It has become popular in recent years and is widely used in applied machine learning and Kaggle competitions on 'structured data', largely because of the algorithm's scalability.

'XGBoost' is an extension to gradient boosted decision trees (GBM) and is specially designed to improve speed and performance.

XGBoost Features:

1) Regularized Learning: The regularization term helps smooth the final learned weights to avoid over-fitting the data.

2) Gradient Tree Boosting Technique: The tree ensemble model cannot be optimized using traditional optimization methods in the Euclidean space, because its parameters are functions (trees). Instead, the model is trained in an additive manner: each boosting step adds the tree that most improves the objective.

3) Shrinkage and Column Subsampling: Besides the regularized objective, two additional techniques are used to further prevent overfitting. The first technique is shrinkage, introduced by Friedman. Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model. The second technique is column (feature) subsampling, which trains each tree on a random subset of the features.
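To make additive training and shrinkage concrete, here is a minimal sketch in plain Python. It is not XGBoost itself: it boosts one-split trees ("stumps") on squared error and scales each new stump by η. The function names and the toy fares are made up for illustration.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree (a "stump") to the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so both sides stay non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees, eta):
    """Additive training: each new stump fits the current residuals,
    and its output is scaled by the shrinkage factor eta."""
    preds = [sum(y) / len(y)] * len(y)  # start from the mean target
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        preds = [pi + eta * stump(xi) for xi, pi in zip(x, preds)]
    return preds


def mse(y, preds):
    """Mean squared error between targets and predictions."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Toy data (made up): number of stops vs. fare
stops = [0, 0, 1, 1, 2, 2, 3, 3]
fares = [3900, 4100, 7600, 8000, 10500, 11000, 13800, 14200]

for eta in (1.0, 0.1):
    for n_trees in (1, 10, 50):
        err = mse(fares, boost(stops, fares, n_trees, eta))
        print(f"eta={eta:<4} trees={n_trees:<3} training MSE={err:,.0f}")
```

With a small η, a single tree corrects only a fraction of the error, so more trees are needed to reach the same training fit. That slower, more gradual fit is what makes shrinkage act as a guard against overfitting.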

Understanding Code:

First, let us import the necessary libraries for the project.

import numpy as np

import pandas as pd 

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder

from sklearn import metrics

from sklearn.metrics import r2_score

import joblib

import pickle

Now, let us load the required data into the system.

df = pd.read_excel('Train Data.xlsx',engine='openpyxl')

Before we start preprocessing our data let us explore the data using a few visualizations.

# Source vs Price



sns.catplot(y = "Price", x = "Source", data = df.sort_values("Price", ascending = False), kind="boxen", height = 4, aspect = 3)

plt.show()

As you can see, I used seaborn's 'catplot' function with kind="boxen" to plot the distribution of ticket prices for each "Source" city.

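plt.figure(figsize=(15,10))

plt.title('Airline Count plot')

# Pass the column by keyword; recent seaborn versions no longer accept it positionally

sns.countplot(x='Airline', data=df)

# Rotate the tick labels after plotting so the rotation is not overwritten

plt.xticks(rotation=90)

plt.show()

I also used seaborn's 'countplot' function to plot the number of flights for each value of our data's "Airline" feature.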

Now, let us dive into the data preprocessing part of the project and search for missing data in our dataframe.

df.isnull().sum()

There are two missing values in the "Route" and "Total_Stops" features, as you can see.

So let us drop the rows which include missing values.

df.dropna(inplace = True)

Now let us preprocess the features that contain date and time values.

df["Journey_day"] = pd.to_datetime(df.Date_of_Journey, format="%d/%m/%Y").dt.day

df["Journey_month"] = pd.to_datetime(df["Date_of_Journey"], format = "%d/%m/%Y").dt.month

df.drop(["Date_of_Journey"], axis = 1, inplace = True)



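# Parse each time column once and reuse the result (assumes times are stored as HH:MM)

dep_time = pd.to_datetime(df["Dep_Time"], format="%H:%M")

df["Dep_hour"] = dep_time.dt.hour

df["Dep_min"] = dep_time.dt.minute

df.drop(["Dep_Time"], axis = 1, inplace = True)



# Keep only the leading HH:MM token in case a value carries a trailing date

arrival_time = pd.to_datetime(df["Arrival_Time"].str.split().str[0], format="%H:%M")

df["Arrival_hour"] = arrival_time.dt.hour

df["Arrival_min"] = arrival_time.dt.minute

df.drop(["Arrival_Time"], axis = 1, inplace = True)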

As you can see, I converted the "Date_of_Journey", "Dep_Time" and "Arrival_Time" features into separate "day", "month", "hour" and "minute" columns.
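To see what the format strings do on single values, here is a small sketch using only the standard library. The helper names and the sample strings are hypothetical; in particular, the "HH:MM followed by a date" shape of the last sample is an assumption about how some arrival times may look.

```python
from datetime import datetime


def journey_parts(date_text):
    """Return (day, month) from a 'DD/MM/YYYY' string."""
    parsed = datetime.strptime(date_text, "%d/%m/%Y")
    return parsed.day, parsed.month


def clock_parts(time_text):
    """Return (hour, minute) from the leading 'HH:MM' token of a string."""
    parsed = datetime.strptime(time_text.split()[0], "%H:%M")
    return parsed.hour, parsed.minute


print(journey_parts("24/03/2019"))   # (24, 3)
print(clock_parts("22:20"))          # (22, 20)
print(clock_parts("01:10 22 Mar"))   # (1, 10)
```

The same format codes ("%d/%m/%Y", "%H:%M") are the ones pandas accepts in "pd.to_datetime", so the sketch mirrors the dataframe code one value at a time.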

# Duration

duration = list(df["Duration"])



for i in range(len(duration)):

    if len(duration[i].split()) != 2:    # Check if duration contains only hour or mins

        if "h" in duration[i]:

            duration[i] = duration[i].strip() + " 0m"   # Adds 0 minute

        else:

            duration[i] = "0h " + duration[i]           # Adds 0 hour



duration_hours = []

duration_mins = []

for i in range(len(duration)):

    duration_hours.append(int(duration[i].split(sep = "h")[0]))    # Extract hours from duration

    duration_mins.append(int(duration[i].split(sep = "m")[0].split()[-1]))   # Extracts only minutes from duration

# Add the parsed duration columns to the dataframe

df["Duration_hours"] = duration_hours

df["Duration_mins"] = duration_mins

df.drop(["Duration"], axis = 1, inplace = True)

I also split the "Duration" feature into separate hour and minute columns using the above code.
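The same parsing can be written more compactly with a regular expression. This is only an alternative sketch, assuming durations look like "2h 50m", "19h" or "5m"; the function name "parse_duration" is made up for illustration.

```python
import re

# Optional "<n>h" part followed by an optional "<n>m" part
_DURATION = re.compile(r"^\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*$")


def parse_duration(text):
    """Return (hours, minutes) from strings such as '2h 50m', '19h' or '5m'."""
    match = _DURATION.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = match.groups()
    return int(hours or 0), int(minutes or 0)


print(parse_duration("2h 50m"))  # (2, 50)
print(parse_duration("19h"))     # (19, 0)
print(parse_duration("5m"))      # (0, 5)
```

Unlike the loop above, this version raises an error on a value it does not understand instead of silently producing a wrong number, which can help when the data contains unexpected entries.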

Now, let us encode the categorical features before feeding the data into a model.

Destination_encoder=LabelEncoder()

df['Destination'] = Destination_encoder.fit_transform(df['Destination'])

pickle.dump(Destination_encoder, open('Destination_encoder.pkl','wb'))



Source_encoder=LabelEncoder()

df['Source'] = Source_encoder.fit_transform(df['Source'])

pickle.dump(Source_encoder, open('Source_encoder.pkl','wb'))



Airline_encoder=LabelEncoder()

df['Airline'] = Airline_encoder.fit_transform(df['Airline'])

pickle.dump(Airline_encoder, open('Airline_encoder.pkl','wb'))

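I used scikit-learn's "LabelEncoder" class to convert each categorical feature into integer codes, and saved every fitted encoder with "pickle" so the same mapping can be reused at prediction time.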
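Here is a small sketch of what a fitted encoder does and how a pickled copy behaves. The city names are made-up sample values, and the round trip goes through memory ("pickle.dumps" / "pickle.loads") rather than a file, purely to keep the example self-contained.

```python
import pickle

from sklearn.preprocessing import LabelEncoder

cities = ["Delhi", "Banglore", "Kolkata", "Delhi"]  # made-up sample values

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)
print(encoder.classes_.tolist())  # ['Banglore', 'Delhi', 'Kolkata'] (sorted order)
print(codes.tolist())             # [1, 0, 2, 1]

# Round trip through pickle, as the code above does with files on disk
restored = pickle.loads(pickle.dumps(encoder))
print(restored.transform(["Kolkata"]).tolist())    # [2]
print(restored.inverse_transform([0]).tolist())    # ['Banglore']

# A label that was not seen during fit raises ValueError
try:
    restored.transform(["Mumbai"])
except ValueError as err:
    print("unseen label:", err)
```

Note that the integer codes follow alphabetical order and carry no real meaning. Tree-based models can usually still split on them, but a new category at prediction time has to be handled separately.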

Also, let us preprocess the "Route" feature by splitting and handling missing values.

df['Route_1']=df['Route'].str.split('→ ').str[0]

df['Route_2']=df['Route'].str.split('→ ').str[1]

df['Route_3']=df['Route'].str.split('→ ').str[2]

df['Route_4']=df['Route'].str.split('→ ').str[3]

df['Route_5']=df['Route'].str.split('→ ').str[4]



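# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions

df['Route_1'] = df['Route_1'].fillna("None")

df['Route_2'] = df['Route_2'].fillna("None")

df['Route_3'] = df['Route_3'].fillna("None")

df['Route_4'] = df['Route_4'].fillna("None")

df['Route_5'] = df['Route_5'].fillna("None")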
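The route splitting can also be done in one pass with "expand=True". The following is a sketch on two made-up routes, not the code used for the results in this report; it also trims the spaces around each stop name.

```python
import pandas as pd

# Made-up sample routes
routes = pd.Series(["BLR → DEL", "CCU → IXR → BBI → BLR"])

parts = routes.str.split("→", expand=True)         # one column per stop
parts = parts.apply(lambda col: col.str.strip())    # trim spaces around names
parts = parts.reindex(columns=range(5))             # always keep five columns
parts = parts.fillna("None")                        # missing stops become "None"
parts.columns = [f"Route_{i + 1}" for i in range(5)]

print(parts)
```

Shorter routes simply get "None" in the unused columns, which matches the behaviour of the five separate assignments above.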

Let us also encode the route columns using the "LabelEncoder" class.

Route_1_encoder=LabelEncoder()

df["Route_1"]=Route_1_encoder.fit_transform(df['Route_1'])

pickle.dump(Route_1_encoder, open('Route_1_encoder.pkl','wb'))



Route_2_encoder=LabelEncoder()

df["Route_2"]=Route_2_encoder.fit_transform(df['Route_2'])

pickle.dump(Route_2_encoder, open('Route_2_encoder.pkl','wb'))



Route_3_encoder=LabelEncoder()

df["Route_3"]=Route_3_encoder.fit_transform(df['Route_3'])

pickle.dump(Route_3_encoder, open('Route_3_encoder.pkl','wb'))



Route_4_encoder=LabelEncoder()

df["Route_4"]=Route_4_encoder.fit_transform(df['Route_4'])

pickle.dump(Route_4_encoder, open('Route_4_encoder.pkl','wb'))



Route_5_encoder=LabelEncoder()

df["Route_5"]=Route_5_encoder.fit_transform(df['Route_5'])

pickle.dump(Route_5_encoder, open('Route_5_encoder.pkl','wb'))

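Before splitting the data, I dropped the original "Route" column, which is now covered by the encoded route columns, and the "Additional_Info" feature, as almost 80% of its values are "no_info". I also replaced the categorical values in the "Total_Stops" feature with integers.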

df.drop(["Route", "Additional_Info"], axis = 1, inplace = True)



# Replace Total_Stops categories with integer stop counts (only this column is touched)

df["Total_Stops"] = df["Total_Stops"].map({"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4})

Let us split the data into training and testing sets.

Y = df['Price']

X = df.drop('Price', axis = 1)



from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)

As you can see, I used the "train_test_split" function to split our dataframe into training (70%) and testing (30%) sets, with a fixed "random_state" so the split is reproducible.

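Next, let us scale the features before feeding them into a model. Tree-based models such as XGBoost are largely insensitive to feature scaling, so this step is optional here.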

from sklearn.preprocessing import StandardScaler



scaler = StandardScaler()



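# Fit the scaler on the training set only, and keep the original row index

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns, index = X_train.index)



# Reuse the training statistics for the test set

X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns, index = X_test.index)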



pickle.dump(scaler, open('scaler.pkl','wb'))

I used the "StandardScaler" class to scale the data. The scaler is fitted on the training set only and then applied to the test set.
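To show what standard scaling computes, here is a hand-rolled sketch using only the standard library. The helper names and the numbers are made up; the point is that the mean and standard deviation come from the training values only and are then reused for the test values.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from training data."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously learned statistics: z = (x - mean) / std."""
    return [(v - mean) / std for v in values]


train_hours = [2, 4, 6, 8]   # made-up training values
test_hours = [5, 10]         # made-up test values

mean, std = fit_scaler(train_hours)        # statistics from the training set only
print(transform(train_hours, mean, std))
print(transform(test_hours, mean, std))    # test set reuses the training statistics
```

Fitting the scaler on the full dataset instead would let information from the test set leak into training, which is why "fit_transform" is called on the training set and only "transform" on the test set.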

Now, let us dive into the modelling part of the project.

from xgboost import XGBRegressor



xgb = XGBRegressor()

xgb.fit(X_train, y_train) 

y_pred = xgb.predict(X_test)

xgb.score(X_train, y_train)

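As you can see, I used the "XGBRegressor" class from the "xgboost" package, with its default hyperparameters, to apply the "XGBoost" algorithm to our data.
The last line reports the R² score on the training set, so it shows how well the model fits data it has already seen.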

Finally, let us check the model's performance using a few metrics.

print('r2 score', r2_score(y_test, y_pred))

print('MAE:', metrics.mean_absolute_error(y_test, y_pred))

print('MSE:', metrics.mean_squared_error(y_test, y_pred))

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

As you can see, the model performed well on the held-out test set, which makes it a good candidate for deployment.
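For reference, the four metrics printed above can be computed by hand as follows. This is a plain-Python sketch on made-up numbers, not output from the model; the function name "regression_metrics" is hypothetical.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute R², MAE, MSE and RMSE for two equally long sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    mse = sum(e * e for e in errors) / n       # mean squared error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total variation
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": mae,
        "mse": mse,
        "rmse": sqrt(mse),
    }


# Made-up targets and predictions
print(regression_metrics([3, 5, 7, 9], [2, 5, 8, 9]))
```

MAE and RMSE are in the same unit as the ticket price, which makes them easier to interpret than MSE, while R² tells us what share of the price variation the model explains.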

Thank you for your time.
