Hashwanth Gogineni's other Models Reports


Hotel Booking Cancellation Prediction


Model Overview

Hotel Booking:


The hotel industry is one of the faster-growing businesses in the tourism sector, especially with the rise of giant online travel agencies (OTAs) that make booking a hotel easier than it has ever been. According to Portugal's National Institute of Statistics, hotel revenue rose approximately 18% to $3.6 billion in 2017. The industry's growth is also visible in Portugal's total number of hotel guests, which in 2017 was roughly double the country's population.


Total of hotel guests in 2017: '20.6 Million'


Total Portugal Population in 2017: '10.31 Million'


According to the 'Deloitte Hospitality Atlas 2019', Lisbon was named the most attractive European city for hotel investment.


However, the growth of the hotel industry comes with problems too. One of them is the rising rate of cancellations, which went from under 33% in 2014 to 40% in 2018.


Project Implementation:


Travel companies and hotels can use the project to retain their customers and scale their businesses. The project also provides a detailed analysis of customer data, which can help organizations in the hotel industry understand customer behaviour.


Dataset:


The data is originally from the article 'Hotel Booking Demand Datasets,' written by 'Nuno Antonio,' 'Ana Almeida,' and 'Luis Nunes' for Data in Brief, Volume 22, February 2019.


The data was downloaded and cleaned by 'Thomas Mock' and 'Antoine Bichat' for #TidyTuesday during the week of February 11th, 2020.


This data contains booking information for a 'city hotel' and a 'resort hotel.' In addition, it contains information such as when the booking was made, length of stay, the number of adults, children, and babies, and the number of available parking spaces, among other things.




Random Forest:


'Random forest' is a supervised ensemble learning algorithm used for both classification and regression problems, although it is mainly used for 'classification'. A forest is made up of trees, and more trees generally mean a more robust forest. Similarly, the random forest algorithm builds decision trees on random samples of the data, gets a prediction from each tree, and selects the final answer by majority voting (or by averaging, for regression). It usually does better than a single decision tree because combining many trees reduces over-fitting.


The fundamental concept behind random forest is simple but powerful: the wisdom of crowds.
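To make the 'wisdom of crowds' idea concrete, here is a minimal simulation in plain Python. It is only a sketch under a strong assumption: every voter is right 60% of the time and votes independently of the others. Real decision trees in a forest are correlated, so the gain in practice is smaller. The function name and the numbers are made up for illustration.

```python
import random


def majority_vote_accuracy(n_voters, p_correct, n_trials, seed=0):
    """Fraction of trials in which a majority of independent voters is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each voter is correct with probability p_correct, independently
        correct_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / n_trials


single = majority_vote_accuracy(n_voters=1, p_correct=0.6, n_trials=2000)
crowd = majority_vote_accuracy(n_voters=101, p_correct=0.6, n_trials=2000)
print(f"single voter: {single:.2f}, crowd of 101: {crowd:.2f}")
```

The crowd is right far more often than any single voter, which is the intuition behind combining many trees.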




Understanding Code:


First, let us import the required libraries for the project.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
import pickle
import joblib
from sklearn.metrics import classification_report


And now load the data into the system.


df= pd.read_csv('hotel_bookings.csv')

 


Also, let us have a look at important visualizations of our data.


fig = plt.figure(figsize=(10,5))
sns.countplot(data=df, x = 'arrival_date_month')
plt.xlabel('Month', fontsize=15)
plt.xticks(rotation=45,fontsize=11);


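# Count bookings per country (assumption: every row counts as one guest booking)
country_wise_guests = df['country'].value_counts().reset_index()
country_wise_guests.columns = ['country', 'No of guests']

basemap = folium.Map()
guests_map = px.choropleth(country_wise_guests, locations = country_wise_guests['country'],
color = country_wise_guests['No of guests'], hover_name = country_wise_guests['country'])
guests_map.show()


As you can see, I used the 'folium.Map()' function to create a base world map and the 'px.choropleth' function to draw the number of guests per country.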


Coming to the 'Data Preprocessing' part, let us search for missing values in the data.


df.isnull().sum()


df['agent'] = df['agent'].fillna(0)
df['children'] = df['children'].fillna(0)
df['country'] = df['country'].fillna('PRT')
df = df.drop('company', axis = 1)

As you can see, missing values exist in our data. So let us replace the null values with fixed values (0 for 'agent' and 'children', 'PRT' for 'country') and drop the 'company' column because it has many missing values.
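A common alternative to fixed fill values is to fill each column with its most frequent value (the mode). The sketch below shows this on a small made-up table. The 'toy' DataFrame and its values are assumptions for illustration, not rows from the real dataset.

```python
import pandas as pd

# Toy data standing in for the booking table (values are made up)
toy = pd.DataFrame({
    "children": [0.0, 2.0, None, 0.0],
    "country": ["PRT", None, "GBR", "PRT"],
})

# Count missing values per column, the same check as df.isnull().sum()
missing_before = toy.isnull().sum()

# Fill each column with its most frequent non-missing value (the mode)
filled = toy.copy()
for col in filled.columns:
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(missing_before)
print(filled)
```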


Let us generate a heatmap and see how the numeric features in our data correlate with each other and with the target.


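corr = df.corr(numeric_only=True)  # text columns cannot be correlated; recent pandas versions raise an error without this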

fig,axes = plt.subplots(1,1,figsize=(20,10))
sns.heatmap(corr, annot= True)
plt.show()


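I used the 'sns.heatmap' function to generate a heatmap of the correlations in our data.
As you can see, a few features are not important for our model to predict the output, so let us eliminate those features from our data using the 'drop' function. Note that 'reservation_status' directly records whether a booking was cancelled, so keeping it would give the answer away to the model.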
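When the heatmap is large, it can be easier to read one column of it: the correlation of every feature with the target. Here is a small sketch of that idea. The 'toy' table and its values are made up, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows; only the column names are borrowed from the dataset
toy = pd.DataFrame({
    "is_canceled": [0, 0, 1, 1, 0, 1],
    "lead_time": [5, 10, 200, 150, 20, 300],
    "babies": [0, 1, 0, 0, 1, 0],
})

corr = toy.corr()

# Absolute correlation of each feature with the target, strongest first
target_corr = (
    corr["is_canceled"]
    .drop("is_canceled")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

Keep in mind that correlation only captures linear relationships, so a low value here does not prove that a feature is useless to a tree-based model.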


# dropping columns that are not useful

useless_col = ['arrival_date_year', 'assigned_room_type', 'reservation_status', 'country', 'arrival_date_month']

df.drop(useless_col, axis = 1, inplace = True)


Before we hop into encoding, let us convert the 'reservation_status_date' feature into 'year', 'month' and 'day' features using the pandas '.dt' accessor.


df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'])
df['year'] = df['reservation_status_date'].dt.year
df['month'] = df['reservation_status_date'].dt.month
df['day'] = df['reservation_status_date'].dt.day

df.drop(['reservation_status_date'] , axis = 1, inplace = True)


Now, let us encode the data into numeric codes using the 'LabelEncoder' class.


from sklearn import preprocessing

categorical = ['hotel', 'lead_time', 'arrival_date_week_number',
'arrival_date_day_of_month', 'stays_in_weekend_nights',
'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
'market_segment', 'distribution_channel', 'is_repeated_guest',
'previous_cancellations', 'previous_bookings_not_canceled',
'reserved_room_type', 'booking_changes', 'deposit_type', 'agent',
'days_in_waiting_list', 'customer_type', 'adr',
'required_car_parking_spaces', 'total_of_special_requests',
'is_canceled', 'year', 'month', 'day']
for feature in categorical:
    # fit a fresh encoder per column and replace its values with integer codes
    le = preprocessing.LabelEncoder()
    df[feature] = le.fit_transform(df[feature])
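To see what label encoding does to a column, here is a plain-Python sketch. It assumes codes are assigned to the distinct values in sorted order, which is how scikit-learn's 'LabelEncoder' behaves. The helper 'label_encode' is a hypothetical name used only for this illustration.

```python
def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in values], classes


# A text column: each hotel type becomes an integer
hotel_codes, hotel_classes = label_encode(
    ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"]
)
print(hotel_classes, hotel_codes)

# A numeric column: values are replaced by their rank among the distinct values
lead_codes, lead_classes = label_encode([7, 342, 7, 14])
print(lead_classes, lead_codes)
```

The second call shows a side effect worth knowing: for a numeric column such as 'lead_time', the original distances between values are lost and only their order remains.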


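Before we feed our dataset into the model, we separate the target 'is_canceled' from the features, hold out a validation set, and scale the features using the 'StandardScaler' class.


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Assumption: 'is_canceled' is the target and 20% of the rows are held out for validation
X, X_val, Y, Y_val = train_test_split(df.drop('is_canceled', axis = 1), df['is_canceled'], test_size = 0.2, random_state = 42)
scaler = StandardScaler()
X = scaler.fit_transform(X)  # learn mean and standard deviation from the training rows only
X_val = scaler.transform(X_val)  # reuse the training statistics on the validation rows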
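Standard scaling subtracts the mean of a feature and divides by its standard deviation, both learned from the training data. Below is a plain-Python sketch of that arithmetic for a single feature. The helper names and the numbers are made up, and the population standard deviation is used because that is what 'StandardScaler' computes.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Apply z = (x - mu) / sigma to every value."""
    return [(x - mu) / sigma for x in column]


train_lead_time = [10, 20, 30, 40, 50]
val_lead_time = [30, 60]

# Statistics come from the training data only, then get reused for validation
mu, sigma = fit_scaler(train_lead_time)
train_scaled = transform(train_lead_time, mu, sigma)
val_scaled = transform(val_lead_time, mu, sigma)

print(train_scaled)
print(val_scaled)
```

Reusing the training statistics on the validation data keeps information about the validation set from leaking into the preprocessing step.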

 


Finally, I used the 'Random Forest' algorithm to solve the use case.


from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators=400)

model_rf.fit(X, Y)

Y_Pred = model_rf.predict(X_val)

As you can see, I used the 'RandomForestClassifier' class to apply the 'Random Forest' algorithm to our data, with 400 trees ('n_estimators=400').

Let us create a classification report of our model for a better understanding of the model's results.
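The sketch below shows how such a report can be produced with 'classification_report' from 'sklearn.metrics'. The labels are made up so that the example runs on its own, which means the numbers it prints are not the results of the model in this article. With the real model you would pass the true validation labels and 'Y_Pred' instead.

```python
from sklearn.metrics import classification_report

# Made-up labels: 1 = cancelled, 0 = not cancelled
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

# Text table with precision, recall and F1-score per class
print(classification_report(y_true, y_pred))

# The same numbers as a dictionary, handy for further processing
report = classification_report(y_true, y_pred, output_dict=True)
print(report["accuracy"])
```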



As you can see, the model performed really well on our data.


Finally, let us save the model we trained.


with open('model.pkl', 'wb') as f:  # the 'with' block closes the file after writing
    pickle.dump(model_rf, f)
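A saved model is only useful if it can be loaded again. Here is a minimal round-trip sketch with 'pickle'. A plain dictionary stands in for the trained model and a temporary folder is used for the file. Both are assumptions made so that the example runs on its own.

```python
import os
import pickle
import tempfile

# A plain dictionary stands in for the trained model (hypothetical placeholder)
stand_in_model = {"n_estimators": 400}

# Write the object to a file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Read it back; a real model would be ready to call .predict() at this point
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

Only load pickle files from sources you trust, because unpickling can execute arbitrary code.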

Thank you for your time.

