Presentation On Flight Price Prediction 2


FLIGHT PRICE PREDICTION

PRESENTED BY: Saurabh Yadav


INDEX
Introduction
Problem Statement.
Problem Understanding.
Benefits of Flight Price Prediction.
Data Analysis and Model Building Flowchart.
Exploratory data analysis.
Visualizations.
Analysis.
Model Building.
Hyper Parameter Tuning.
Saving the model and predictions from saved best model.
Conclusion.
INTRODUCTION
The airline industry is one of the most sophisticated users of dynamic pricing, relying on proprietary algorithms and hidden variables to maximise revenue.
The price of a flight ticket depends on several factors. The seller knows all of them, but buyers can access only limited information, which is not enough to predict airfares. Features such as departure time, arrival time and time of day can help indicate the best time to buy a ticket.
The number of people flying has increased significantly. It is difficult for airlines to maintain prices because they change dynamically with conditions, so we use machine learning models to address this problem.
This can help airlines predict what prices they can maintain. It can also help customers predict future flight prices and plan their journeys accordingly.
PROBLEM STATEMENT:
Anyone who has booked a flight ticket knows how unexpectedly prices vary. The cheapest available ticket on a given flight becomes more or less expensive over time. This usually happens as an attempt to maximise revenue based on:
1. Time-of-purchase patterns (making sure last-minute purchases are expensive).
2. Keeping the flight as full as the airline wants it (raising prices on a flight that is filling up in order to reduce sales and hold back inventory for those expensive last-minute purchases).
Business goal: The main aim of this project is to predict the price of flight tickets from various features. The study examines the factors that influence fluctuations in airfares and how they relate to changes in price.
Using this information, we build a system that can help buyers decide whether to buy a ticket. We deploy a machine learning model for flight ticket price prediction and analysis. The model provides an approximate selling price for a ticket based on its features.
PROBLEM UNDERSTANDING
Airlines implement dynamic pricing for their tickets and base their pricing decisions on demand estimation models. The
reason for such a complicated system is that each flight only has a set number of seats to sell, so airlines must regulate
demand. In the case where demand is expected to exceed capacity, the airline may increase prices, to decrease the rate at
which seats fill. On the other hand, a seat that goes unsold represents a loss of revenue and selling that seat for any price
above the service cost for a single passenger would have been a preferable scenario.
Here we are trying to help buyers understand flight ticket prices by deploying machine learning models.
These models would help sellers and buyers understand ticket prices in the market so that they can book their tickets accordingly.
Benefits of Flight Price Prediction
Pricing in the airline industry is often compared to a brain game
between carriers and passengers where each party pursues the best
rates. Carriers love selling tickets at the highest price possible while
still not losing consumers to competitors. Passengers are crazy about
buying flights at the lowest cost available while not missing the
chance to get on board. All this makes flight prices volatile and hard
to predict. But nothing is impossible for people armed with intellect
and algorithms. Predicting flight prices helps individuals anticipate
the future price of flight tickets.
There are two main use cases of flight price prediction in the travel
industry. OTAs and other travel platforms integrate this feature to
attract more visitors looking for the best rates. Airlines employ the
technology to forecast rates of competitors and adjust their pricing
strategies accordingly.
Data Analysis and Model Building Flowchart
1. Import Libraries
2. Import Datasets
3. Data Preprocessing
4. Finding and Treating Null Values
5. EDA & Visualization
6. Identifying Outliers and Skewness
7. Ordinal Encoding
8. Checking Correlation & VIF
9. Model Building
10. R2 score, CV & evaluation metrics
11. Hyper Parameter Tuning
12. Saving the Model & Prediction
EXPLORATORY DATA ANALYSIS
As a first step I imported the required libraries and the datasets, which were in CSV format.
Then I ran basic statistical checks such as shape, nunique (the number of unique values in each column), value counts, info, etc.
While checking the info of the datasets I found some columns with more than 80% null values. Such columns would skew the dataset, so I dropped them as unnecessary.
While looking at the value counts I also found some columns with more than 85% zero values. These can also skew the data and bias the model, so I dropped those columns as well.
Exploratory Data Analysis (EDA) Steps
➢ Importing necessary libraries and loading collected dataset as a data frame.
➢ Checked basic statistical information such as shape, number of unique values, info, unique(), data types and value counts.
➢ Checked for null values, found missing values in the column “Meal_Availability” and filled them with the mode.
➢ Handled the timestamp variables by converting “Dep_Time” and “Arrival_Time” from object data type to datetime data type.
➢ Extracted Departure_Hour, Departure_Min, Arrival_Hour and Arrival_Min columns from Dep_Time and Arrival_Time, and dropped the original columns after extraction.
➢ The target variable “Price” should be continuous numeric data, but because of string characters such as “,” it was read as object data type. So I removed the commas and converted the column to float data type.
➢ From the value counts of Total_Stops I found categorical data, so I replaced the categories with the corresponding number of stops.
➢ Checked statistical description of the data and separated categorical and numeric features.
➢ Visualized each feature using the seaborn and matplotlib libraries with several categorical and numerical plots.
➢ Identified outliers using box plots.
➢ Checked for skewness and reduced the skewness in the numerical column “Duration” using a square root transformation.
➢ Encoded the columns with object data type using LabelEncoder. Used Pearson’s correlation coefficient to check the correlation between the label and the features. A heatmap and a correlation bar graph helped me understand the relationship between each feature and the label.
➢ Separated the feature and label data and scaled the features using StandardScaler so that features on larger scales do not dominate.
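The cleaning steps in the list above could be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the rows are made up, and the category labels such as "Non Stop" and "1 Stop" are assumptions about how the scraped values look. Only the column names come from the steps above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the scraped data (values are invented).
df = pd.DataFrame({
    "Airline": ["IndiGo", "Vistara", "IndiGo", "SpiceJet"],
    "Dep_Time": ["06:15", "21:40", "13:05", "09:30"],
    "Arrival_Time": ["08:30", "23:55", "17:20", "11:45"],
    "Duration": [2.25, 2.25, 4.25, 2.25],
    "Total_Stops": ["Non Stop", "Non Stop", "1 Stop", "Non Stop"],
    "Meal_Availability": ["No Meal", None, "Free Meal", "No Meal"],
    "Price": ["4,512", "11,230", "6,075", "3,999"],
})

# Fill missing categorical values with the most frequent value (mode).
most_common_meal = df["Meal_Availability"].mode()[0]
df["Meal_Availability"] = df["Meal_Availability"].fillna(most_common_meal)

# Parse the time strings, extract hour and minute, then drop the originals.
for col, prefix in [("Dep_Time", "Departure"), ("Arrival_Time", "Arrival")]:
    parsed = pd.to_datetime(df[col], format="%H:%M")
    df[f"{prefix}_Hour"] = parsed.dt.hour
    df[f"{prefix}_Min"] = parsed.dt.minute
df = df.drop(columns=["Dep_Time", "Arrival_Time"])

# Remove the thousands separator so the target becomes a float column.
df["Price"] = df["Price"].str.replace(",", "", regex=False).astype(float)

# Replace stop categories with numbers (the labels are assumed, see above).
stops_map = {"Non Stop": 0, "1 Stop": 1, "2 Stop(s)": 2}
df["Total_Stops"] = df["Total_Stops"].map(stops_map)

# Square root transformation to reduce right skew in Duration.
df["Duration"] = np.sqrt(df["Duration"])

print(df.dtypes)
```

Any category missing from `stops_map` would become NaN after `map`, so on real data it is worth checking the value counts first.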
Visualization: Univariate Analysis for Numerical Variables
The distribution plot shows how the data is distributed in each of the columns.
From the distribution plot we can observe that the columns are only roughly normal; none has a proper bell-shaped curve.
The columns "Duration", "Total_Stops" and "Price" are skewed to the right, as the mean in these columns is much greater than the median (50%).
The data in the column Arrival_Hour is skewed to the left, since the mean is less than the median.
Because skewness is present, we need to reduce it in the numerical columns to limit any bias in the data.
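The mean-versus-median check and the effect of a square root transformation can be illustrated with a short sketch. The duration values below are invented for illustration and are not taken from the scraped dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed durations in hours (not the project's data).
duration = pd.Series([1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 4.0, 9.0, 16.0, 25.0])

# A mean well above the median is a quick sign of right skew.
mean_above_median = duration.mean() > duration.median()

# pandas reports sample skewness; values above 0 indicate a right tail.
skew_before = duration.skew()
skew_after = np.sqrt(duration).skew()

print(f"mean > median: {mean_above_median}")
print(f"skewness before sqrt: {skew_before:.2f}, after sqrt: {skew_after:.2f}")
```

The square root compresses large values more than small ones, which is why it pulls in a long right tail. It reduces the skewness here but does not remove it entirely.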
VISUALIZATION

IndiGo is the most frequent airline, covering 49.48% of all records. Air Asia, Go First and Vistara are similar in range. FlyBig has the fewest records.
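Percentage shares like the one quoted above can be computed with value counts. The sketch below uses a made-up airline column, so its numbers do not match the figures in this presentation.

```python
import pandas as pd

# Hypothetical airline column; the counts are invented for illustration.
airlines = pd.Series(["IndiGo"] * 5 + ["Vistara"] * 3 + ["FlyBig"] * 2, name="Airline")

# normalize=True returns fractions; multiply by 100 for percentages.
share = (airlines.value_counts(normalize=True) * 100).round(2)
print(share)
```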
VISUALIZATION

The most common departure city is "New Delhi", which covers 31.91% of the records in the column.
"Mumbai" is second, covering 21.85% of the records in the column.
Other popular departure cities are "Bangalore", "Hyderabad" and "Kolkata".
The least common departure city is "Chennai".
VISUALIZATION

In the bar plot of departure hour vs airline, FlyBig has the highest departure hour while IndiGo has the lowest.
In the bar plot of arrival time vs airline, FlyBig has the highest arrival time while Vistara has the lowest.
In the bar plot of flight duration vs airline, Air Asia has the highest flight duration while Alliance Air has the lowest overall.
Comparing flight prices across airlines, Vistara has very high flight prices while FlyBig has the lowest fares.
VISUALIZATION

SpiceJet has the most non-stop flights.

Air India has the most 1-stop flights.
VISUALIZATION

Airfares on Vistara and Air India are high compared to other airlines.
Flights departing from cities like Chennai and Patna have a higher price range; the other cities fall in a similar, slightly lower range without a large difference.
Similarly, flights arriving in Port Blair and Dehradun have a high price range.
Considering layovers, direct flights are cheaper than flights with 1 or more stops.
OUTLIERS

A box plot (box-and-whisker plot) summarizes a data set and shows how the data is distributed and spread. It is built from five statistics: the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile) and maximum.
From the box plot we can see outliers in the "Duration" and "Price" columns.
Since Price is our target variable, there is no need to remove outliers in this column. We removed outliers from the Duration column using the z-score method.
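A z-score filter on Duration might look like the sketch below. The threshold of 3 is a common convention and an assumption here, since the presentation does not state the cut-off used. The data is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical durations: 20 ordinary flights plus one extreme value.
df = pd.DataFrame({"Duration": [2.0 + 0.1 * i for i in range(20)] + [30.0]})

# z-score = (value - mean) / standard deviation, using the population
# standard deviation (ddof=0), which is also the default in scipy.stats.zscore.
z = (df["Duration"] - df["Duration"].mean()) / df["Duration"].std(ddof=0)

# Keep only rows whose absolute z-score is below the assumed threshold of 3.
threshold = 3
cleaned = df[np.abs(z) < threshold]

print(f"rows before: {len(df)}, rows after: {len(cleaned)}")
```

On very small samples a single extreme value inflates the standard deviation enough that its own z-score may stay below 3, so this filter works best with a reasonable number of rows.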
CORRELATION

From the heat map and bar plot we can clearly observe the positive and negative correlation between the label and features.
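The correlation of each feature with the label can be computed as sketched below, assuming the categorical columns are label-encoded first as described in the EDA steps. The rows are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; only the column names follow the presentation.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Vistara", "IndiGo", "SpiceJet", "Vistara", "IndiGo"],
    "Total_Stops": [0, 1, 0, 0, 2, 1],
    "Duration": [2.0, 5.5, 2.5, 2.0, 9.0, 4.5],
    "Price": [4000.0, 9500.0, 4300.0, 3900.0, 14000.0, 7000.0],
})

# Label-encode the object column so it can enter the correlation matrix.
df["Airline"] = LabelEncoder().fit_transform(df["Airline"])

# Pearson correlation of every feature with the target, strongest first.
corr_with_price = (
    df.corr(method="pearson")["Price"].drop("Price").sort_values(ascending=False)
)
print(corr_with_price)
```

The full matrix from `df.corr()` is what a heatmap would display, and `corr_with_price` corresponds to the bar plot. Label-encoded categories have an arbitrary order, so their Pearson correlation should be read with care.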
DATA ANALYSIS STEPS DONE
I performed feature engineering steps such as feature extraction and feature selection to improve data normality and linearity.
Identified outliers using box plots and removed outliers in numerical variables.
Identified skewness using distribution plots and reduced it using a square root transformation.
Used Pearson’s correlation coefficient to check the correlation between dependent and independent variables. To visualize the correlation I used a heatmap and a bar plot.
I used StandardScaler to scale the data so that all features are on a comparable scale.
Split the data into train and test sets to build machine learning models. Found the best random state and the best R2 score. The model building process is shown in the following steps.
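As a small illustration of the scaling step, StandardScaler rescales each feature column to zero mean and unit variance. The feature matrix below is invented; its columns are only meant to resemble stops, duration and departure hour.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: stops, duration (hours), departure hour.
X = np.array([
    [0, 2.0, 6],
    [1, 5.5, 21],
    [0, 2.5, 13],
    [2, 9.0, 9],
], dtype=float)

# Each column becomes (value - column mean) / column standard deviation.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0).round(6))  # approximately 0 per column
print(X_scaled.std(axis=0).round(6))   # approximately 1 per column
```

A common variant fits the scaler on the training split only and then applies it to the test split, which keeps test-set statistics out of the training process.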
ASSUMPTIONS:
Firstly, from the problem statement we know that this is a regression problem, so we used regression algorithms to build the model and predict the price of flight tickets. The data was collected from the Yatra website using web scraping.
Secondly, from the distribution plots I found skewness in the Duration column, and from the box plots I found outliers in the target column and a categorical column. Based on the analysis and visualization, some of the features have a somewhat linear relation with the label, so I assumed these features help in model building and in predicting the price of flight tickets. This model also helps buyers understand the future price of flight tickets.
So I suggest that sellers and buyers take this model into consideration: the features deemed most important in this study might help them estimate the flight ticket price.
MODEL BUILDING:
In this problem “Price” is our target variable. It is continuous, and we need to predict the price of flight tickets, so this is a regression problem and I have used the following regression algorithms.
After pre-processing and data cleaning I was left with 11 columns including the target, and with the help of a feature importance bar graph I used the independent features for model building and prediction. The algorithms trained on the data are as follows:
i. Decision Tree Regressor
ii. Random Forest Regressor
iii. Extra Trees Regressor
iv. Gradient Boosting Regressor
v. Extreme Gradient Boosting Regressor (XGB)
vi. Bagging Regressor
vii. KNN Regressor
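A comparison loop over these regressors might look like the sketch below. It is an illustration under several assumptions: synthetic data from make_regression stands in for the scraped features, the hyperparameters are arbitrary, and the XGB regressor is left out so that the example runs with scikit-learn alone.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    BaggingRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 10 features and the Price label (assumption).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
    "Bagging": BaggingRegressor(random_state=0),
    "KNN": KNeighborsRegressor(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "R2": r2_score(y_test, pred),
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        # Mean R2 over 5 cross-validation folds on the full data.
        "CV_R2": cross_val_score(model, X, y, cv=5, scoring="r2").mean(),
    }

for name, scores in results.items():
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

Comparing the test R2 with the cross-validated R2 gives a rough check on whether a model's score depends heavily on one particular split.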

I found the best random state, the one giving the maximum R2 score, and then created a new train-test split with it to build the above models.
BEST RANDOM STATE
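A search for the random state with the highest R2 score could be sketched as follows. The model used for the search, the range of 20 states and the synthetic data are all assumptions, since the original slide content is not available here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and label (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

best_state, best_r2 = None, float("-inf")
for state in range(20):
    # A different random_state gives a different train/test partition.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=state
    )
    model = ExtraTreesRegressor(n_estimators=30, random_state=0)
    model.fit(X_train, y_train)
    score = r2_score(y_test, model.predict(X_test))
    if score > best_r2:
        best_state, best_r2 = state, score

print(f"best random state: {best_state}, R2: {best_r2:.3f}")
```

Choosing the split that maximises the test score makes the reported R2 optimistic, so a cross-validated score is a useful companion figure.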
HYPER PARAMETER TUNING

I used GridSearchCV to find the best parameters for the XGB Regressor, and then used the obtained parameters to compute the R2 score of the final model.
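A grid search of this kind might be set up as in the sketch below. To keep the example runnable with scikit-learn alone, it tunes GradientBoostingRegressor as a stand-in rather than the XGB regressor named above; xgboost's XGBRegressor follows the same estimator interface and could be passed to GridSearchCV in the same way. The parameter grid and the synthetic data are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared features and label (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical grid; the values actually searched are not given in the slides.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Every combination is scored with 3-fold cross-validation on the training set.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0), param_grid, cv=3, scoring="r2"
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set with the best parameters.
best_model = search.best_estimator_
test_r2 = best_model.score(X_test, y_test)

print("best parameters:", search.best_params_)
print(f"test R2 with best parameters: {test_r2:.3f}")
```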
SAVING THE MODEL AND PREDICTIONS USING SAVED MODEL

I saved my best model as a .pkl file as follows.


After saving the best model, I loaded the saved model and predicted the test values.
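Saving and reloading a model as a .pkl file could look like the sketch below. It uses the standard library's pickle module, which is an assumption, since the presentation only mentions the .pkl extension (joblib is another common choice). The file name, model settings and data are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data standing in for the prepared train and test sets (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=100)
X_train, X_test, y_train = X[:80], X[80:], y[:80]

model = ExtraTreesRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "flight_price_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the saved model back and predict on the held-out rows.
with open(path, "rb") as f:
    loaded = pickle.load(f)
predictions = loaded.predict(X_test)

print(predictions[:5])
```

Pickle files should only be loaded from trusted sources, and it is safest to load them with the same library versions that were used to save them.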
CONCLUSION
This case study aims to give an idea of how machine learning algorithms can be applied to predict the price of flight tickets. Through this project we gained insight into how to collect, pre-process, clean and analyse data and how to build a model.
First we collected the flight data from the website using web scraping. The framework used for web scraping was Selenium, which has the advantage of automating the data collection process. We collected about 5303 records containing the ticket price and other related features. The scraped data was then saved in an Excel file for further use and analysis.
Then we loaded the dataset and carried out data cleaning, EDA and pre-processing steps such as checking outliers, skewness and correlation, and scaling the data. Data visualization gave us better insights.
From the visualization we learned that flight ticket prices change between morning and evening. From the distribution plots we saw that ticket prices go up and down rather than staying fixed, and that the increases can be large. The plots show prices tending to rise as the time of day moves from morning to evening.
From the categorical plots (bar and box) we learned that early morning and late night flights are cheaper than flights during working hours. The categorical plots also show that ticket prices increase as the departure time approaches; that is, last-minute flights are very expensive. From the bar plot we learned that IndiGo and SpiceJet have almost the same ticket fares.
After separating our train and test data, we ran different ML regression algorithms to find the best performing model on the basis of metrics such as R2 score, MAE, MSE and RMSE. Extra Trees Regressor was the best model among all the models. On this basis we performed hyperparameter tuning to find the best parameters and improve the scores. The R2 score increased after tuning, so we concluded that Extra Trees Regressor is the best model.
