Presentation On Flight Price Prediction 2
IndiGo is the most frequently chosen airline, covering 49.48% of the total records. Air Asia, Go First and Vistara
fall in a similar range to one another. FlyBig has the fewest records.
VISUALIZATION
The most common departure city (source) is "New Delhi", covering 31.91% of the records in the column
"Mumbai" is second, covering 21.85% of the records in the column
Three other popular departure cities are "Bangalore", "Hyderabad" and "Kolkata"
The least common departure city is "Chennai"
VISUALIZATION
In the bar plot of Departure hour vs Airline, FlyBig has the highest (latest) departure hour while IndiGo has the
lowest (earliest)
In the bar plot of Arrival time vs Airline, FlyBig has the highest arrival time while Vistara has the lowest
In the bar plot of Flight duration vs Airline, Air Asia has the highest flight duration while Alliance Air has the
lowest overall
In the bar plot of Flight prices vs Airline, Vistara has very high flight prices while FlyBig has the lowest fare
VISUALIZATION
Airfares for Vistara and Air India are high compared to other airlines.
Flights departing from Chennai and Patna have a higher price range; the other departure cities sit in a similar,
slightly lower range, without a large difference between them.
Similarly, flights arriving in Port Blair and Dehradun have a high price range.
Considering layovers, direct flights are cheaper than flights with one or more stops.
OUTLIERS
A box plot (box-and-whisker plot) summarizes a data set and shows
how the data is distributed and spread. It is built from five summary
statistics: the minimum, the lower quartile (25th percentile), the
median (50th percentile), the upper quartile (75th percentile) and the
maximum. The box spans the lower to the upper quartile, with the
median marked inside it.
From the box plots we can see outliers in the "Duration" and
"Price" columns.
Since "Price" is our target variable, we did not remove outliers from
this column. We removed the outliers from the "Duration" column
using the Z-score method.
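The Z-score step can be sketched as follows. This is a minimal illustration, not the code used in the project: the data is synthetic, the helper name `zscore_mask` is hypothetical, and the threshold of 3 standard deviations is an assumed (commonly used) cut-off that the slides do not state.

```python
import numpy as np

def zscore_mask(values, threshold=3.0):
    """Return a boolean mask that is True where |z-score| <= threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) <= threshold

# Synthetic stand-in for the Duration (minutes) and Price columns.
rng = np.random.default_rng(0)
duration = np.append(rng.normal(120, 15, size=200), 900.0)  # one extreme duration
price = rng.uniform(2000, 9000, size=duration.size)

# Filter on Duration only; Price is the target, so its outliers are kept.
keep = zscore_mask(duration)
duration_clean, price_clean = duration[keep], price[keep]
print(duration.size, "rows before,", duration_clean.size, "rows after")
```

The same mask is applied to every column so that rows stay aligned after filtering.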
CORRELATION
From the heat map and bar plot we can clearly observe the positive and negative correlations between the label and the features.
DATA ANALYSIS STEPS DONE
I performed feature engineering steps such as feature extraction and feature selection to improve data normality and linearity.
Identified outliers using box plots and removed them from the numerical variables.
Identified skewness using distribution plots and reduced it using the square root transformation.
Used Pearson’s correlation coefficient to check the correlation between the dependent and independent variables, and visualized
the correlation with a heatmap and a bar plot.
Used the StandardScaler method to scale the data so that features with larger ranges do not bias the model.
Split the data into train and test sets to build the machine learning models, and found the best random state and the best R2
score. The model building process is shown in the following steps.
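The skewness, correlation and scaling steps listed above might look like the sketch below. It runs on synthetic data: the variable names, the coefficients used to generate `price`, and the `skewness` helper are all assumptions made for illustration and do not come from the project.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic, right-skewed durations and a price that depends on them.
rng = np.random.default_rng(42)
duration = rng.exponential(scale=120.0, size=1000)
stops = rng.integers(0, 3, size=1000).astype(float)
price = 3000 + 20 * duration + 1500 * stops + rng.normal(0, 500, size=1000)

# Square root transformation to reduce right skew.
duration_sqrt = np.sqrt(duration)
print("skew before:", round(skewness(duration), 2),
      "after:", round(skewness(duration_sqrt), 2))

# Pearson's correlation coefficient between a feature and the label.
r_duration = np.corrcoef(duration_sqrt, price)[0, 1]
print("Pearson r (sqrt duration vs price):", round(r_duration, 2))

# Standardize the features to zero mean and unit variance.
X = np.column_stack([duration_sqrt, stops])
X_scaled = StandardScaler().fit_transform(X)
```

In a full pipeline the scaler would usually be fitted on the training split only and then applied to the test split.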
ASSUMPTIONS:
Firstly, the problem statement shows that this is a regression problem, so we used regression algorithms to build the model
and predict the price of flight tickets, using data collected from the Yatra website by web scraping.
Secondly, from the distribution plots I found skewness in the Duration column, and from the box plots I found outliers in the
target column and the categorical column. Also, the analysis and visualization showed that some of the features have a
somewhat linear relation with the label, so I assumed these features help in building the model and predicting the price of
flight tickets. This model also helps buyers understand the future price of flight tickets.
So, I suggest that sellers and buyers take this model into consideration; the features found most important in this study
may help them estimate flight ticket prices.
MODEL BUILDING:
In this problem, “Price” is our target variable. It is continuous, and we need to predict the price of flight tickets.
From this I conclude that it is a regression problem, hence I used the following regression algorithms.
After pre-processing and data cleaning I was left with 11 columns including the target, and with the help of a feature importance
bar graph I used these independent features for model building and prediction. The algorithms trained on the data are as follows:
i. Decision Tree Regressor
ii. Random Forest Regressor
iii. Extra Trees Regressor
iv. Gradient Boosting Regressor
v. Extreme Gradient Boosting Regressor (XGB)
vi. Bagging Regressor
vii. KNN Regressor
I found the best random state and the maximum R2 score, and then created a new train-test split to build the above models.
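Training and comparing the listed regressors could be sketched as follows. The data here is synthetic (`make_regression`), the hyperparameters and the 70/30 split are assumptions, and the XGB regressor is left out so that the example depends only on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (BaggingRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cleaned flight data (10 features + target).
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Bagging": BaggingRegressor(random_state=0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: R2 = {score:.3f}")
```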
BEST RANDOM STATE
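One way to search for the best random state is to loop over candidate seeds for the train-test split and keep the seed with the highest R2 score. The sketch below assumes a decision tree as the scoring model, seeds 1 to 20, and a 70/30 split; none of these are stated on the slide, and `find_best_random_state` is a hypothetical helper name.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def find_best_random_state(X, y, candidates=range(1, 21), test_size=0.30):
    """Return (seed, r2) for the split seed that gives the highest test R2."""
    best_state, best_r2 = None, float("-inf")
    for state in candidates:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=state)
        model = DecisionTreeRegressor(random_state=0)
        model.fit(X_train, y_train)
        score = r2_score(y_test, model.predict(X_test))
        if score > best_r2:
            best_state, best_r2 = state, score
    return best_state, best_r2

# Synthetic stand-in for the cleaned flight data.
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)
best_state, best_r2 = find_best_random_state(X, y)
print(f"best random_state = {best_state}, R2 = {best_r2:.3f}")
```

Because the seed is chosen using the test score, the resulting R2 tends to be optimistic; a separate validation split or cross-validation gives a less biased estimate.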
HYPER PARAMETER TUNING
I used GridSearchCV to find the best parameters for the XGB Regressor, and then used the obtained parameters to
measure the performance of the final model.
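A minimal sketch of the tuning step is shown below. To keep it dependent on scikit-learn only, it swaps in `GradientBoostingRegressor` for the XGB Regressor; `XGBRegressor` from the xgboost package follows the same scikit-learn estimator interface and could be passed to `GridSearchCV` in the same way. The parameter grid, the 3-fold cross-validation and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned flight data.
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Assumed grid: the slides do not list the parameters that were searched.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)

# best_estimator_ is refitted on the full training split with the best parameters.
final_model = search.best_estimator_
test_r2 = r2_score(y_test, final_model.predict(X_test))
print(f"test R2 of tuned model: {test_r2:.3f}")
```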
SAVING THE MODEL AND PREDICTIONS
USING SAVED MODEL
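Saving a fitted model and loading it again for predictions can be sketched with Python's built-in `pickle` module. The slides do not say which serialization library was used, so `pickle`, the decision tree model, and the file name `flight_price_model.pkl` are assumptions here.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Fit a small model on synthetic data as a stand-in for the final model.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Save the model to disk (hypothetical file name, written to a temp directory).
model_path = Path(tempfile.mkdtemp()) / "flight_price_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load the saved model and use it to predict.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

predictions = loaded_model.predict(X[:5])
print(np.round(predictions, 2))
```

The loaded model should give the same predictions as the original, provided the same scikit-learn version is used for saving and loading.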