Final 43

Uploaded by

sohan9908053588
Copyright
© © All Rights Reserved

CHAPTER 1

INTRODUCTION
1.1 INTRODUCTION

The Indian aviation industry has undergone a significant transformation over the
past two decades. The liberalization of air travel, the rise of low-cost carriers (LCCs), and the
increasing middle-class population have made air travel more accessible. This report focuses
on analysing the fluctuations in Indian airline ticket prices, which is a critical aspect of
understanding the industry's economics.
The study aims to uncover how airlines in India, including both full-service carriers
(FSCs) and low-cost carriers, set their pricing structures. It considers factors such as route
demand, airline competition, operational costs, and regulatory interventions. The introduction
provides an overview of the sector, the importance of price analysis, and its implications for
the Indian economy. The Indian airline industry, one of the fastest-growing aviation markets
in the world, caters to a burgeoning population of over 1.4 billion people. This growth has
triggered fierce competition among airlines to capture market share, making pricing strategies
a central battleground. Over the years, the industry has seen multiple phases of expansion,
with newer airlines entering the market and older ones consolidating their positions through
mergers or strategic alliances. Factors like the growth of the Indian middle class, increased
disposable income, improved airport infrastructure, and government initiatives such as the
UDAN (Ude Desh ka Aam Nagrik) scheme have further fueled domestic air travel.
This section introduces how airlines in India use sophisticated revenue management
techniques to determine ticket prices, often balancing passenger demand, operational
efficiency, and fluctuating external costs. The introduction also emphasizes the challenges
posed by regulatory bodies, such as the Directorate General of Civil Aviation (DGCA), and
their policies on fares.

1.2 PROBLEM STATEMENT

Indian airline ticket prices are highly volatile, with significant price
fluctuations based on various factors, including the time of booking, seasonality, competition,
and external economic variables. The lack of a cohesive strategy across airlines leads to an
unpredictable pricing landscape, which frustrates consumers and limits market transparency.

The problem is exacerbated by the sensitivity of ticket prices to fuel costs (a major expense
for airlines), changes in government taxes (such as GST on airfares), and airport fees, all of
which increase overall operating expenses.
Because prices fluctuate this much, passengers find it difficult to predict and plan their
travel expenses. Ticket prices are highly dynamic and depend on factors such as the time of
booking, departure and arrival times, the airline carrier, and the class of service, which makes
it hard to identify the best time to purchase tickets at the lowest prices. This project aims to
analyze and predict airline ticket prices using machine learning models. The goal is to provide
insights and predictions that help passengers make informed decisions about their travel plans.

1.3 OBJECTIVE OF THE PROJECT

The primary objective of this project is to analyze and understand the factors influencing
airline ticket prices within the Indian aviation market, aiming to develop strategies for
optimizing pricing models. It focuses on identifying key variables such as fuel costs, market
demand, seasonal fluctuations, competition, and government regulations that significantly
impact airfare pricing. Additionally, the project investigates the pricing strategies of both full-
service carriers (FSCs) and low-cost carriers (LCCs), comparing them with global practices to
identify best approaches. Understanding consumer behavior is also a key aspect, as it explores
how passengers’ booking patterns and willingness to pay affect pricing decisions.
A further objective is to develop a website that provides users with insights and predictions
on airline ticket prices, helping them make informed decisions about when to book their flights.
The website will leverage historical data and machine learning algorithms to analyze and predict
ticket prices based on multiple factors.

1.4 PROJECT DOMAIN

Airline ticket price analysis is the study of pricing trends and patterns in the airline industry,
aimed at understanding the factors that influence the cost of air travel. Airlines use dynamic
pricing models, where ticket prices fluctuate based on demand, seat availability, and the time
of booking. Factors such as seasonality, route popularity, fuel prices, and airport taxes also
significantly impact ticket costs. Prices tend to rise during peak travel times like holidays and
weekends, while routes with high competition may offer lower fares. Airlines employ revenue
management systems that use historical data, booking patterns, and competitor pricing to
adjust fares in real time. Machine learning models and data analysis tools are often used to
predict pricing trends, helping airlines optimize revenue and offering travelers insights into
the best times to book tickets. Overall, airline ticket price analysis benefits airlines by
maximizing profits and passengers by enabling cost-effective travel planning.
Seasonal Trends: Analyze how prices vary with seasons, holidays, and special events. This
can help in understanding demand patterns.
Route Analysis: Study how prices differ for various routes and identify the most cost-effective
options for travellers.

1.5 SCOPE OF THE PROJECT

The scope of an Airline Ticket Price Analysis project involves collecting and analyzing
various factors that influence airline ticket prices, such as booking time, seasonality, route
competition, fuel costs, and airport fees. The project will develop predictive models using
machine learning algorithms to forecast future ticket prices and simulate dynamic pricing
strategies. It will also include data visualization to present pricing trends and offer insights to
both travelers and airlines.

FEATURES OF THE PROJECT

Data Collection: Gather historical and real-time data on airline ticket prices from various
sources.
Data Analysis: Use machine learning algorithms to analyze the data and identify patterns.
Price Prediction: Implement predictive models to forecast future ticket prices.
Alerts and Notifications: Allow users to set alerts for price drops or optimal booking times.
Comparison Tool: Enable users to compare prices across different airlines and booking
platforms.

1.6 METHODOLOGY

Data Collection: Gather historical ticket price data from various sources.
Exploratory Data Analysis (EDA): Perform EDA to understand data distribution and identify
patterns.
Model Development: Use machine learning algorithms (e.g., linear regression,
decision tree regressor) to build predictive models.
The Data Collection phase involves gathering historical airline ticket price data from
multiple sources, including airline websites, travel portals, and APIs. Tools like Selenium and
BeautifulSoup can be employed to scrape real-time or past pricing information, while public
datasets may provide additional data. Key features collected include the booking date, travel
date, airline, departure and arrival cities, cabin class, and ticket price. Ensuring the data is
diverse and representative is crucial to prevent bias and enhance the model's predictive
power.

Next, in the Data Preprocessing step, the gathered data undergoes thorough cleaning
and preparation. This process involves handling missing or incorrect values, normalizing
formats such as dates and times, and encoding categorical variables like airline names and
city codes into numerical values suitable for machine learning algorithms. Feature
engineering is also applied to create new variables that add value to the model, such as
calculating "days until departure" or identifying seasonal trends, which can improve the
overall predictive capability of the models.

The Exploratory Data Analysis (EDA) stage is
critical for understanding the underlying structure and characteristics of the dataset. Through
statistical analysis and visualizations, such as histograms, scatter plots, and heatmaps, key
patterns and relationships are identified. For instance, trends in how ticket prices fluctuate
based on airline, route, or day of the week may emerge. EDA also helps detect outliers and
informs decisions regarding feature selection, guiding the process of choosing the most
relevant factors for the predictive model.
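
The preprocessing and feature engineering steps described in this methodology can be pictured with a minimal sketch that uses only the Python standard library. The record layout, the field names (booking_date, travel_date, price), the DD/MM/YYYY date format, and the sample values are all assumptions made for illustration; they are not taken from the project's dataset.

```python
from datetime import datetime

# Hypothetical raw records; field names and the DD/MM/YYYY format are assumptions.
raw_records = [
    {"airline": "IndiGo", "booking_date": "01/03/2024", "travel_date": "15/03/2024", "price": "5,953"},
    {"airline": "Air India", "booking_date": "10/03/2024", "travel_date": "12/03/2024", "price": None},
    {"airline": "Vistara", "booking_date": "05/03/2024", "travel_date": "20/03/2024", "price": "7,425"},
]

DATE_FORMAT = "%d/%m/%Y"


def preprocess(records):
    """Drop rows with a missing price, normalize types, and add derived features."""
    cleaned = []
    for row in records:
        if row["price"] is None:
            continue  # handle missing values by skipping the row
        booked = datetime.strptime(row["booking_date"], DATE_FORMAT).date()
        travel = datetime.strptime(row["travel_date"], DATE_FORMAT).date()
        cleaned.append({
            "airline": row["airline"],
            "price": float(row["price"].replace(",", "")),
            "days_until_departure": (travel - booked).days,
            "travel_weekday": travel.weekday(),  # 0 = Monday
        })
    return cleaned


features = preprocess(raw_records)
```

Skipping rows with a missing price is only one possible policy; imputing a value is an equally reasonable choice depending on how much data is lost.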

CHAPTER 2
LITERATURE REVIEW

Paper [1] Focuses On Clustering-Based Modular Approach: The methodology involves
segmenting the dataset into clusters based on similar characteristics (e.g., flight routes,
weather conditions). This clustering is intended to simplify the problem by dealing with more
homogeneous subsets of data.
Deep Neural Networks (DNNs): After clustering, a modular approach is applied where each
cluster is fed into a dedicated deep neural network model. The DNNs are trained to learn the
specific patterns and relationships within each cluster, leading to more accurate predictions.
Modular Integration: The predictions from each DNN are then integrated to provide a final
prediction for the flight arrival time. This modular integration helps in leveraging the
strengths of the individual cluster-specific models.

Paper [2] Focuses On Multi-Scale Temporal Convolutional Networks (TCNs): The model
uses multiple TCNs to handle flight data with varying sampling frequencies, allowing it to
capture detailed temporal patterns in different flight parameters.
Interpretability with Class Activation Mapping (CAM): A key innovation in IMTCN is the
adaptation of CAM, commonly used in image classification, to provide interpretability in
flight safety predictions. This allows the model to highlight specific parameters and moments
in the flight data that contribute most to potential safety incidents.
Performance: The model has been tested on a real-world dataset of 37,943 Airbus A320
flights, where it outperformed baseline models in predicting exceedances (safety incidents) 2
to 4 seconds in advance. The case studies included in the research demonstrate the model's
ability to offer clear insights into the causes of these incidents, which is crucial for both airline
operators and pilots.

Paper [3] Focuses On Data Collection: The study likely uses historical flight data, including
features such as departure and arrival times, weather conditions, aircraft type, and other
relevant variables that could influence flight delays.
Feature Selection: The paper emphasizes the importance of selecting the most relevant
features that significantly impact delay predictions. This may involve statistical methods,
domain knowledge, and feature engineering techniques.

Paper [4] Focuses On Fracture Mechanics-Based Models: These use the principles of fracture
mechanics to model crack propagation. Examples include Paris’ Law, which relates crack
growth rate to the range of the stress intensity factor.
Bayesian Methods: Incorporate uncertainty and prior knowledge into predictions, providing a
probabilistic estimate of flaw growth.
Machine Learning Algorithms: Techniques such as Support Vector Machines (SVM), Neural
Networks, and Random Forests are used to predict flaw growth by learning patterns from
historical data.
Finite Element Analysis (FEA): Used to simulate the stress distribution in
aircraft wings and predict how flaws will grow under different conditions.

Paper [5] Focuses On Neural Decomposition: The paper likely proposes a method using
neural networks (such as LSTMs, GRUs, or more sophisticated architectures) to automatically
learn and decompose time-series data into components such as trend and seasonality.
Architecture: The architecture could include separate neural networks or modules for each
component (e.g., one for trend, one for seasonality) or a unified model that learns these
components simultaneously.
Effective Generalization: Generalization refers to the model's ability to perform well on
unseen data, not just the data it was trained on. By decomposing the time-series data into more
predictable components, the model can generalize better to new datasets or future data points.
Transfer Learning: The model might leverage knowledge from one dataset (source) and apply it
to another (target), enhancing its ability to generalize across different time-series datasets.

CHAPTER 3
PROJECT DESCRIPTION
3.1 EXISTING SYSTEM

The existing system for tracking and analyzing airline ticket prices relies primarily on a
combination of manual searches, third-party comparison websites, and services offered by
airlines themselves. These systems often provide limited historical data and can be
inefficient for users who need real-time updates and comprehensive analytics. While they
provide a basic level of functionality, they come with several limitations.

3.2 PROPOSED SYSTEM

The proposed system aims to automate the process of collecting and analyzing airline ticket
prices using Selenium and Python. This system will leverage web scraping to automatically
gather price data from multiple airline and travel websites at regular intervals, storing this
data in a structured format within a database. With Python's powerful data analysis libraries
such as Pandas, NumPy, and Matplotlib, the system will provide comprehensive analytics
and visualizations, enabling users to identify trends, forecast future prices, and make
informed purchasing decisions. A user-friendly web interface developed with Flask or
Django will allow users to easily query data, view insights, and set up dynamic alerts for
price drops.
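
One way to picture the storage and alerting parts of the proposed system is the sketch below. It uses Python's built-in sqlite3 module in place of a MySQL or PostgreSQL server purely to stay self-contained. The table layout, the function names, the route code, and the sample fares are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for the production database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE price_snapshots (
           route TEXT, airline TEXT, scraped_at TEXT, price REAL
       )"""
)


def store_snapshot(route, airline, scraped_at, price):
    """Persist one scraped fare observation."""
    conn.execute(
        "INSERT INTO price_snapshots VALUES (?, ?, ?, ?)",
        (route, airline, scraped_at, price),
    )


def price_dropped_below(route, threshold):
    """Return True if the most recent fare on a route is under the user's threshold."""
    row = conn.execute(
        "SELECT price FROM price_snapshots WHERE route = ? "
        "ORDER BY scraped_at DESC LIMIT 1",
        (route,),
    ).fetchone()
    return row is not None and row[0] < threshold


# Hypothetical observations collected at regular intervals.
store_snapshot("DEL-BOM", "IndiGo", "2024-03-01T09:00", 6200.0)
store_snapshot("DEL-BOM", "IndiGo", "2024-03-01T15:00", 5400.0)

alert = price_dropped_below("DEL-BOM", threshold=5500.0)
```

Storing timestamps as ISO 8601 strings keeps them sortable as plain text, which is why the ORDER BY clause finds the latest snapshot without date parsing.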

3.2.1 Advantages

Time-Saving: Automated data collection eliminates the need for manual searches, saving
time for users.

Real-Time Updates: Users receive the most current pricing information.

All-in-One View: Aggregates data from multiple sources, providing a comprehensive view
of the market.

Historical Insights: Analyzes historical data to identify trends and forecast future prices.

Cost Savings: Shows cost-reduction options.

Best Prices: Helps users find the cheapest flights available.

Dynamic Alerts: Users can set up alerts for price drops, ensuring they purchase tickets at the
optimal time.

Adaptable: Easily scalable to include more airlines and routes as needed.

3.3 FEASIBILITY STUDY


The economic feasibility of the project shows that while there are initial development and
operational costs, the potential cost savings for users and revenue opportunities from
premium features or affiliate marketing outweigh these expenses. Technically, the project is
feasible given the availability of robust tools and libraries such as Selenium for web scraping
and Python for data analysis. Socially, the system is likely to be well-received due to the high
demand for affordable travel options among frequent travelers and businesses, with its user-
friendly interface and valuable insights positively impacting user acceptance.

3.3.1 ECONOMIC FEASIBILITY


Cost Analysis:
Development Costs: Initial investment in developing the system, including software,
hardware, and personnel.
Maintenance Costs: Ongoing costs for maintaining the system, updating software, and
managing data storage.
Operational Costs: Costs associated with running web servers, databases, and network
infrastructure.

Benefit Analysis:

Savings for Users: Significant cost savings for users by identifying cheaper flights.

Revenue Opportunities: Potential revenue from premium features, advertising, or affiliate
marketing.

Overall, the economic benefits, primarily user cost savings, outweigh the
development and operational costs, making the project economically feasible.

3.3.2 TECHNICAL FEASIBILITY

TECHNOLOGY STACK

Programming Languages: Python for scripting and data analysis.

Libraries and Tools: Selenium for web scraping, BeautifulSoup for parsing HTML, Pandas
for data manipulation, and Matplotlib/Seaborn for visualization.

Database Management: MySQL or PostgreSQL for storing collected data.

Web Frameworks: Flask or Django for developing the user interface.

3.3.3 SOCIAL FEASIBILITY

User Acceptance:

Usability: Intuitive interface and clear visualizations ensure user-friendliness.

Market Demand: High demand for affordable travel options among frequent travelers,
students, and businesses.

Impact:

Positive Impact: Empowering users with data-driven insights leads to better purchasing
decisions.

Community Building: Potential for creating a community of users sharing tips and alerts.

3.4 SYSTEM SPECIFICATIONS


3.4.1 HARDWARE SPECIFICATIONS
Processor: Intel i5 or higher for development and data processing tasks.
RAM: Minimum 8 GB, preferably 16 GB for handling large datasets.
Storage: 256 GB SSD or higher to ensure fast read/write operations.
Internet Connection: High-speed broadband for efficient web scraping and data
transmission.
Servers: Cloud-based servers (e.g., AWS, Azure) for deployment and scalability

3.4.2 SOFTWARE SPECIFICATIONS
Operating System: Compatible with Windows, macOS, or Linux.
Programming Language: Python 3.x for scripting and development.
LIBRARIES:
Selenium: For web scraping and browser automation.

BeautifulSoup: For parsing and extracting data from HTML.

Pandas and NumPy: For data manipulation and analysis.

Matplotlib and Seaborn: For creating visualizations.

Scikit-Learn: For machine learning and predictive analytics (optional).

Database: MySQL or PostgreSQL for structured data storage.

Web Framework: Flask or Django for developing the user interface.

CHAPTER 4
PROPOSED WORK

The Airline Ticket Price Analysis project aims to predict airline ticket prices based on
various factors such as the date of booking, time of travel, destination, and airline company.
Ticket prices are highly volatile and depend on several features, including seasonality,
demand, and the timing of the booking relative to the flight. By building a predictive model
using machine learning algorithms like Random Forest and Decision Tree Regression, this
project will provide insights into ticket pricing patterns, allowing users to make informed
decisions about when to book flights. The project leverages tools such as Python, Selenium,
NumPy, Pandas, and Power BI to scrape, process, and visualize data for analysis.

4.1 GENERAL ARCHITECTURE

Figure 4.1: Architecture Diagram

The architecture diagram in Figure 4.1 outlines the workflow of a flight fare prediction
system, illustrating the various stages involved in building and deploying a machine learning
model. The system begins with the collection of a flight fare dataset, which serves as the
foundation for the analysis.

Data preprocessing is a crucial step to ensure data quality and
consistency. This involves tasks such as handling missing values, removing outliers, and
transforming data into a suitable format for machine learning algorithms. The preprocessed
data is then subjected to feature selection to identify the most relevant features that
contribute significantly to predicting flight fares.

The dataset with the selected features is divided into
training and testing sets. The training data is used to train the machine learning model,
while the testing data is used to evaluate its performance.

A random forest algorithm is employed as the machine learning model, chosen for
its ability to handle complex relationships and reduce overfitting.
The model is trained on the training data, learning patterns and relationships between the
features and the target variable (flight fare). After training, the model's performance is
evaluated on the testing data to assess its accuracy and generalization capabilities. Metrics
such as mean squared error (MSE) or root mean squared error (RMSE) can be used to
quantify the model's prediction errors.

If the model's performance is satisfactory, it can be
deployed into a production environment for real-time flight fare predictions. The deployed
model can be integrated with a user interface (UI) that allows users to input relevant flight
details and receive fare predictions.
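
The split, train, and evaluate stages of this architecture might look roughly like the following scikit-learn sketch. The data is synthetic: the three feature names and the fare rule used to generate prices are assumptions invented so the example runs on its own, and they say nothing about the project's real dataset or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a preprocessed flight fare dataset (assumption).
rng = np.random.default_rng(seed=0)
n_samples = 200
days_left = rng.integers(1, 50, size=n_samples)
stops = rng.integers(0, 3, size=n_samples)
duration_hours = rng.uniform(1.0, 12.0, size=n_samples)
X = np.column_stack([days_left, stops, duration_hours])
# Made-up fare rule plus noise, only so the model has something to learn.
y = 3000 + 400 * duration_hours + 800 * stops - 30 * days_left + rng.normal(0, 200, n_samples)

# Divide the data into training and testing sets (80% / 20%).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Evaluate on held-out data with MSE and RMSE.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))
print(f"MSE: {mse:.1f}, RMSE: {rmse:.1f}")
```

RMSE is reported alongside MSE because it is expressed in the same unit as the fare, which makes the error easier to interpret.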

4.2 DESIGN PHASE

4.2.1 DATA FLOW DIAGRAM

Figure 4.2: Data Flow Diagram

Figure 4.2 represents the data flow diagram of our project. It is a logical data
flow diagram (DFD) depicting the process of a customer purchasing items at a retail store.
The DFD outlines the sequence of steps involved, starting from the customer's initial
intention to purchase and culminating in the settlement of the transaction and issuance of a
payment receipt.

The process begins with the customer, who provides a list of items they
wish to purchase. This information is then transmitted to the "Identify Item" process, where
the items are verified and their corresponding item IDs are retrieved. Subsequently, the item
IDs are sent to the "Look up Prices" process, where the prices associated with each item are
obtained from the "Prices" data store.

4.2.2 UML DIAGRAM

Figure 4.3: UML Diagram

Figure 4.3 represents the UML diagram of our model. The UML diagram presents a
comprehensive model for an airline ticket price analysis system. It includes entities such as
flights, airlines, airports, passengers, and accounts, along with their relationships. The
diagram also defines data types and enumerations to represent specific values like flight
status, payment status, and seat class.

4.2.3 USE CASE DIAGRAM

Figure 4.4: Use Case Diagram

Figure 4.4 represents the use case diagram of our model. It depicts the interactions
between various actors and use cases within a travel agency system, providing a high-level
overview of the system's functionality and the relationships between different actors and
their roles.
The central use case, "Book a tour," represents the core functionality of the system. This use
case involves multiple steps, including "Book airline ticket," "Reserve a seat," and "Arrange
tour." The "Book airline ticket" and "Arrange tour" use cases are further decomposed into
sub-use cases, indicating that they involve additional activities or steps. The "Customer"
actor plays a key role in the system, interacting with various use cases. They can "Book a
tour," "Pay travel agent," and ultimately "Pay for tour." The "Travel agent" actor is
responsible for facilitating the booking process, including "Booking airline tickets,"
"Delivering airline tickets," and receiving "Travel agent commission."

4.2.4 SEQUENCE DIAGRAM

Figure 4.5: Sequence Diagram

Figure 4.5 represents the sequence diagram, which illustrates the process of
booking a flight through a website. The user begins by searching for flights on their device,
which triggers a redirection to the server. The user then enters their flight requirements, and
the server searches the website database for available options. Once the user selects a flight,
they are prompted to enter their payment information. After successful payment, the user
receives a confirmation and can print or download a copy of their ticket. The diagram
highlights the interaction between the user, device, server, website, and website database
throughout the booking process.

4.3 PROPOSED MODULES


• Machine Learning models
• Random Forest Regressor
• Decision Tree Regressor
• Feature selection and engineering

MACHINE LEARNING MODELS
The airline dataset included the following eight characteristics:
departure and arrival times, type of airline, number of stoppages, source,
destination, and additional information. We performed prediction using regression
machine learning models, including LGBM Regressor, Random Forest Regressor,
and Decision Tree Regressor. Random Forest Regression is a supervised learning
algorithm that uses an ensemble learning method for regression. Ensemble learning is
a technique that combines predictions from multiple machine learning models to make
a more accurate prediction.

RANDOM FOREST REGRESSOR


Random Forest regressor uses multiple decision trees to perform regression tasks.
Random forest is a supervised learning algorithm which uses an ensemble learning approach
for classification and regression. Decision trees are sensitive to the specific data on which
they are trained. If the training data is changed, the resulting decision tree can be quite
different, and in turn the predictions can be quite different. Decision trees can also be
computationally expensive to train, carry a big risk of overfitting, and tend to find local
optima because they cannot go back after they have made a split. To address these
weaknesses, we turn to Random Forest. Ensemble learning models work just like a group
of diverse experts teaming up to make decisions.
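
The "group of experts" idea can be checked directly in scikit-learn: for regression, the forest's prediction is the mean of the predictions of its individual trees. The sketch below demonstrates this on synthetic data; the data, the feature count, and the number of trees are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic two-feature regression problem (assumption).
rng = np.random.default_rng(seed=1)
X = rng.uniform(0, 10, size=(100, 2))
y = 50 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 5, size=100)

forest = RandomForestRegressor(n_estimators=25, random_state=1)
forest.fit(X, y)

# Ask every tree in the ensemble for its own prediction, then average them.
X_new = np.array([[2.0, 3.0], [7.5, 1.0]])
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(averaged)
print(forest.predict(X_new))
```

Each tree is trained on a different bootstrap sample, so individual trees disagree; averaging smooths out that variation.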

DECISION TREE REGRESSOR


Decision tree regressor (DecisionTreeRegressor) is a class in scikit-learn's tree module that
can be used to train and predict regression models. It is a decision-tree-based algorithm that
recursively partitions the input data based on the values of the input features, forming a
tree-like structure. It initially chooses independent variables from the dataset as decision
nodes and then divides the entire dataset into sub-sections. When test data is passed to
the model, the output is decided by checking which sub-section the data point belongs to; the
decision tree then outputs the average value of all the training data points in that sub-section.
Tree-based models are known for their ability to handle complex data, and combining them in
an ensemble reduces overfitting.
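
To make the "average of the sub-section" behaviour concrete, here is a minimal sketch of a depth-one regression tree (a stump) in plain Python. It is a simplified illustration, not scikit-learn's implementation: it handles a single feature, tries only observed values as thresholds, and the function names and sample fares are hypothetical.

```python
def fit_stump(xs, ys):
    """Find the threshold on one feature that minimizes total squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    """Route the point to its sub-section and return that sub-section's average."""
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


# Hypothetical data: days left before departure vs. fare.
days_left = [1, 2, 3, 20, 25, 30]
fares = [9000, 8800, 9200, 5000, 5200, 4800]

stump = fit_stump(days_left, fares)
print(stump)                     # (20, 9000.0, 5000.0)
print(predict_stump(stump, 2))   # 9000.0
print(predict_stump(stump, 28))  # 5000.0
```

A full decision tree repeats the same split search recursively inside each sub-section until a stopping rule, such as a maximum depth, is reached.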

FEATURE SELECTION AND ENGINEERING
In this step, the features of our model are extracted and all the relevant features are used for
model training. The dataset contains date of journey, arrival date, and departure date columns,
from which numerical values are extracted as departure hour, departure minutes, arrival
hour, arrival minutes, journey day, and journey month. As the dataset contains both categorical
and numerical features, one-hot encoding was used for nominal categorical data and
label encoding for ordinal categorical data to convert the categorical values to
numerical values. The categorical variables in the dataset include airline, source,
destination, route, total number of stops, and additional info.
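
A small pandas sketch of these extraction and encoding steps is given below. The column names (Airline, Date_of_Journey, Dep_Time, Total_Stops), the value formats, and the sample rows are assumptions for illustration. The ordinal column is encoded with an explicit mapping so that the numeric order matches the number of stops; scikit-learn's LabelEncoder would instead assign codes in sorted label order.

```python
import pandas as pd

# Hypothetical rows; column names and value formats are assumptions.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Date_of_Journey": ["24/03/2019", "01/05/2019", "09/06/2019"],
    "Dep_Time": ["22:20", "05:50", "09:25"],
    "Total_Stops": ["non-stop", "2 stops", "1 stop"],
})

# Split date and time columns into numeric parts.
journey = pd.to_datetime(df["Date_of_Journey"], format="%d/%m/%Y")
df["journey_day"] = journey.dt.day
df["journey_month"] = journey.dt.month
dep = pd.to_datetime(df["Dep_Time"], format="%H:%M")
df["dep_hour"] = dep.dt.hour
df["dep_minute"] = dep.dt.minute

# Label-encode the ordinal column with an explicit order.
stop_order = {"non-stop": 0, "1 stop": 1, "2 stops": 2}
df["Total_Stops"] = df["Total_Stops"].map(stop_order)

# One-hot encode the nominal column, then drop the raw text columns.
df = pd.get_dummies(df, columns=["Airline"])
df = df.drop(columns=["Date_of_Journey", "Dep_Time"])
```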

CHAPTER 5
DATASETS

5.1 BUSINESS CLASS TICKETS DATASETS

Figure 5.1: Business Class Dataset

5.2 ECONOMY CLASS TICKETS DATASETS

Figure 5.2: Economy Class Dataset

5.3 OVERALL FLIGHTS DATASETS

Figure 5.3: Indian Airline Dataset

Figures 5.1, 5.2, and 5.3 show the datasets. The dataset will be created by scraping web
content from different travel websites. The Search Engine Results - Flights & Tickets
Keywords Dataset will also be used, as it provides rankings for the world's top destinations on
Google. The dataset undergoes rigorous preprocessing, involving data cleaning, normalization,
and feature selection to ensure its quality and relevance.

5.4 BUILDING THE MODEL

After preparing and splitting the data, various machine learning models are built to predict
airline ticket prices. For this project, regression techniques (such as Linear Regression,
Random Forest Regressor, and Decision Tree Regressor) are employed to model the
relationships between input features (e.g., departure location, destination, travel date, and
booking time) and the output variable (ticket price). Python’s scikit-learn library provides the
tools to create these models, and GridSearchCV or other optimization techniques can be used
to find the best model parameters. The choice of model and hyperparameter tuning are critical
to balancing accuracy, interpretability, and performance.
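
As a hedged illustration of hyperparameter search with GridSearchCV, the sketch below tunes a decision tree regressor on synthetic data. The candidate values in the grid, the choice of scoring metric, and the generated data are assumptions; the project's actual grid and results are not stated in this report.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic features and prices (assumption), standing in for the prepared training data.
rng = np.random.default_rng(seed=42)
X_train = rng.uniform(0, 1, size=(120, 3))
y_train = 4000 + 3000 * X_train[:, 0] - 1500 * X_train[:, 1] + rng.normal(0, 100, size=120)

# Candidate hyperparameters; the values are illustrative only.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the negation
    cv=5,
)
search.fit(X_train, y_train)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_params, round(best_rmse, 1))
```

The search evaluates every combination in the grid with 5-fold cross-validation, so the cost grows with the product of the list lengths (here 3 x 2 = 6 candidates).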

5.5 TESTING THE MODEL

Once the models have been built, they are evaluated on the test dataset to assess their
predictive performance. This step ensures that the model performs well on unseen data and
generalizes beyond the training dataset. Performance metrics such as Mean Absolute Error
(MAE), Root Mean Square Error (RMSE), and R-squared are computed to gauge the
accuracy of the predictions. Visualization tools like Matplotlib and Seaborn can be used to
plot predicted vs actual prices, revealing how well the model captures the underlying
patterns in ticket prices. Any underperforming models may be revisited for additional
feature engineering or model tuning to improve accuracy.
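
The three metrics named in this section can be written out in a few lines of plain Python, which makes their definitions explicit. The actual and predicted fares below are hypothetical values chosen so the arithmetic is easy to follow.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted fares for three test flights.
actual = [4000.0, 6000.0, 8000.0]
predicted = [4200.0, 5900.0, 8300.0]

print(mae(actual, predicted))                   # 200.0
print(round(rmse(actual, predicted), 2))        # 216.02
print(round(r_squared(actual, predicted), 4))   # 0.9825
```

RMSE is larger than MAE here because squaring gives the 300-rupee miss more weight than the 100-rupee miss.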

5.6 IMPLEMENTING THE MODEL

Once a satisfactory model is identified, it can be implemented into a user-friendly system. In
this case, the model's results may be visualized using Power BI, a business
analytics tool that transforms data into interactive dashboards and reports. Users can explore
predictions based on different parameters such as destination, travel date, or booking time,
helping consumers or businesses make informed decisions about ticket purchasing.
Additionally, the model may be deployed in a web-based application or integrated with APIs
for real-time price prediction, making it a practical tool for airlines or travel agencies.

CHAPTER 6
IMPLEMENTATION AND TESTING

The code presented in this chapter takes a comprehensive approach to Airline Ticket Price Analysis, utilizing Selenium for
web scraping, followed by extensive data manipulation, cleaning, and visualization using
Pandas, Matplotlib, and Seaborn libraries. The purpose of this analysis is to gain deeper
insights into airline pricing trends, providing a better understanding of factors that influence
ticket prices, such as travel dates, airline selection, travel class, stops, and even the duration of
the flight. This can be helpful for both travelers looking to optimize their bookings and for
businesses in the travel industry interested in market analysis.
Step 1: Web Scraping Using Selenium
Once the results are loaded, the script loops through each flight result using the find_elements()
function, locating details such as airline name, price, duration, departure time, arrival time,
number of stops, travel class, and cities of origin and destination. This information is stored
in a Python list, which is later converted into a Pandas DataFrame for easier manipulation and
analysis.

Figure 6.1: Web Scraping Using Selenium

Step 2: Data Manipulation and Cleaning
After scraping the required data, the script uses Pandas to structure the data and prepare it for
analysis. The raw data collected from the webpage is stored in a DataFrame, and one of the
critical steps in this phase is cleaning the price column. Since prices usually contain currency
symbols (such as "$"), the code uses replace() to remove these symbols, allowing for easier
numerical manipulation. The price data is then converted to a float type for subsequent analysis.
The cleaned and formatted data is saved into a CSV file for future use and further analysis.
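
The price-cleaning step can be sketched in pandas as follows. The column names, the sample rows, and the output file name are assumptions; the project's own script appears only as a figure, so this is an illustration of the described approach rather than a copy of it.

```python
import pandas as pd

# Hypothetical scraped rows; the "$" symbol follows the description in the text.
rows = [
    {"airline": "IndiGo", "price": "$5,953"},
    {"airline": "Air India", "price": "$12,500.50"},
]
df = pd.DataFrame(rows)

# Strip everything except digits and the decimal point, then convert to float.
df["price"] = df["price"].str.replace(r"[^\d.]", "", regex=True).astype(float)

# Save the cleaned data for later analysis (file name is an assumption).
df.to_csv("flights_cleaned.csv", index=False)
```

Using a regular expression rather than replacing one fixed symbol also removes thousands separators and any other currency marker that may appear in scraped text.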

Figure 6.2: Data Manipulation

Step 3: Data Analysis and Visualizations


This section of the code focuses on data exploration and visualization using Matplotlib and
Seaborn libraries to present key trends and patterns in the airline ticket prices. Below are some
of the key visualizations created:
Ticket Price Range by Class of Travel: This visualization displays how ticket prices vary
according to the travel class, such as Economy, Business, or First Class. A strip plot is used to
show the distribution of ticket prices across different travel classes, which highlights how much
more expensive premium classes are compared to economy fares.

Figure 6.3: Price Range According To Class Of Travel
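
The gap between classes that the strip plot shows visually can also be put into numbers. The following is a minimal sketch on made-up fares, assuming the column names class and price from the report's script; premium_ratio is a hypothetical name for the ratio of median fares.

```python
import pandas as pd

# Made-up fares; real values would come from the scraped CSV file.
df = pd.DataFrame({
    "class": ["Economy", "Economy", "Economy", "Business", "Business"],
    "price": [3000.0, 4500.0, 6000.0, 25000.0, 40000.0],
})

# Minimum, median, and maximum fare per travel class.
summary = df.groupby("class")["price"].agg(["min", "median", "max"])

# How many times more expensive the median Business fare is than Economy.
premium_ratio = summary.loc["Business", "median"] / summary.loc["Economy", "median"]

print(summary)
print(round(premium_ratio, 2))
```

The median is used instead of the mean so that a few very expensive listings do not dominate the comparison.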

Ticket Availability by Class: This count plot shows how many of the scraped listings fall into
each travel class. It indicates whether certain classes (such as economy) dominate the
listings, while premium classes such as business or first class are less common.

Figure 6.4: Ticket Availability
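
The counts behind the count plot can be obtained directly with value_counts(). This is a small sketch on a made-up column of class labels; the 7-to-3 split is illustrative only.

```python
import pandas as pd

# Made-up class labels for ten listings.
classes = pd.Series(["Economy"] * 7 + ["Business"] * 3, name="class")

counts = classes.value_counts()                 # absolute number of listings
shares = classes.value_counts(normalize=True)   # fraction of all listings

print(counts)
print(shares)
```

Reporting shares alongside raw counts makes results from routes with different numbers of listings comparable.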

Price vs. Flight Duration for Different Airlines: Using a scatter plot, this visualization
demonstrates the relationship between the duration of the flight and the ticket price. Longer
flights tend to be more expensive, but this can vary by airline. It’s especially useful in
identifying airlines that may offer cheaper long-haul flights or premium short-haul services.

Figure 6.5: Price Vs Airlines.
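
A scatter plot of price against duration needs the duration as a number, whereas scraped durations are text. The sketch below assumes the site renders durations in a form like "2h 10m"; both that format and the helper name duration_to_hours are assumptions, not something stated in the report.

```python
import re


def duration_to_hours(text):
    """Convert strings such as '2h 10m', '45m' or '3h' to hours as a float."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours + minutes / 60


for sample in ("2h 10m", "45m", "3h"):
    print(sample, "->", round(duration_to_hours(sample), 2))
```

Applied to a DataFrame, this could be used as df['duration'].apply(duration_to_hours) before plotting.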

Price of Tickets Based on Time of Departure and Arrival: This set of box plots displays the
variation in ticket prices based on the time of departure and arrival. It’s common for flights at
certain times of day (such as early morning or late at night) to be cheaper, and this visualization
helps identify such trends.

Figure 6.6: Prices Of Ticket Based On Time
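
Box plots by time of day are easier to read when exact clock times are grouped into a few periods. The following sketch maps an "HH:MM" string to a label; the four-hour boundaries, the label names, and the function name time_of_day are assumptions chosen for illustration.

```python
def time_of_day(hhmm):
    """Map an 'HH:MM' string to a coarse time-of-day label."""
    hour = int(hhmm.split(":")[0])
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hhmm!r}")
    if hour < 4:
        return "Late Night"
    if hour < 8:
        return "Early Morning"
    if hour < 12:
        return "Morning"
    if hour < 16:
        return "Afternoon"
    if hour < 20:
        return "Evening"
    return "Night"


for sample in ("00:15", "05:30", "12:00", "23:10"):
    print(sample, "->", time_of_day(sample))
```

With six categories on the x-axis, each box summarizes enough listings to make early-morning or late-night discounts visible.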

Airline Ticket Prices Based on Days Left Before Buying: A crucial insight derived from the
analysis is the influence of booking timing on ticket prices. This regression plot shows how
ticket prices change depending on how early the tickets are booked. The trend suggests that
prices tend to increase as the travel date approaches, reinforcing the well-known strategy of
booking in advance for lower prices.

Figure 6.7: Airline Ticket Prices Based On Days Left Before Buying
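
The regression plot relies on a days_left value for every listing. One way to derive it, sketched here on made-up quotes, is to subtract the date on which the price was collected from the travel date; the column names quote_date and travel_date are hypothetical.

```python
import pandas as pd

# Made-up price quotes for a single travel date.
quotes = pd.DataFrame({
    "quote_date": pd.to_datetime(["2024-09-01", "2024-09-21", "2024-09-30", "2024-09-30"]),
    "travel_date": pd.to_datetime(["2024-10-01"] * 4),
    "price": [4000.0, 5500.0, 9000.0, 11000.0],
})

# Whole days between collecting the quote and the day of travel.
quotes["days_left"] = (quotes["travel_date"] - quotes["quote_date"]).dt.days

# Average price for each value of days_left, as used for the regression plot.
mean_by_days = quotes.groupby("days_left")["price"].mean()

print(quotes[["days_left", "price"]])
print(mean_by_days)
```

Averaging per days_left value before plotting smooths out differences between individual airlines and routes.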

Figure 6.8: Price Of Tickets Based On Time Of Departure And Arrival

CHAPTER 7
RESULTS AND DISCUSSIONS

7.1 EFFICIENCY OF THE PROPOSED SYSTEM

The proposed system for airline ticket price analytics using Selenium and Python has proven
to be highly efficient compared to traditional manual methods and existing automated
solutions. By integrating web scraping with Selenium, the system automates the real-time
collection of data from multiple airline websites, drastically reducing the time and effort
required for data gathering. This automation has led to significant time savings, as tasks that
would take hours manually are now completed in minutes. Additionally, the system ensures
high data accuracy by scraping information directly from airline websites, minimizing errors
associated with outdated or manually entered data. The comprehensive coverage of multiple
airlines and routes surpasses the limitations of third-party comparison websites, offering a
broader market view. Furthermore, the system's modular design allows for easy scalability,
enabling the addition of new airlines or routes with minimal changes to the existing
codebase. The use of Pandas, Seaborn, and Matplotlib for data analysis and visualization
provides clear and interpretable insights into price trends, enhancing the user's ability to
make informed decisions quickly.

7.2 COMPARISON OF EXISTING AND PROPOSED SYSTEM

The proposed system offers several advantages over existing methods, both manual and
automated. In terms of data collection, manual methods are time-consuming and error-prone,
while third-party websites have limited coverage. The proposed system, however, uses
Selenium to automate data collection, providing real-time, comprehensive, and accurate data
directly from airline websites. For data analysis, manual methods offer limited capabilities,
and third-party websites typically provide only basic comparisons. In contrast, the proposed
system leverages Pandas and NumPy for advanced data analysis, allowing for in-depth
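exploration of price trends and forecasting. When it comes to visualization, manual
presentations are often basic, and third-party websites lack customization, whereas the
proposed system uses Seaborn and Matplotlib to produce customizable plots of price trends.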

CHAPTER 8
CONCLUSION AND FUTURE ENHANCEMENTS

8.1 CONCLUSION
The proposed system for airline ticket price analytics using Selenium and Python has proven
to be a highly effective solution, significantly improving the efficiency and accuracy of data
collection, analysis, and visualization. By automating the real-time gathering of data from
multiple airline websites, the system reduces the time and effort required for these tasks,
while ensuring data accuracy and comprehensive market coverage. The use of advanced
tools such as Pandas, NumPy, Seaborn, and Matplotlib allows for in-depth analysis and clear
visual representation of price trends, helping users make well-informed purchasing
decisions. The system's modular and scalable design further enhances its robustness and
adaptability.

8.2 FUTURE ENHANCEMENTS


The current implementation does not account for external factors such as economic
indicators, geopolitical events, and market dynamics. These factors will be incorporated in
the future to improve the model's adaptability to real-world scenarios. Advanced
hyper-parameter tuning techniques and optimization algorithms will be investigated and
integrated into the current scheme to streamline the fine-tuning of the ensemble models,
reducing computational requirements and time complexity. It is also planned to develop
mechanisms for dynamic data updates, allowing the model to adapt to changing market
conditions and ensuring consistently accurate predictions. A feedback mechanism can also be
integrated that allows users to provide feedback on predicted fares, enabling continuous
improvement of the model based on user experience.

CHAPTER 9
SOURCE CODE

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd

# Initialize the WebDriver (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service('path/to/chromedriver'))  # Update path

# Open the airline search website
url = "https://www.example-airline-website.com"
driver.get(url)

# Interact with the search form
origin = driver.find_element(By.ID, 'origin')
destination = driver.find_element(By.ID, 'destination')
departure_date = driver.find_element(By.ID, 'departure-date')
return_date = driver.find_element(By.ID, 'return-date')

# Fill in the form
origin.send_keys('New York')
destination.send_keys('San Francisco')
departure_date.send_keys('2024-10-01')
return_date.send_keys('2024-10-15')
search_button = driver.find_element(By.ID, 'search-button')
search_button.click()

# Wait for results to load
time.sleep(10)

# Extract flight data
flights = []
flight_elements = driver.find_elements(By.CLASS_NAME, 'flight-class')  # Update class name accordingly
for flight in flight_elements:
    airline = flight.find_element(By.CLASS_NAME, 'airline-name').text
    price = flight.find_element(By.CLASS_NAME, 'price').text
    duration = flight.find_element(By.CLASS_NAME, 'duration').text
    departure_time = flight.find_element(By.CLASS_NAME, 'departure-time').text
    arrival_time = flight.find_element(By.CLASS_NAME, 'arrival-time').text
    stops = flight.find_element(By.CLASS_NAME, 'stops').text
    travel_class = flight.find_element(By.CLASS_NAME, 'travel-class').text
    source_city = flight.find_element(By.CLASS_NAME, 'source-city').text
    destination_city = flight.find_element(By.CLASS_NAME, 'destination-city').text
    # Store one dictionary per flight result
    flights.append({
        'airline': airline,
        'price': price,
        'duration': duration,
        'departure_time': departure_time,
        'arrival_time': arrival_time,
        'stops': stops,
        'class': travel_class,
        'source_city': source_city,
        'destination_city': destination_city
    })

# Close the WebDriver
driver.quit()

# Convert to DataFrame
df = pd.DataFrame(flights)

# Clean up price column by removing currency symbols and converting to float
df['price'] = df['price'].replace(r'[\$,]', '', regex=True).astype(float)

# Save the DataFrame to a CSV file for future use
df.to_csv(r"C:\Users\heyit\Desktop\Airline Data\Dataset\Indian Airlines.csv", index=False)

# Proceed with the analysis

Data Analysis and Visualization


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Load the scraped data
df = pd.read_csv(r"C:\Users\heyit\Desktop\Airline Data\Dataset\Indian Airlines.csv")

# Display the first 10 rows
print(df.head(10))

# Display the number of unique values for each column
print(df.nunique())

# Display unique values for each categorical column
for col in df:
    if df[col].dtype == 'object':
        print(f"{col}: {df[col].unique()}")

# Visualization 1: Number of flights by airline
plt.figure(figsize=(15,5))
NF = sns.countplot(x='airline', data=df)
NF.set(xlabel='Airline in India', ylabel='No. of flights', title='No. of flights by Airlines')
plt.show()

# Visualization 2: Price range according to class of travel
plt.figure(figsize=(15,5))
CE = sns.stripplot(x='price', y='class', data=df)
CE.set(xlabel='Ticket cost', ylabel='Class of Travel', title='Price range according to Class of Travel')
plt.show()

# Visualization 3: Availability of tickets according to class of travel
plt.figure(figsize=(15,5))
TA = sns.countplot(x='class', data=df)
TA.set(xlabel='Class of Travel', title='Availability of Tickets according to Class of Travel')
plt.show()

# Visualization 4: Price vs. duration of flight for different airlines
plt.figure(figsize=(15,5))
# Keyword arguments are required here; recent Seaborn versions reject positional x and y
PD = sns.scatterplot(x='duration', y='price', hue='airline', data=df)
PD.set(xlabel='Duration of flight', ylabel='Price of Ticket', title='Price Vs Duration of Flight for different Airlines')
plt.show()

# Visualization 5: Economy vs Business ticket prices by airlines
plt.figure(figsize=(15,5))
AS = sns.barplot(x='airline', y='price', hue='class', data=df.sort_values('price'))
AS.set(xlabel='Airlines in India', ylabel='Price of Ticket', title='Economy Vs Business Ticket Prices by Airlines')
plt.show()

# Visualization 6: Airline ticket prices based on days left before buying the ticket
# Note: 'days_left' must already be a column in the dataset; the scraper above does not collect it
df_temp = df.groupby(['days_left'])['price'].mean().reset_index()
plt.figure(figsize=(15,5))
ax = plt.axes()
# Tickets bought one day before travel: plotted as a point, without a fitted line
sns.regplot(x=df_temp.loc[df_temp['days_left'] == 1].days_left,
            y=df_temp.loc[df_temp['days_left'] == 1].price,
            data=df_temp, fit_reg=False, ax=ax)
# Tickets bought 2 to 19 days before travel
sns.regplot(x=df_temp.loc[(df_temp['days_left'] > 1) & (df_temp['days_left'] < 20)].days_left,
            y=df_temp.loc[(df_temp['days_left'] > 1) & (df_temp['days_left'] < 20)].price,
            data=df_temp, fit_reg=True, ax=ax)
# Tickets bought 20 or more days before travel
sns.regplot(x=df_temp.loc[df_temp['days_left'] >= 20].days_left,
            y=df_temp.loc[df_temp['days_left'] >= 20].price,
            data=df_temp, fit_reg=True, ax=ax)
ax.set(xlabel='Tickets booked before X days', ylabel='Price of Ticket',
       title='Airline ticket prices based on days left before buying the ticket')
plt.show()

# Visualization 7: Average price depending on duration of flight
# Note: 'duration' must be numeric (for example, hours) for the regression fit
df_temp2 = df.groupby(['duration'])['price'].mean().reset_index()
plt.figure(figsize=(15,5))
PD = sns.scatterplot(x='duration', y='price', data=df_temp2)
PD = sns.regplot(x='duration', y='price', data=df_temp2, order=2)
PD.set(xlabel='Duration of flight', ylabel='Price of Ticket', title='Average price depending on duration of flight')
plt.show()

# Visualization 8: Price of ticket depending on time of departure and arrival
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.boxplot(data=df, x='departure_time', y='price', showfliers=False).set(
    xlabel='Departure Time', ylabel='Price of Ticket',
    title='Price of Ticket depending on time of departure')
plt.subplot(1,2,2)
sns.boxplot(data=df, x='arrival_time', y='price', showfliers=False).set(
    xlabel='Arrival Time', ylabel='Price of Ticket',
    title='Price of Ticket depending on time of arrival')
plt.show()

# Visualization 9: Airline ticket prices based on the source and destination cities
ax = sns.relplot(x='destination_city', y='price', col='source_city', col_wrap=3,
                 kind='line', data=df)
ax.fig.subplots_adjust(top=0.9)
ax.fig.suptitle('Airline ticket prices based on the source and destination cities')
plt.show()

# Visualization 10: Price of airline tickets based on number of stops in economy and business class
fig, axs = plt.subplots(1,2, gridspec_kw={'width_ratios': [3,1]}, figsize=(15,5))
sns.barplot(y='price', x='airline', hue='stops',
            data=df.loc[df['class'] == 'Economy'].sort_values('price', ascending=False), ax=axs[0])
axs[0].set(xlabel='Airlines', ylabel='Price of Ticket',
           title='Price of Airline tickets based on No. of Stops in Economy Class')
sns.barplot(y='price', x='airline', hue='stops',
            data=df.loc[df['class'] == 'Business'].sort_values('price', ascending=False), ax=axs[1])
axs[1].set(xlabel='Airlines', ylabel='Price of Ticket',
           title='Price of Airline tickets based on No. of Stops in Business Class')
plt.show()

