Thesis Defense
Price Prediction
Insights
• Total Operating Revenue:
⚬ India's aviation industry generated approximately $20 billion in revenue.
• Revenue Passenger Kilometers (RPK):
⚬ In 2019, Indian airlines achieved around 265 billion RPK, reflecting
significant passenger demand and airline activity.
• Passenger Traffic:
⚬ Domestic airlines in India carried approximately 144 million passengers
in 2019, showcasing the growing preference for air travel.
Stakeholders
Consumers:
• Cost Savings: Helps consumers identify the best times to purchase tickets at the lowest prices.
• Budget Planning: Enables travelers to plan their expenses more effectively.
Airline Benefits:
• Revenue Management: Assists airlines in setting optimal prices to maximize revenue.
• Demand Forecasting: Improves airlines' ability to predict passenger demand and adjust pricing strategies accordingly.
Market Dynamics:
⚬ Enhances market competitiveness by providing transparent pricing.
⚬ Encourages more informed decision-making for both consumers and airlines.
Data Processing
• Importing Datasets:
⚬ Collected datasets from various sources, including airline websites and travel agencies.
⚬ Ensured datasets included relevant features such as date of journey, departure time, arrival time, and ticket prices.
• Handling Missing Values:
⚬ Identified and addressed missing data points using imputation methods to maintain dataset integrity.
⚬ Applied techniques such as mean imputation for numerical data and mode imputation for categorical data.
• Date and Time Conversion:
⚬ Converted date and time columns into datetime format for consistent processing.
⚬ Extracted valuable time-based features such as day of the week, month, and hour of travel.
• Data Normalization:
⚬ Standardized numerical features to ensure uniform scaling across the dataset.
⚬ Used normalization techniques to improve the performance of machine learning models.
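The processing steps above could look roughly like the following pandas sketch. The column names (`Date_of_Journey`, `Dep_Time`, `Airline`, `Duration_Min`), the date format, and the sample rows are assumptions made for illustration; the slides do not state the actual schema or which scaler was used, so z-score scaling is shown as one possibility.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset schema is not given in the slides.
raw = pd.DataFrame({
    "Date_of_Journey": ["24/03/2019", "01/05/2019", "09/06/2019", "12/05/2019"],
    "Dep_Time": ["22:20", "05:50", "09:25", "18:05"],
    "Airline": ["IndiGo", "Air India", None, "IndiGo"],
    "Duration_Min": [170.0, 445.0, None, 325.0],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Mean imputation for the numerical column, mode imputation for the categorical one.
    out["Duration_Min"] = out["Duration_Min"].fillna(out["Duration_Min"].mean())
    out["Airline"] = out["Airline"].fillna(out["Airline"].mode()[0])

    # Convert text columns to datetime (day/month/year format is an assumption).
    out["Date_of_Journey"] = pd.to_datetime(out["Date_of_Journey"], format="%d/%m/%Y")
    out["Dep_Time"] = pd.to_datetime(out["Dep_Time"], format="%H:%M")

    # Time-based features: day of week (Monday = 0), month, and departure hour.
    out["journey_weekday"] = out["Date_of_Journey"].dt.dayofweek
    out["journey_month"] = out["Date_of_Journey"].dt.month
    out["dep_hour"] = out["Dep_Time"].dt.hour

    # Z-score standardization of the numerical feature (population std).
    mean = out["Duration_Min"].mean()
    std = out["Duration_Min"].std(ddof=0)
    out["duration_scaled"] = (out["Duration_Min"] - mean) / std
    return out


clean = preprocess(raw)
print(clean[["Airline", "journey_weekday", "journey_month", "dep_hour", "duration_scaled"]])
```

In a real pipeline the imputation and scaling statistics would normally be computed on the training split only and then reused on the test split.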
Nominal data: categories with no inherent order, which are converted to numbers with one-hot encoding.
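As a small illustration of one-hot encoding for a nominal column, the sketch below uses `pandas.get_dummies`. The airline values are placeholder examples, not rows from the thesis dataset.

```python
import pandas as pd

# Placeholder airline names; any unordered category works the same way.
flights = pd.DataFrame({"Airline": ["IndiGo", "Air India", "SpiceJet", "IndiGo"]})

# Each distinct category becomes its own 0/1 column, so no artificial order is implied.
encoded = pd.get_dummies(flights, columns=["Airline"], dtype=int)
print(encoded)
```

Every row has exactly one column set to 1, which is what distinguishes this from label encoding, where categories are mapped to arbitrary integers.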
Date and Time Extraction:
⚬ Extracted day and month from the date of journey to capture seasonal and monthly trends.
⚬ Extracted hour and minute from departure and arrival times to analyze the impact of travel time on ticket prices.
Duration Calculation:
⚬ Calculated the duration of each flight by subtracting departure time from arrival time.
⚬ Converted the duration into a numerical feature representing total travel time in minutes or hours.
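The extraction and duration steps might be sketched as follows. Column names and formats are assumptions, and so is the overnight rule: when the arrival clock time is earlier than the departure clock time, the flight is assumed to land the next day. The slides do not say how such cases were handled.

```python
import pandas as pd

# Hypothetical rows; the first flight departs late and lands after midnight.
trips = pd.DataFrame({
    "Date_of_Journey": ["24/03/2019", "01/05/2019"],
    "Dep_Time": ["22:20", "05:50"],
    "Arrival_Time": ["01:10", "13:15"],
})

stamp_format = "%d/%m/%Y %H:%M"
journey = pd.to_datetime(trips["Date_of_Journey"], format="%d/%m/%Y")
dep = pd.to_datetime(trips["Date_of_Journey"] + " " + trips["Dep_Time"], format=stamp_format)
arr = pd.to_datetime(trips["Date_of_Journey"] + " " + trips["Arrival_Time"], format=stamp_format)

# Assumption: an arrival before the departure means the flight lands the following day.
arr = arr.where(arr >= dep, arr + pd.Timedelta(days=1))

# Day and month of the journey, plus hour and minute of departure and arrival.
trips["journey_day"] = journey.dt.day
trips["journey_month"] = journey.dt.month
trips["dep_hour"] = dep.dt.hour
trips["dep_minute"] = dep.dt.minute
trips["arr_hour"] = arr.dt.hour
trips["arr_minute"] = arr.dt.minute

# Duration as a single numerical feature, in whole minutes.
trips["duration_min"] = ((arr - dep).dt.total_seconds() // 60).astype(int)
print(trips)
```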
Feature Engineering
• Airline Encoding:
⚬ Encoded the airline names to capture airline-specific pricing strategies.
⚬ Used label encoding or one-hot encoding for transforming airline data into numerical features.
• Route Encoding:
⚬ Encoded the route (combination of source and destination) to identify route-specific pricing patterns.
⚬ Applied one-hot encoding to convert categorical route data into numerical format.
• Additional Features:
⚬ Considered adding features such as layovers, number of stops, and seat class.
⚬ Evaluated the importance of each feature using feature importance metrics from machine learning models.
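One way to combine these encodings is sketched below. The column names, the `->` route separator, and the stop labels are illustrative assumptions; the integer mapping for stops treats the count as ordinal, which is a modeling choice not stated in the slides.

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Source": ["Delhi", "Kolkata", "Delhi"],
    "Destination": ["Cochin", "Bangalore", "Cochin"],
    "Total_Stops": ["non-stop", "2 stops", "1 stop"],
})

# Route = source and destination combined into one categorical value.
df["Route"] = df["Source"] + "->" + df["Destination"]

# Label encoding for the airline: each name maps to an integer code (alphabetical order).
df["airline_code"] = df["Airline"].astype("category").cat.codes

# Number of stops has a natural order, so a plain integer mapping is used here.
stops_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2}
df["stops"] = df["Total_Stops"].map(stops_map)

# One-hot encoding for the route, which has no natural order.
features = pd.get_dummies(df[["airline_code", "stops", "Route"]], columns=["Route"], dtype=int)
print(features)
```

Label encoding keeps the feature matrix narrow, which tree-based models tolerate well, while one-hot encoding avoids implying an order between routes.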
Model Selection
....why so?
• Performance Metrics:
⚬ Accuracy: Achieved an accuracy rate of 85.36%, indicating a high level of predictive performance.
⚬ Mean Absolute Error (MAE): Low MAE values, reflecting the model's precision in predicting
ticket prices.
⚬ Root Mean Squared Error (RMSE): Demonstrated the model's ability to handle variability in the
data with minimal error.
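For reference, the error metrics named above can be computed with scikit-learn as in this sketch. The prices are made-up numbers, not thesis results. The slides do not say how "accuracy" was measured; for a regressor, one common reading is the R² score, which is shown here purely as an assumption.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted ticket prices, for illustration only.
y_true = np.array([4000.0, 6000.0, 8000.0, 10000.0])
y_pred = np.array([4500.0, 5500.0, 8000.0, 11000.0])

# MAE: average absolute difference between predicted and actual prices.
mae = mean_absolute_error(y_true, y_pred)

# RMSE: square root of the mean squared error; penalizes large misses more than MAE.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R^2: share of price variance explained by the predictions (1.0 is a perfect fit).
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

MAE and RMSE are in the same units as the ticket price, so they are easiest to interpret when reported next to the typical fare.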
• Feature Importance:
⚬ Identified key features contributing to accurate predictions, such as departure time, duration, and airline.
⚬ Provided insights into which factors most significantly influence ticket prices.
• Model Robustness:
⚬ Validated the model's robustness through cross-validation and testing on unseen data.
⚬ Ensured consistent performance across different datasets and conditions.
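A minimal sketch of how feature importance and robustness checks could be run with a random forest is given below. The data is synthetic and the feature names (`dep_hour`, `duration_min`, `airline_code`) are assumptions; the fold count and split ratio are also illustrative, since the slides do not specify them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for the real flight features.
rng = np.random.default_rng(0)
n = 300
feature_names = ["dep_hour", "duration_min", "airline_code"]
X = np.column_stack([
    rng.integers(0, 24, n),
    rng.integers(60, 600, n),
    rng.integers(0, 5, n),
])
# Synthetic price driven mainly by duration, with a smaller airline effect plus noise.
y = 2000 + 10 * X[:, 1] + 300 * X[:, 2] + rng.normal(0, 200, n)

# Hold out unseen data for the final check.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# 5-fold cross-validation on the training split; similar scores across folds suggest stability.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")

# Fit once on the full training split, then score on the held-out data.
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# Impurity-based importances; they sum to 1 across features.
importances = dict(zip(feature_names, model.feature_importances_))

print("CV R2 per fold:", np.round(cv_scores, 3))
print("Held-out R2:", round(test_r2, 3))
print("Importances:", importances)
```

Impurity-based importances can favor high-cardinality features, so permutation importance is a common cross-check.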
Hypertuning The Model
Objective:
Optimize the performance of the RandomForest Regressor by fine-tuning its hyperparameters.
Parameters Tuned:
Number of Trees (n_estimators): Adjusted the number of decision trees in the forest; more trees generally give more
stable predictions at a higher training cost.
Maximum Depth (max_depth): Set limits on the depth of the trees to prevent overfitting.
Minimum Samples Split (min_samples_split): Determined the minimum number of samples required to split an
internal node.
Minimum Samples Leaf (min_samples_leaf): Established the minimum number of samples required to be at a leaf
node.
Optimization Techniques:
RandomizedSearchCV: Utilized to efficiently search through a wide range of hyperparameters by randomly
sampling from the specified distributions.
GridSearchCV: Employed to perform an exhaustive search over a predefined grid of hyperparameters to find the best
combination.
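The two search strategies could be chained as in the sketch below: a random search over a wide space, followed by a small grid around its best result. The data is synthetic, and the candidate values, `n_iter`, fold count, and scoring metric are all illustrative assumptions rather than the settings used in the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Small synthetic regression problem so the searches run quickly.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 3))
y = 5000 * X[:, 0] + 1000 * X[:, 1] + rng.normal(0, 100, 120)

# Wide search space covering the four parameters listed above.
param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Random search: evaluates only n_iter sampled combinations instead of all 81.
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
random_search.fit(X, y)
best = random_search.best_params_

# Grid search: exhaustive over a narrow grid built around the random-search result.
param_grid = {
    "n_estimators": [best["n_estimators"]],
    "max_depth": [best["max_depth"]],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
grid_search.fit(X, y)

print("Random search best:", best)
print("Grid search best:", grid_search.best_params_)
print("Best CV MAE:", -grid_search.best_score_)
```

The scoring string `neg_mean_absolute_error` is negated because scikit-learn always maximizes the score, so the sign is flipped back when reporting MAE.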
• Results:
⚬ Achieved improved accuracy and reduced error rates with optimized
hyperparameters.
⚬ Fine-tuned model demonstrated enhanced generalization on unseen data.