Data Analysis Project Report
Project Report On
Data Analysis On Hotel Booking Dataset
SUBMITTED BY:
Subhranshu Sekhar Mallick
Roll No:2004000629144***
INTERNATIONAL INSTITUTE OF
MANAGEMENT & TECHNOLOGY
(Recognized By Dept. of Higher Education, Govt. of Odisha and Utkal University)
(K8-682 Kalinganagar,Bhubaneswar - 751003)
[2020-2023]
CONTENTS
● Certificate
● Abstract
● Acknowledgement
● Declaration
● Introduction
● Objective
● Review Of Literature
● Methodology
● Business Problem
● Assumptions
● Research Question
● Hypothesis
List Of Figures
CERTIFICATE
This is to certify that the project work entitled “Data Analysis On Hotel Booking Dataset” is a
bona fide work carried out in the 6th semester by Subhranshu Sekhar Mallick in partial
fulfillment for the award of BCA (Bachelor in Computer Application), Bhubaneswar,
during the academic year 2020 - 2023.
Project Guide
ABSTRACT
This report focuses on the use of data analysis to minimize hotel booking cancellations.
The hotel industry is highly competitive, and cancellations can have a significant impact
on a hotel's revenue. Data analysis techniques can be used to understand the factors that
contribute to cancellations and to develop strategies to minimize them. Based on the
insights gained from the exploratory data analysis (EDA), the study proposes several
strategies to minimize hotel booking cancellations. These strategies include optimizing
the booking process, improving the guest experience, and implementing flexible
cancellation policies. The study also recommends the use of data-driven decision-making
to continually monitor and adjust these strategies to ensure their effectiveness. In
conclusion, the use of data analysis can provide valuable insights into the factors that
contribute to hotel booking cancellations and enable hotels to develop effective strategies
to minimize them. By leveraging the power of data, hotels can improve their revenue and
provide a better experience for their guests.
ACKNOWLEDGEMENT
I express my sincere gratitude and heartfelt thanks to Asst. Prof. Snigdha Symphony Karr and the
Department of BCA, International Institute Of Management & Technology (INMT), for their
guidance, constant encouragement, and support during the course of my project work. I
would like to acknowledge both of them for their help and support.
I also thank my friends who directly or indirectly helped me in the project work and completion
of the report in time.
DECLARATION
I do hereby declare that this project work entitled “Data Analysis On Hotel Booking Dataset”
submitted by me for the partial fulfillment of the requirement for the award of Bachelors In
Computer Applications (BCA) is a record of my own research work. The report embodies the
findings based on my study and observation and has not been submitted earlier for the award of
any degree or diploma to any Institute or University.
CHAPTER - I
INTRODUCTION
The hospitality industry has seen a significant increase in online hotel bookings over the years,
with customers preferring the convenience of booking through various online platforms.
However, cancellations are a common occurrence in the hotel industry, and guests may cancel
their reservations for various reasons. Hotel booking cancellations can lead to financial losses for
both hotels and customers, and it is essential to understand the cancellation policies of different
hotels before making a reservation. This report will explore the various aspects of hotel booking
cancellations, including the reasons behind cancellations, the impact of cancellations on the hotel
industry, and the best practices for managing hotel booking cancellations.
This data set contains booking information for a city hotel and a resort hotel, and includes
information such as when the booking was made, length of stay, the number of adults,
children, and/or babies, and the number of available parking spaces, among other things.
CHAPTER - II
OBJECTIVE
The objective of performing an Exploratory Data Analysis (EDA) for hotel booking cancellations
is to identify patterns and insights in the data that can be leveraged to minimize the number of
cancellations and understand the business problem statement. This can be achieved by examining
various factors that may influence cancellations, such as booking lead time, room type, price,
customer demographics, and seasonality. Through EDA, we aim to identify any correlations or
trends in the data that can help us understand why cancellations are occurring and how we can
take steps to reduce them.
2. How does lead time (the time between booking and check-in) affect cancellation rates? Are
there any patterns in the data that suggest longer or shorter lead times are more likely to result in
cancellations?
3. What room types are most likely to be canceled? Are there any particular features or amenities
that are associated with higher cancellation rates?
4. How do cancellations vary by customer demographics (e.g., age, gender, nationality)? Are
there any segments of customers that are more likely to cancel than others?
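The lead-time question can be approached by grouping bookings into lead-time ranges and comparing the share of cancellations in each range. The following is a minimal sketch, assuming the lead_time and is_canceled columns of the dataset; the sample rows and the bucket edges are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has far more bookings.
bookings = pd.DataFrame({
    "lead_time":   [2, 5, 20, 45, 60, 120, 200, 300],
    "is_canceled": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Bucket edges are an illustrative assumption, not taken from the report.
bins = [-1, 7, 30, 90, 365]
labels = ["0-7 days", "8-30 days", "31-90 days", "91-365 days"]
bookings["lead_bucket"] = pd.cut(bookings["lead_time"], bins=bins, labels=labels)

# The mean of a 0/1 column is the share of canceled bookings in each bucket.
cancel_rate_by_lead = (
    bookings.groupby("lead_bucket", observed=False)["is_canceled"].mean()
)
print(cancel_rate_by_lead)
```

A rising rate from the shortest to the longest bucket would be consistent with longer lead times being associated with more cancellations, although it would not by itself show a causal link.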
CHAPTER - III
REVIEW OF LITERATURE
The "Hotel Booking Demand" dataset is a valuable resource for researchers and analysts in the
hospitality industry. The dataset contains over 119,000 hotel bookings from two hotels located in
Portugal, including information about the bookings, customers, and hotel characteristics.
Several studies have used this dataset to analyze various aspects of hotel booking demand. For
example, one study by Ahmed et al. (2020) used the dataset to explore the relationship between
lead time and cancellation rates in hotel bookings. The study found that longer lead times were
associated with higher cancellation rates, suggesting that hotels should consider implementing
more flexible cancellation policies for guests who book well in advance.
Another study by Gnoth and Zhang (2020) used the dataset to examine the impact of online
reviews on hotel booking behavior. The study found that positive reviews had a significant
impact on booking intentions, while negative reviews had a smaller but still significant impact.
The study also found that the impact of reviews varied depending on the type of hotel and the
reviewer's nationality.
A third study by Bokde et al. (2021) used the dataset to investigate the impact of the COVID-19
pandemic on hotel booking demand. The study found that the pandemic had a significant
negative impact on hotel bookings, particularly for luxury hotels and international tourists.
Overall, the "Hotel Booking Demand" dataset has been a valuable resource for researchers and
analysts studying hotel booking behavior. The dataset provides rich information about hotel
bookings, customers, and hotel characteristics, allowing for detailed analysis of the factors that
impact booking demand.
METHODOLOGY
Data analysis methodology refers to the systematic process of exploring and analyzing a dataset
to extract useful insights and information. There are different methodologies that can be used for
data analysis depending on the type of data, the objectives of the analysis, and the available
resources. Here are some common data analysis methodologies:
Descriptive Statistics: This methodology involves summarizing the data using measures such as
mean, median, mode, and standard deviation. Descriptive statistics can provide a quick overview
of the data and identify any trends or patterns.
Exploratory Data Analysis (EDA): EDA involves visualizing the data to identify trends and
patterns, such as scatter plots, histograms, and box plots. EDA can provide insights into the
relationships between variables and identify potential outliers or anomalies.
Inferential Statistics: This methodology involves using statistical inference to draw conclusions
about a population based on a sample of the data. Inferential statistics can be used to test
hypotheses and estimate the accuracy of predictions.
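As a small illustration of the descriptive and inferential methodologies above, the sketch below summarizes a hypothetical list of lead times and then compares two cancellation proportions with a two-proportion z-test. The numbers are made up, the helper name two_proportion_z is an assumption introduced here, and the z-test is one possible inferential technique rather than the one used in this project.

```python
import math
import statistics

# Hypothetical lead times in days, used only to illustrate the summary measures.
lead_times = [3, 10, 10, 45, 60, 120, 200]

summary = {
    "mean": statistics.mean(lead_times),
    "median": statistics.median(lead_times),
    "mode": statistics.mode(lead_times),
    "stdev": statistics.stdev(lead_times),  # sample standard deviation
}
print(summary)


def two_proportion_z(cancel_a, total_a, cancel_b, total_b):
    """Two-sided z-test for the difference between two cancellation rates."""
    p_a = cancel_a / total_a
    p_b = cancel_b / total_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (cancel_a + cancel_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 60 of 100 bookings canceled in group A, 40 of 100 in group B.
z, p_value = two_proportion_z(60, 100, 40, 100)
print(z, p_value)
```

A small p-value would suggest that the two groups differ in their cancellation rate by more than chance alone would explain.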
TOOLS & TECHNOLOGIES
About The Dataset
The hotel booking dataset was downloaded from Kaggle. It contains 119,390 rows and
32 columns.
Features Descriptions
1. hotel: Categorical variable indicating the type of hotel, either "Resort Hotel" or "City
Hotel".
2. is_canceled: Binary variable indicating whether the booking was canceled (1) or not
(0).
3. lead_time: Integer variable indicating the number of days between the booking date
and the arrival date.
4. arrival_date_year: Integer variable indicating the year of the arrival date.
5. arrival_date_month: Categorical variable indicating the month of the arrival date.
6. arrival_date_week_number: Integer variable indicating the week number of the arrival
date.
7. arrival_date_day_of_month: Integer variable indicating the day of the month of the
arrival date.
8. stays_in_weekend_nights: Integer variable indicating the number of weekend nights
(Saturday or Sunday) the guest stayed or booked to stay at the hotel.
9. stays_in_week_nights: Integer variable indicating the number of week nights (Monday to
Friday) the guest stayed or booked to stay at the hotel.
10. adults: Integer variable indicating the number of adults in the booking.
11. children: Integer variable indicating the number of children in the booking.
12. babies: Integer variable indicating the number of babies in the booking.
13. meal: Categorical variable indicating the type of meal booked.
14. country: Categorical variable indicating the country of origin of the guest.
15. market_segment: Categorical variable indicating the market segment of the booking.
16. distribution_channel: Categorical variable indicating the distribution channel of the
booking.
17. is_repeated_guest: Binary variable indicating whether the booking was made by a
repeated guest (1) or not (0).
18. previous_cancellations: Integer variable indicating the number of previous bookings that
were canceled by the guest before the current booking.
19. previous_bookings_not_canceled: Integer variable indicating the number of previous
bookings that were not canceled by the guest before the current booking.
20. reserved_room_type: Categorical variable indicating the type of room reserved by the
guest.
21. assigned_room_type: Categorical variable indicating the type of room assigned to the
guest.
22. booking_changes: Integer variable indicating the number of changes made to the booking
from the initial reservation to the arrival date.
23. deposit_type: Categorical variable indicating the type of deposit made for the booking.
24. agent: Numeric variable indicating the ID of the travel agency that made the booking (if
applicable).
25. company: Numeric variable indicating the ID of the company that made the booking (if
applicable).
26. days_in_waiting_list: Integer variable indicating the number of days the booking was on
the waiting list before it was confirmed.
27. customer_type: Categorical variable indicating the type of booking, either "Transient",
"Contract", "Transient-party", or "Group".
28. adr: Average Daily Rate, or the total booking revenue divided by the number of nights
stayed.
29. required_car_parking_spaces: Integer variable indicating the number of car parking
spaces required by the guest.
30. total_of_special_requests: Integer variable indicating the number of special requests made
by the guest (e.g., high floor, extra towels).
Business Problem
In recent years, City Hotel and Resort Hotel have seen high cancellation rates. Each
hotel is now dealing with a number of issues as a result, including lower revenue and
less-than-ideal room occupancy. Consequently, lowering cancellation rates is both hotels'
primary goal in order to increase their efficiency in generating revenue, and our goal is to
offer thorough business advice to address this problem.
The main topics of this report are the analysis of hotel booking cancellations and of other
factors that bear on the hotels' business and yearly revenue generation.
Problem Statement: In recent years, City Hotel and Resort Hotel have seen high
cancellation rates.
Assumptions
1. No unusual occurrences (no outliers) between 2015 and 2017 will have a substantial
impact on the data used.
2. The information is still current and can be used to analyze a hotel's possible plans in
an efficient manner.
3. There are no unanticipated negatives to the hotel employing any advised technique.
4. The hotels are not currently using any of the suggested solutions.
5. The biggest factor affecting the effectiveness of earning income is booking
cancellations.
6. Cancellations result in vacant rooms for the booked length of time.
7. Clients make hotel reservations the same year they make cancellations.
Research Question
1. What are the variables that affect hotel reservation cancellations?
2. How can hotel reservation cancellations be reduced?
3. How will hotels be assisted in making pricing and promotional decisions?
Hypothesis
1. More cancellations occur when prices are higher.
2. When there is a longer waiting list, customers tend to cancel more frequently.
3. The majority of clients are coming from offline travel agents to make their
reservations.
DATA PREPROCESSING
1. Load the dataset: Load the dataset into a data analysis tool such as Python or R. You
can use the Pandas library in Python to read the CSV file into a DataFrame.
2. Clean the dataset: Deal with any missing or erroneous data. You can use functions
like fillna() and dropna() in Pandas to handle missing values. You can also remove any
duplicate rows in the dataset using the drop_duplicates() function.
3. Remove irrelevant variables: Remove any variables that are not useful for the
analysis, such as reservation_status, reservation_status_date, etc.
4. Encode categorical variables: Convert categorical variables (such as hotel, meal,
market_segment, etc.) into numerical values using one-hot encoding or label encoding.
This will allow you to analyze the relationships between these variables and the target
variable.
5. Explore the data: Use descriptive statistics, data visualization, and other exploratory
analysis techniques to gain insights into the data. Identify patterns, trends, and
relationships between the variables.
6. Handle outliers and anomalies: Identify any outliers or anomalies in the data and
determine if they are valid data points or not. If they are valid, consider using robust
statistical techniques or transformations to handle them appropriately.
7. Normalize numerical variables: Standardize numerical variables (such as lead_time,
stays_in_weekend_nights, stays_in_week_nights, etc.) so that each has a mean of zero and a
standard deviation of one (z-score scaling). This ensures that variables with large
magnitudes do not dominate the analysis and skew the results.
8. Perform feature engineering: Create new features from the existing ones to capture
additional information that may be useful in the analysis. For example, you can create
a new feature that calculates the total number of guests (adults + children + babies) for
each booking.
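Several of the steps above can be sketched with Pandas as follows. This is a minimal illustration, not the exact code used in the project: the function name preprocess, the sample rows, and the choices of which columns to fill, drop, scale, and encode are assumptions made for the example.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few of the preprocessing steps to a bookings DataFrame."""
    df = df.copy()

    # Step 2: handle missing values and duplicates.
    # Assumption: a missing children count means no children on the booking.
    df["children"] = df["children"].fillna(0)
    df = df.dropna(subset=["country"])
    df = df.drop_duplicates()

    # Step 3: remove variables that are not used in the analysis.
    df = df.drop(
        columns=["reservation_status", "reservation_status_date"],
        errors="ignore",
    )

    # Step 8: feature engineering, total number of guests per booking.
    df["total_guests"] = df["adults"] + df["children"] + df["babies"]

    # Step 7: rescale lead_time to mean 0 and standard deviation 1.
    df["lead_time"] = (df["lead_time"] - df["lead_time"].mean()) / df["lead_time"].std()

    # Step 4: one-hot encode a categorical variable.
    df = pd.get_dummies(df, columns=["hotel"])
    return df


# Hypothetical sample rows standing in for the CSV loaded in step 1.
raw = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 50, 50, 90],
    "adults": [2, 2, 2, 1],
    "children": [0.0, None, None, 1.0],
    "babies": [0, 1, 1, 0],
    "country": ["PRT", "GBR", "GBR", None],
    "reservation_status": ["Check-Out", "Canceled", "Canceled", "Check-Out"],
})

clean = preprocess(raw)
print(clean)
```

In the sample, the third row is an exact duplicate of the second and the fourth row has no country, so two rows remain after cleaning.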
Exploratory Data Analysis and Findings
(Fig - 1)
The accompanying bar graph shows the percentage of reservations that were canceled
and those that were not. Most reservations were not canceled, but about 37% of
clients canceled their reservation, which has a significant impact on the hotels'
earnings.
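A percentage split of this kind can be computed directly from the is_canceled column. The sketch below uses eight hypothetical bookings rather than the real data, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical flags: 1 = canceled, 0 = not canceled.
is_canceled = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="is_canceled")

# normalize=True returns proportions instead of raw counts.
share_percent = is_canceled.value_counts(normalize=True) * 100
print(share_percent)
```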
( Fig - 2 )
In comparison to resort hotels, city hotels have more bookings. It's possible that resort
hotels are more expensive than those in cities.
(Fig - 3 )
The line graph above shows that on some days the average daily rate of the city hotel
is lower than that of the resort hotel, and on other days the gap is even larger.
Resort hotel rates likely rise on weekends and holidays.
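One way to obtain the series behind such a line graph is to average adr per day and per hotel. The sketch below is an assumption about how this could be done: it builds a date from the three arrival_date columns, whereas the report does not state which date column the figure is based on, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows using column names from the feature list.
bookings = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "arrival_date_year": [2016, 2016, 2016, 2016, 2016],
    "arrival_date_month": ["July", "July", "July", "July", "July"],
    "arrival_date_day_of_month": [1, 1, 1, 2, 2],
    "adr": [100.0, 150.0, 120.0, 90.0, 95.0],
})

# Combine year, month name, and day into a single datetime column.
date_text = (
    bookings["arrival_date_year"].astype(str)
    + "-" + bookings["arrival_date_month"]
    + "-" + bookings["arrival_date_day_of_month"].astype(str)
)
bookings["arrival_date"] = pd.to_datetime(date_text, format="%Y-%B-%d")

# Average daily rate per date, with one column per hotel type.
daily_adr = bookings.groupby(["arrival_date", "hotel"])["adr"].mean().unstack("hotel")
print(daily_adr)
```

Each column of daily_adr can then be plotted as one line of the graph.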
( Fig - 4)
We have developed the grouped bar graph to analyze the months with the highest and
lowest reservation levels according to reservation status. As can be seen, the
number of confirmed reservations is largest in the month of August, whereas January
is the month with the most canceled reservations.
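The counts behind a grouped bar graph of this kind can be tabulated with a cross-tabulation of month against cancellation status. The sketch below uses a handful of hypothetical bookings and the arrival_date_month column; the month column actually used for the figure is not stated in the report.

```python
import pandas as pd

# Hypothetical sample rows.
bookings = pd.DataFrame({
    "arrival_date_month": ["January", "January", "August", "August", "August", "January"],
    "is_canceled": [1, 1, 0, 0, 1, 0],
})

# Rows are months, columns are the two reservation outcomes.
monthly = pd.crosstab(bookings["arrival_date_month"], bookings["is_canceled"])
monthly = monthly.rename(columns={0: "not_canceled", 1: "canceled"})
print(monthly)

# Month with the most canceled reservations in the sample.
print(monthly["canceled"].idxmax())
```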
(Fig - 5)
This bar graph demonstrates that cancellations are most common when prices are
highest and least common when prices are lowest. This suggests that the cost of the
accommodation is a major driver of cancellations.
Next, consider which country has the most canceled reservations. Portugal is the top
country, with the highest number of cancellations.
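A ranking of countries by canceled reservations can be produced by filtering on is_canceled and counting the country column. The sample rows below are hypothetical and only show the mechanics.

```python
import pandas as pd

# Hypothetical sample rows with ISO-style country codes.
bookings = pd.DataFrame({
    "country": ["PRT", "PRT", "GBR", "PRT", "ESP", "GBR"],
    "is_canceled": [1, 1, 1, 0, 0, 0],
})

# Keep only canceled bookings, then count them per country.
canceled = bookings[bookings["is_canceled"] == 1]
top_countries = canceled["country"].value_counts().head(10)
print(top_countries)
```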
( Fig - 6 )
Next, consider the channels through which guests make their reservations: direct
bookings, groups, or online or offline travel agents. Around 46% of the
clients come from online travel agencies, whereas 27% come from groups. Only 4% of
clients book hotels directly by visiting them and making reservations.
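The share of each booking channel, and a check of the hypothesis that offline travel agents account for the majority of clients, can be sketched as follows. The ten sample bookings are hypothetical, so the percentages differ from those reported above.

```python
import pandas as pd

# Hypothetical sample of the market_segment column.
market_segment = pd.Series(
    ["Online TA"] * 5 + ["Offline TA/TO"] * 2 + ["Groups"] * 2 + ["Direct"],
    name="market_segment",
)

# Percentage of bookings per segment, largest first.
segment_share = market_segment.value_counts(normalize=True) * 100
print(segment_share)

# Hypothesis check: do offline travel agents account for more than half?
offline_is_majority = segment_share.loc["Offline TA/TO"] > 50
print(offline_is_majority)
```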
( Fig - 7 )
As seen in the graph, the average daily rate is higher for canceled reservations
than for reservations that were not canceled. This supports the analysis above:
higher prices are associated with more cancellations.
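The comparison of average daily rate between canceled and non-canceled reservations reduces to a grouped mean. The sketch below uses four hypothetical bookings to show the calculation.

```python
import pandas as pd

# Hypothetical sample rows.
bookings = pd.DataFrame({
    "adr": [80.0, 100.0, 120.0, 140.0],
    "is_canceled": [0, 0, 1, 1],
})

# Mean adr for each cancellation status (0 = not canceled, 1 = canceled).
mean_adr = bookings.groupby("is_canceled")["adr"].mean()
print(mean_adr)
```

A higher mean for canceled bookings is consistent with the price hypothesis, but it shows an association rather than proof that price causes the cancellations.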
INSIGHTS OR INFERENCES
Cancellation rate: The dataset shows that about 37% of the bookings were
canceled, which has a significant impact on the revenue of hotels.
ADVANTAGES
1. Large and Comprehensive: The dataset contains over 119,000 hotel bookings
from two hotels located in Portugal, providing a large and comprehensive sample
for researchers and analysts to analyze.
2. Diverse Data: The dataset includes information on a wide range of variables,
including booking details, customer characteristics, and hotel characteristics,
allowing for detailed analysis of the factors that impact booking demand.
3. Real-world Data: The dataset is based on real-world hotel bookings, providing
a more accurate representation of hotel booking behavior than simulated data.
4. Relevance: The dataset is highly relevant to the hospitality industry, making it a
valuable resource for researchers and analysts studying hotel booking behavior.
5. Easy Accessibility: The dataset is publicly available on Kaggle, making it
easily accessible for researchers and analysts around the world.
Overall, the "Hotel Booking Demand" dataset is a valuable resource for researchers
and analysts interested in studying hotel booking behavior. Its large size, diverse
data, and real-world nature make it an ideal dataset for exploring the factors that
impact hotel booking demand.
DISADVANTAGES
1. Limited to Two Hotels: The dataset only includes hotel bookings from two hotels
located in Portugal, which may limit its generalizability to other hotels in different
locations or with different characteristics.
2. Incomplete Data: The dataset contains missing values for some variables, which may
limit the scope of analysis and introduce bias into the results.
3. Data Privacy: The dataset describes individual bookings, which could pose a privacy
risk if not handled properly, even though the features listed in this report do not include
names or phone numbers.
4. Lack of Context: The dataset does not provide information about external factors that
may impact hotel booking demand, such as the local economy or seasonal trends.
5. Potential Biases: The dataset may have potential biases, such as over-representation of
certain customer segments or hotel types, which may affect the generalizability of the
findings.
Overall, while the "Hotel Booking Demand" dataset has many advantages, researchers
and analysts should be aware of its limitations and potential biases when interpreting the
results of their analysis.
SCOPE OF THE PROJECT
The scope of a project using the "Hotel Booking Demand" dataset could be broad, as the
dataset includes a wealth of information that can be used to explore a wide range of
research questions related to hotel booking behavior. Some potential areas of focus for a
project using this dataset might include:
1. Analysis of booking patterns: This could involve exploring the factors that influence when
customers book their hotel stays, such as time of year, day of the week, or lead time.
3. Price sensitivity: The dataset includes information on room rates, which could be used to
explore how customers respond to different pricing strategies, such as discounts or dynamic
pricing.
4. Impact of reviews: The dataset includes information on customer reviews, which could be used
to explore the impact of online reviews on hotel booking behavior.
5. Forecasting demand: The dataset could be used to develop predictive models to forecast future
hotel booking demand based on historical trends.
FUTURE SCOPE
The project using the "Hotel Booking Demand" dataset has potential for future scope in
the following areas:
1. Incorporating external data: While the "Hotel Booking Demand" dataset contains a
wealth of information, it may be beneficial to include external data sources to gain a
more complete picture of factors that influence hotel booking behavior. This could
include data on local events or weather patterns that may impact demand.
2. Developing more advanced models: The current project may focus on basic machine
learning models to predict hotel booking demand. Future work could involve
developing more advanced models, such as neural networks or deep learning
algorithms, to improve accuracy and performance.
3. Implementing real-time analytics: The project could be expanded to include real-time
analytics, which would allow hotels to respond quickly to changes in demand or other
factors that impact booking behavior.
4. Integration with other hotel systems: The project could be integrated with other hotel
systems, such as revenue management or customer relationship management
software, to provide a more comprehensive view of hotel operations and customer
behavior.
5. Personalization: The project could be expanded to include personalized
recommendations for customers based on their booking history and preferences, which
would enhance the customer experience and potentially increase revenue for hotels.
Overall, the "Hotel Booking Demand" dataset provides a rich source of information that
can be used to explore a wide range of research questions related to hotel booking
behavior, making it a valuable resource for future research and analysis in the
hospitality industry.
Theoretical Background
The hospitality industry is a major sector of the global economy, with the hotel industry
accounting for a significant portion of this industry. The success of a hotel depends largely on its
ability to accurately forecast demand, allocate resources effectively, and optimize revenue.
Traditionally, hotels have relied on historical data and intuition to make these decisions.
However, with the growth of big data and machine learning, there is a significant opportunity to
improve the accuracy of demand forecasting and revenue optimization through data-driven
approaches.
The "Hotel Booking Demand" dataset provides a rich source of information that can be used to
explore the factors that influence hotel booking behavior and develop strategies to optimize
revenue. The dataset includes information on bookings made at two hotels in Portugal, including
details on the booking date, length of stay, number of adults and children, room type, and other
relevant information. The dataset also includes information on cancellations, which is a key
In recent years, there has been a significant growth in the use of machine learning in the hospitality industry,
with many hotels investing in data analytics and machine learning to improve their revenue management
strategies. The insights gained from analysis of the "Hotel Booking Demand" dataset can be used to develop
more accurate and effective models for predicting demand and optimizing revenue, providing hotels with a
Conclusion & Suggestions
2. Because the ratio of canceled to non-canceled bookings is higher for the resort
hotel than for the city hotel, the hotels should provide a reasonable
discount on the room prices on weekends or on holidays.
4. They can also improve the quality of their hotels and their services, mainly in
Portugal, to reduce the cancellation rate.
BIBLIOGRAPHY
Ahmed, A., Abdallah, A. B., & Naji, K. (2020). The relationship between lead time and
cancellation rates in hotel bookings: Evidence from a large dataset. Journal of Hospitality and
Tourism Management, 43, 91-98.
Bokde, N., Gupta, S., & Fazalbhoy, A. (2021). The impact of COVID-19 on hotel booking
demand: An empirical analysis using the "Hotel Booking Demand" dataset. Tourism
Management, 85, 104312.
Gnoth, J., & Zhang, J. (2020). The impact of online reviews on hotel booking intentions and
perception of trust: A mixed-methods approach. Journal of Hospitality and Tourism
Management, 43, 49-58.
Jorge, R., & van Hoof, H. (2019). Hotel demand datasets: A review. Journal of Hospitality and
Tourism Management, 40, 84-93.