0% found this document useful (0 votes)
163 views6 pages

Pandas Exercise

Uploaded by

vesike3421
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
163 views6 pages

Pandas Exercise

Uploaded by

vesike3421
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

TASK: Run the following code to read in the "hotel_booking_data.csv" file.

Feel free
to explore the file a bit before continuing with the rest of the exercise.

In [1]: import pandas as pd

In [2]: hotels = pd.read_csv("C:\\Users\\HP\\Desktop\\Python\\Code\\UNZIP_FOR_NOTEBOOKS_FINAL\\03-Pandas

In [3]: hotels.head()

Out[3]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_m

Resort
0 0 342 2015 July 27
Hotel

Resort
1 0 737 2015 July 27
Hotel

Resort
2 0 7 2015 July 27
Hotel

Resort
3 0 13 2015 July 27
Hotel

Resort
4 0 14 2015 July 27
Hotel

5 rows × 36 columns

TASK: How many rows are there?

In [4]: # CODE HERE


len(hotels)

Out[4]: 119390

TASK: Is there any missing data? If so, which column has the most missing data?
In [5]: # CODE HERE
hotels.isnull().sum()

Out[5]: hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 4
babies 0
meal 0
country 488
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent 16340
company 112593
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
name 0
email 0
phone-number 0
credit_card 0
dtype: int64

In [6]: print(f"Yes, missing data, company column missing: {hotels['company'].isna().sum()} rows.")

Yes, missing data, company column missing: 112593 rows.

TASK: Drop the "company" column from the dataset.

In [7]: hotels.drop(columns=['company'],inplace=True)

TASK: What are the top 5 most common country codes in the dataset?

In [9]: hotels['country'].value_counts()[:5]

Out[9]: PRT 48590


GBR 12129
FRA 10415
ESP 8568
DEU 7287
Name: country, dtype: int64

TASK: What is the name of the person who paid the highest ADR (average daily rate)? How much was their
ADR?
In [10]: # CODE HERE
hotels.sort_values('adr',ascending=False)[['name','adr']].iloc[0]

Out[10]: name Daniel Walter


adr 5400.0
Name: 48515, dtype: object

TASK: The adr is the average daily rate for a person's stay at the hotel. What is the mean adr across all the
hotel stays in the dataset?

In [43]: # CODE HERE


round(hotels['adr'].mean(),2)

Out[43]: 101.83

TASK: What is the average (mean) number of nights for a stay across the entire data set? Feel free to round
this to 2 decimal points.

In [46]: # CODE HERE


total_night_stay=hotels['stays_in_week_nights']+hotels['stays_in_weekend_nights']

In [47]: round(total_night_stay.mean(),2)

Out[47]: 3.43

TASK: What is the average total cost for a stay in the dataset? Not average daily cost, but total stay cost. (You
will need to calculate total cost your self by using ADR and week day and weeknight stays). Feel free to round
this to 2 decimal points.

In [49]: # CODE HERE


total_cost=hotels['adr']*total_night_stay

In [52]: round(total_cost.mean(),2)

Out[52]: 357.85

TASK: What are the names and emails of people who made exactly 5 "Special Requests"?
In [58]: # CODE HERE
hotels[hotels['total_of_special_requests']==5][['name','email']]

Out[58]: name email

7860 Amanda Harper [email protected]

11125 Laura Sanders [email protected]

14596 Tommy Ortiz [email protected]

14921 Gilbert Miller [email protected]

14922 Timothy Torres [email protected]

24630 Jennifer Weaver [email protected]

27288 Crystal Horton [email protected]

27477 Brittney Burke [email protected]

29906 Cynthia Cabrera [email protected]

29949 Sarah Floyd [email protected]

32267 Michelle Villa [email protected]

39027 Nichole Hebert [email protected]

39129 Lindsey Mckenzie [email protected]

39525 Ashley Edwards [email protected]

70114 Christopher Torres [email protected]

78819 Mrs. Tara Sullivan DVM [email protected]

78820 Michaela Brown [email protected]

78822 Kurt Maldonado MD [email protected]

97072 Jason Richardson [email protected]

97099 Terri Hurley [email protected]

97261 Mrs. Caitlin Webb [email protected]

98410 Holly Arroyo [email protected]

98674 Denise Campbell [email protected]

99887 Michael Smith [email protected]

99888 Dr. Trevor Sellers [email protected]

101569 Kayla Murphy [email protected]

102061 Taylor Martinez [email protected]

109511 Charles Wilson [email protected]

109590 Tyler Allison [email protected]

110082 Matthew Bailey [email protected]

110083 Charlotte Acevedo [email protected]

111909 Darrell Brennan [email protected]

111911 Melinda Jensen [email protected]

113915 Terry Arnold [email protected]

114770 Mary Nguyen [email protected]

114909 Lindsay Cuevas [email protected]

116455 Cynthia Hernandez [email protected]

116457 Angela Hawkins [email protected]

118817 Sue Lawson [email protected]

119161 Alyssa Richards [email protected]

TASK: What percentage of hotel stays were classified as "repeat guests"? (Do not base this off the name of
the person, but instead of the is_repeated_guest column)
In [77]: round((hotels['is_repeated_guest']==1).sum()/len(hotels['is_repeated_guest'])*100,2)

Out[77]: 3.19

In [ ]: ​

TASK: What are the top 5 most common last name in the dataset? Bonus: Can you figure this out in one line
of pandas code? (For simplicity treat the a title such as MD as a last name, for example Caroline Conley MD
can be said to have the last name MD)

In [80]: #CODE HERE


first_last_name=hotels['name'].str.split()

In [82]: last_name=first_last_name.str[-1]

In [86]: hotels['name'].apply(lambda name: name.split()[1]).value_counts()[:5]

Out[86]: Smith 2510


Johnson 1998
Williams 1628
Jones 1441
Brown 1433
Name: name, dtype: int64

TASK: What are the names of the people who had booked the most number children and babies for their stay?
(Don't worry if they canceled, only consider number of people reported at the time of their reservation)

In [11]: hotels['total_kids']=hotels['babies']+hotels['children']

In [17]: hotels.sort_values('total_kids',ascending=False)[['name','adults','total_kids','babies','childre

Out[17]:
name adults total_kids babies children

328 Jamie Ramirez 2 10.0 0 10.0

46619 Nicholas Parker 2 10.0 10 0.0

78656 Marc Robinson 1 9.0 9 0.0

19718 Mr. Jeffrey Cross 2 3.0 0 3.0

107837 Albert French 2 3.0 2 1.0

... ... ... ... ... ...

119389 Ariana Michael 2 0.0 0 0.0

40600 Craig Campos 2 NaN 0 NaN

40667 David Murphy 2 NaN 0 NaN

40679 Frank Burton 3 NaN 0 NaN

41160 Jerry Roberts 2 NaN 0 NaN

119390 rows × 5 columns

TASK: What are the top 3 most common area code in the phone numbers? (Area code is first 3 digits)

In [18]: #CODE HERE


area_codes=hotels['phone-number'].str[:3]

In [20]: area_codes.value_counts()[:3]

Out[20]: 799 168


185 167
541 166
Name: phone-number, dtype: int64
TASK: How many arrivals took place between the 1st and the 15th of the month (inclusive of 1 and 15) ?
Bonus: Can you do this in one line of pandas code?

In [21]: #CODE HERE


hotels['arrival_date_day_of_month'].apply(lambda day:day in range(1,16)).sum()

Out[21]: 58152

HARD BONUS TASK: Create a table for counts for each day of the week that people arrived. (E.g. 5000 arrivals
were on a Monday, 3000 were on a Tuesday, etc..)

In [47]: def convert_to_proper(day,month,year):


return f'{day}-{month}-{year}'

In [50]: import numpy as np


hotels['date']=np.vectorize(convert_to_proper)(hotels['arrival_date_day_of_month'],
hotels['arrival_date_month'],
hotels['arrival_date_year'])

In [52]: date_to_day=hotels['date']

In [53]: date_to_day=pd.to_datetime(date_to_day)

In [55]: date_to_day.dt.day_name().value_counts()

Out[55]: Friday 19631


Thursday 19254
Monday 18171
Saturday 18055
Wednesday 16139
Sunday 14141
Tuesday 13999
Name: date, dtype: int64

You might also like