Pandas Exercise
Feel free to explore the file a bit before continuing with the rest of the exercise.
In [3]: hotels.head()
Out[3]:
   hotel         is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_date_week_number  ...
0  Resort Hotel            0        342               2015                July                        27  ...
1  Resort Hotel            0        737               2015                July                        27  ...
2  Resort Hotel            0          7               2015                July                        27  ...
3  Resort Hotel            0         13               2015                July                        27  ...
4  Resort Hotel            0         14               2015                July                        27  ...
5 rows × 36 columns
Out[4]: 119390
TASK: Is there any missing data? If so, which column has the most missing data?
In [5]: # CODE HERE
hotels.isnull().sum()
Out[5]: hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 4
babies 0
meal 0
country 488
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent 16340
company 112593
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
name 0
email 0
phone-number 0
credit_card 0
dtype: int64
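The second half of the task can also be answered in code instead of by reading the table. The following is a minimal sketch on a made-up three-column frame; the toy data and the names `toy`, `missing`, and `most_missing` are illustrative assumptions, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `hotels`; the values are invented for illustration.
toy = pd.DataFrame({
    "country": ["PRT", None, "GBR", "ESP"],
    "agent": [9.0, np.nan, np.nan, 240.0],
    "company": [np.nan, np.nan, np.nan, 40.0],
})

missing = toy.isnull().sum()      # number of missing values per column
most_missing = missing.idxmax()   # label of the column with the most missing values
print(most_missing, missing.max())
```

Applied to the full dataset, the same two calls should point at `company` (112593 missing values in the output above).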
In [7]: hotels.drop(columns=['company'],inplace=True)
TASK: What are the top 5 most common country codes in the dataset?
In [9]: hotels['country'].value_counts()[:5]
TASK: What is the name of the person who paid the highest ADR (average daily rate)? How much was their
ADR?
In [10]: # CODE HERE
hotels.sort_values('adr',ascending=False)[['name','adr']].iloc[0]
TASK: The adr is the average daily rate for a person's stay at the hotel. What is the mean adr across all the
hotel stays in the dataset?
Out[43]: 101.83
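The input cell for this answer is not shown. One way it might be computed is a plain column mean; the sketch below uses a made-up `adr` column, so the number it produces is not the dataset's.

```python
import pandas as pd

# Toy stand-in for `hotels`; the adr values are invented for illustration.
toy = pd.DataFrame({"adr": [75.0, 100.0, 125.5]})

# Mean of the average daily rate, rounded to 2 decimal places.
mean_adr = round(toy["adr"].mean(), 2)
print(mean_adr)
```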
TASK: What is the average (mean) number of nights for a stay across the entire dataset? Feel free to round
this to 2 decimal places.
In [47]: round(total_night_stay.mean(),2)
Out[47]: 3.43
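The cell that defines `total_night_stay` is not shown. A plausible construction, assuming it is the sum of the two night-count columns, is sketched below on invented data.

```python
import pandas as pd

# Toy stand-in for `hotels`; the night counts are invented for illustration.
toy = pd.DataFrame({
    "stays_in_weekend_nights": [0, 2, 1],
    "stays_in_week_nights": [1, 5, 3],
})

# Assumption: total nights = weekend nights + week nights.
total_night_stay = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]
avg_nights = round(total_night_stay.mean(), 2)
print(avg_nights)
```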
TASK: What is the average total cost for a stay in the dataset? Not the average daily cost, but the total stay cost. (You
will need to calculate the total cost yourself using ADR and the weekend-night and week-night stay counts.) Feel free to round
this to 2 decimal places.
In [52]: round(total_cost.mean(),2)
Out[52]: 357.85
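The definition of `total_cost` is likewise not shown. A minimal sketch, assuming total cost is the daily rate multiplied by the total number of nights, on invented data:

```python
import pandas as pd

# Toy stand-in for `hotels`; all values are invented for illustration.
toy = pd.DataFrame({
    "adr": [50.0, 100.0, 80.0],
    "stays_in_weekend_nights": [0, 2, 1],
    "stays_in_week_nights": [1, 5, 3],
})

# Assumption: total cost = average daily rate * (weekend nights + week nights).
nights = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]
total_cost = toy["adr"] * nights
avg_total_cost = round(total_cost.mean(), 2)
print(avg_total_cost)
```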
TASK: What are the names and emails of people who made exactly 5 "Special Requests"?
In [58]: # CODE HERE
hotels[hotels['total_of_special_requests']==5][['name','email']]
TASK: What percentage of hotel stays were classified as "repeat guests"? (Do not base this on the name of
the person, but on the is_repeated_guest column instead.)
In [77]: round((hotels['is_repeated_guest']==1).sum()/len(hotels['is_repeated_guest'])*100,2)
Out[77]: 3.19
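If `is_repeated_guest` only ever holds 0 or 1 (an assumption about the column), its mean is already the fraction of repeat guests, which gives a shorter equivalent. A sketch on invented data:

```python
import pandas as pd

# Toy stand-in for `hotels`; the flags are invented for illustration.
toy = pd.DataFrame({"is_repeated_guest": [0, 0, 1, 0]})

# The mean of a 0/1 column is the share of rows equal to 1.
pct_repeat = round(toy["is_repeated_guest"].mean() * 100, 2)
print(pct_repeat)
```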
TASK: What are the top 5 most common last names in the dataset? Bonus: Can you figure this out in one line
of pandas code? (For simplicity, treat a title such as MD as a last name; for example, Caroline Conley MD
can be said to have the last name MD.)
In [82]: last_name=first_last_name.str[-1]
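The cell above appears to rely on a `first_last_name` variable defined elsewhere, presumably the `name` column split on whitespace. Under that assumption, the bonus one-liner could look like the sketch below; the names in the toy frame are invented.

```python
import pandas as pd

# Toy stand-in for `hotels`; the names are invented for illustration.
toy = pd.DataFrame({"name": [
    "Ernest Barnes", "Andrea Baker", "Caroline Conley MD",
    "Linda Barnes", "Tom Smith MD", "Rebecca Barnes",
]})

# Split each name on whitespace, keep the last token, count, take the top 5.
top_last_names = toy["name"].str.split().str[-1].value_counts().head(5)
print(top_last_names)
```

Chaining `.str.split()` and `.str[-1]` keeps everything in one expression, so no intermediate variable is needed.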
TASK: What are the names of the people who booked the largest number of children and babies for their stay?
(Don't worry if they canceled; only consider the number of people reported at the time of their reservation.)
In [11]: hotels['total_kids']=hotels['babies']+hotels['children']
In [17]: hotels.sort_values('total_kids',ascending=False)[['name','adults','total_kids','babies','childre
Out[17]:
name adults total_kids babies children
TASK: What are the top 3 most common area codes in the phone numbers? (The area code is the first 3 digits.)
In [20]: area_codes.value_counts()[:3]
Out[21]: 58152
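The cell that builds `area_codes` is not shown. A minimal sketch, assuming `phone-number` is stored as a string whose first three characters are the area code; the numbers below are invented.

```python
import pandas as pd

# Toy stand-in for `hotels`; the phone numbers are invented for illustration.
toy = pd.DataFrame({"phone-number": [
    "669-792-1661", "669-111-2222", "858-637-6955",
]})

# Assumption: the first 3 characters of each string are the area code.
area_codes = toy["phone-number"].str[:3]
top_area_codes = area_codes.value_counts().head(3)
print(top_area_codes)
```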
HARD BONUS TASK: Create a table of counts for each day of the week that people arrived. (E.g., 5000 arrivals
were on a Monday, 3000 were on a Tuesday, etc.)
In [52]: date_to_day=hotels['date']
In [53]: date_to_day=pd.to_datetime(date_to_day)
In [55]: date_to_day.dt.day_name().value_counts()
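The `date` column used above is not among the columns listed earlier, so it was presumably created in a cell that is not shown. One way it might be built from the three arrival columns is sketched below; the string format and the toy rows are assumptions.

```python
import pandas as pd

# Toy stand-in for `hotels`; the arrival dates are invented for illustration.
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2015, 2015],
    "arrival_date_month": ["July", "July", "July"],
    "arrival_date_day_of_month": [1, 1, 4],
})

# Build strings such as "2015-July-1", then parse them with an explicit format
# (%B is the full English month name).
date_str = (
    toy["arrival_date_year"].astype(str) + "-"
    + toy["arrival_date_month"] + "-"
    + toy["arrival_date_day_of_month"].astype(str)
)
toy["date"] = pd.to_datetime(date_str, format="%Y-%B-%d")

# Count arrivals per weekday name.
day_counts = toy["date"].dt.day_name().value_counts()
print(day_counts)
```

Passing an explicit `format` avoids relying on pandas to guess how the month name is laid out.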