Data Manipulation With Dplyr in R Cheat Sheet
Learn R online at www.DataCamp.com

# Create a full_address column by combining city and country
airbnb_listings %>%
  mutate(full_address = paste(city, country))

# Add the number of observations for a column (e.g., number of listings per city)
airbnb_listings %>%
  add_count(city)

[Figure: example tables df_1 and df_2, each with columns x1 and x2]

# Appending a table to the right side (horizontal) of another
bind_cols(df_1, df_2)

# Appending a table to the bottom (vertical) of another
bind_rows(df_1, df_2)

# Returning the rows that appear in either table, without duplicates
union(df_1, df_2)
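
As a rough illustration of the three calls above, the sketch below uses two small hypothetical tables. Their values are made up for the example and are not taken from the cheat sheet's figure; it also assumes dplyr is already installed.

```r
library(dplyr)

# Hypothetical tables: illustrative values only, both with columns x1 and x2
df_1 <- data.frame(x1 = c(1, 3, 5), x2 = c(2, 6, 4))
df_2 <- data.frame(x1 = c(1, 4, 2), x2 = c(2, 6, 5))

# Side by side: both inputs need the same number of rows;
# duplicated column names are repaired automatically
side_by_side <- bind_cols(df_1, df_2)

# Stacked: columns are matched by name
stacked <- bind_rows(df_1, df_2)

# Set union: the row (1, 2) occurs in both tables but is kept once
combined <- union(df_1, df_2)

dim(side_by_side)  # 3 rows, 4 columns
nrow(stacked)      # 6
nrow(combined)     # 5
```

Because both tables share the column names x1 and x2, bind_cols() has to rename them to keep the result unambiguous, so renaming one table's columns first may give a tidier result.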
# Finding identical rows in both tables
intersect(df_1, df_2)

Installing and loading dplyr

# Install it directly
install.packages("dplyr")

# Load it into the current session
library(dplyr)

The %>% operator lets you chain function calls elegantly and helps you make your code more readable. Consider this example of choosing columns a and b from the data frame df.

# Without the %>% operator
select(df, a, b)

# By using the %>% operator
df %>% select(a, b)

Dataset used throughout this cheat sheet

Throughout this cheat sheet, we will be using an example dataset called airbnb_listings, containing Airbnb listings with data on their location, year listed, number of rooms, and more.

airbnb_listings
listing_id  city      country  number_of_rooms  year_listed
1           Paris     France   5                2018
2           Tokyo     Japan    2                2017
3           New York  USA      2                2022

# Filter rows on one condition (e.g., country)
airbnb_listings %>%
  filter(country == "France")

# Filter rows on one OR another condition (country OR number_of_rooms)
airbnb_listings %>%
  filter(country == "France" | number_of_rooms > 2)

# Filter rows where a column matches any value in a list
airbnb_listings %>%
  filter(country %in% c("Japan", "France"))

# Filter rows based on index of rows (e.g., first 3 rows)
airbnb_listings %>%
  slice(1:3)

# Sort rows by values in a column in ascending order
airbnb_listings %>%
  arrange(number_of_rooms)

# Sort rows by values in a column in descending order
airbnb_listings %>%
  arrange(desc(city))

# Remove duplicate rows across the whole dataset
airbnb_listings %>%
  distinct()

# Keep only the distinct values of one column (e.g., country)
airbnb_listings %>%
  distinct(country)

# Select rows based on top-n values of a column (e.g., top 3 listings with the highest number of rooms)
airbnb_listings %>%
  top_n(3, number_of_rooms)

To showcase joins in dplyr, we’ll use an additional dataset, host_listings, containing host details for Airbnb listings.

host_listings
host_id  name                listing_id  number_of_reviews
1        Jen Bricker         1           34
2        Richie Cotton       2           12
3        Raven Todd Dasliva  3           55

Inner Join
# Returns only records where a joining field finds a match in both tables
airbnb_listings %>%
  inner_join(host_listings, by = "listing_id")
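
The inner join can be tried on small data frames built from the values shown in the two example tables. The sketch below assumes dplyr is installed. hosts_subset is a hypothetical helper table, introduced only so that one listing has no matching host and the effect of the join is visible.

```r
library(dplyr)

# Recreate the two example tables from the values shown in this cheat sheet
airbnb_listings <- data.frame(
  listing_id = c(1, 2, 3),
  city = c("Paris", "Tokyo", "New York"),
  country = c("France", "Japan", "USA"),
  number_of_rooms = c(5, 2, 2),
  year_listed = c(2018, 2017, 2022)
)

host_listings <- data.frame(
  host_id = c(1, 2, 3),
  name = c("Jen Bricker", "Richie Cotton", "Raven Todd Dasliva"),
  listing_id = c(1, 2, 3),
  number_of_reviews = c(34, 12, 55)
)

# Hypothetical variation: drop the host of listing 3 so the join has a non-match
hosts_subset <- host_listings %>%
  filter(listing_id != 3)

# Inner join keeps only the listings that have a matching host row
matched <- airbnb_listings %>%
  inner_join(hosts_subset, by = "listing_id")

nrow(matched)   # 2
names(matched)  # listing columns followed by the host columns
```

With the full host_listings table every listing has a match, so the inner join would return all three rows.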
Right Join
# Returns all records from the second table and only the matching records from the first
host_listings %>%
  right_join(airbnb_listings, by = "listing_id")

Full Join
# Returns all records from both tables, irrespective of whether there is a match on the joining field
host_listings %>%
  full_join(airbnb_listings, by = "listing_id")

Anti Join
# Returns records in the first table that have no match in the second table
airbnb_listings %>%
  anti_join(host_listings, by = "listing_id")

# Count groups within a column and return sorted
airbnb_listings %>%
  count(country, sort = TRUE)

# Return the total sum of values for a column (e.g., total number of rooms)
airbnb_listings %>%
  summarise(total_rooms = sum(number_of_rooms))

# Return the mean of values for a column (e.g., average number of rooms)
airbnb_listings %>%
  summarise(avg_room = mean(number_of_rooms))

# Return a custom summary statistic (e.g., average number of years a listing has been listed, as of 2022)
airbnb_listings %>%
  summarise(average_listing_duration = 2022 - mean(year_listed))

# Select columns based on start characters
airbnb_listings %>%
  select(starts_with("c"))

# Select columns based on end characters
airbnb_listings %>%
  select(ends_with("s"))

# Select all but one column (e.g., listing_id)
airbnb_listings %>%
  select(-listing_id)

# Select a range of columns (e.g., country through year_listed)
airbnb_listings %>%
  select(country:year_listed)
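
To see which columns each select() helper picks, the calls can be run on a small copy of airbnb_listings. This is a minimal sketch that assumes dplyr is installed and rebuilds the table from the three rows shown in this cheat sheet.

```r
library(dplyr)

# Example data recreated from the airbnb_listings table in this cheat sheet
airbnb_listings <- data.frame(
  listing_id = c(1, 2, 3),
  city = c("Paris", "Tokyo", "New York"),
  country = c("France", "Japan", "USA"),
  number_of_rooms = c(5, 2, 2),
  year_listed = c(2018, 2017, 2022)
)

# Columns whose names start with "c"
starts_c <- airbnb_listings %>% select(starts_with("c"))

# Columns whose names end with "s"
ends_s <- airbnb_listings %>% select(ends_with("s"))

# Every column except listing_id
no_id <- airbnb_listings %>% select(-listing_id)

# A contiguous range of columns, from country through year_listed
range_cols <- airbnb_listings %>% select(country:year_listed)

names(starts_c)    # "city" "country"
names(ends_s)      # "number_of_rooms"
names(range_cols)  # "country" "number_of_rooms" "year_listed"
```

The helpers match on column names, not on the values stored in the columns.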
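# Move columns to the front (e.g., city and country)
airbnb_listings %>%
  relocate(city, country)

# Rename a column (e.g., year_listed to year)
airbnb_listings %>%
  rename(year = year_listed)

# Group by a variable and return counts of each group (e.g., number of listings by country)
airbnb_listings %>%
  group_by(country) %>%
  summarise(n = n())

# Return the mean of values for a column (e.g., average number of rooms)
airbnb_listings %>%
  summarise(avg_rooms = mean(number_of_rooms))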
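
Grouping and summarising can also be combined to return several statistics per group in one call. The sketch below is one possible way to do that, assuming dplyr is installed. The fourth listing (Osaka) is a hypothetical row that is not part of the cheat sheet's dataset, added only so that one country has more than one listing.

```r
library(dplyr)

# Example data recreated from the airbnb_listings table in this cheat sheet
airbnb_listings <- data.frame(
  listing_id = c(1, 2, 3),
  city = c("Paris", "Tokyo", "New York"),
  country = c("France", "Japan", "USA"),
  number_of_rooms = c(5, 2, 2),
  year_listed = c(2018, 2017, 2022)
)

# Hypothetical extra row (not in the cheat sheet) so Japan has two listings
listings_plus <- airbnb_listings %>%
  bind_rows(data.frame(
    listing_id = 4, city = "Osaka", country = "Japan",
    number_of_rooms = 4, year_listed = 2020
  ))

# One row per country with the listing count and the average number of rooms
by_country <- listings_plus %>%
  group_by(country) %>%
  summarise(n = n(), avg_rooms = mean(number_of_rooms))

by_country
```

Here Japan ends up with n = 2 and avg_rooms = 3, while France and USA each keep a single listing.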