
Data Manipulation with dplyr in R Cheat Sheet

The document provides examples of using the dplyr package in R to manipulate data frames. It shows how to: 1) Create new columns by combining or transforming existing columns. 2) Filter rows based on conditions involving one or more columns, such as country or number of rooms. 3) Group and summarize data, for example by adding counts of observations per group such as the number of listings per city. 4) Join data frames on common columns such as listing_id.

Learn R online at www.DataCamp.com

> Helpful Syntax

Installing and loading dplyr

# Install dplyr through tidyverse
install.packages("tidyverse")

# Install it directly
install.packages("dplyr")

# Load dplyr into R
library(dplyr)

The %>% operator

%>% is the pipe operator, defined in the magrittr package and re-exported by dplyr. It passes the object on its left as the first argument of the function on its right, which helps you make your code more readable. Consider this example of choosing columns a and b from the data frame df:

# Without the %>% operator
select(df, a, b)

# By using the %>% operator
df %>% select(a, b)

> Dataset used throughout this cheat sheet

Throughout this cheat sheet, we will be using this example dataset called airbnb_listings, containing Airbnb listings with data on their location, year listed, number of rooms, and more.

airbnb_listings

listing_id  city      country  number_of_rooms  year_listed
1           Paris     France   5                2018
2           Tokyo     Japan    2                2017
3           New York  USA      2                2022

> Transforming data with dplyr

Basic column operations with dplyr

# Select one or more columns with select()
airbnb_listings %>%
  select(listing_id, city)

# Select columns based on start characters
airbnb_listings %>%
  select(starts_with("c"))

# Select columns based on end characters
airbnb_listings %>%
  select(ends_with("s"))

# Select all but one column (e.g., listing_id)
airbnb_listings %>%
  select(-listing_id)

# Select all columns within a range
airbnb_listings %>%
  select(country:year_listed)

# Reorder columns using relocate() (moves city and country to the front)
airbnb_listings %>%
  relocate(city, country)

# Rename a column using rename()
airbnb_listings %>%
  rename(year=year_listed)

# Select columns matching a regular expression
airbnb_listings %>%
  select(matches("(.n.)|(n.)"))

Creating new columns with dplyr

# Create a time_on_market column as the difference between the current year (2022 here) and year_listed
airbnb_listings %>%
  mutate(time_on_market = 2022 - year_listed)

# Create a full_address column by combining city and country (transmute() keeps only the new column)
airbnb_listings %>%
  transmute(full_address = paste(city, country))

# Add the number of observations for a column (e.g., number of listings per city)
airbnb_listings %>%
  add_count(city)

Working with rows

# Filter rows on one condition (e.g., country)
airbnb_listings %>%
  filter(country=="France")

# Filter rows where at least one of two conditions holds (country OR number_of_rooms)
airbnb_listings %>%
  filter(country=="France" | number_of_rooms > 3)

# Filter rows where both of two conditions hold (country AND number_of_rooms)
airbnb_listings %>%
  filter(country=="France" & number_of_rooms > 3)

# Filter by checking if a value exists in another set of values
airbnb_listings %>%
  filter(country %in% c("Japan", "France"))

# Filter rows based on index of rows (e.g., first 3 rows)
airbnb_listings %>%
  slice(1:3)

# Sort rows by values in a column in ascending order
airbnb_listings %>%
  arrange(number_of_rooms)

# Sort rows by values in a column in descending order
airbnb_listings %>%
  arrange(desc(city))

# Remove duplicate rows across the whole dataset
airbnb_listings %>%
  distinct()

# Find unique values in the country column
airbnb_listings %>%
  distinct(country)

# Select rows based on top-n values of a column (e.g., top 3 listings with the highest number of rooms)
# Note: top_n() is superseded by slice_max() in dplyr 1.0.0 and later
airbnb_listings %>%
  top_n(3, number_of_rooms)

> Aggregating data with dplyr

# Count rows per group in a column (e.g., number of listings per city)
airbnb_listings %>%
  count(city)

# Count rows per group in a column and return the counts sorted in descending order
airbnb_listings %>%
  count(country, sort=TRUE)

# Return the total sum of values for a column (e.g., total number of rooms)
airbnb_listings %>%
  summarise(total_rooms=sum(number_of_rooms))

# Return the average of values for a column (e.g., average number of rooms in a given listing)
airbnb_listings %>%
  summarise(avg_room=mean(number_of_rooms))

# Return a custom summary statistic (e.g., average time a listing has been on the market)
airbnb_listings %>%
  summarise(average_listing_duration= 2022 - mean(year_listed))

# Group by a variable and return counts of each group (e.g., number of listings by country)
airbnb_listings %>%
  group_by(country) %>%
  summarise(n=n())

# Group by a variable and return the average value per group (e.g., average number of rooms in listings per city)
airbnb_listings %>%
  group_by(city) %>%
  summarise(avg_rooms=mean(number_of_rooms))

> Combining tables in dplyr

[Figure: two example tables, df_1 and df_2, each with columns x1 and x2]

# Appending a table to the right side (horizontal) of another
bind_cols(df_1, df_2)

# Appending a table to the bottom (vertical) of another
bind_rows(df_1, df_2)

# Combining the rows of both tables and dropping duplicates
union(df_1, df_2)

# Finding identical rows in both tables
intersect(df_1, df_2)

# Finding rows that don't exist in another table
setdiff(df_1, df_2)

> Joining Tables with dplyr

To showcase joins in dplyr, we'll use an additional dataset, host_listings, containing details on the hosts of the Airbnb listings. It shares the listing_id column with airbnb_listings.

host_listings

host_id  name                listing_id  number_of_reviews
1        Jen Bricker         1           34
2        Richie Cotton       2           12
3        Raven Todd Dasliva  3           55

Joining tables in dplyr

Inner Join
# Returns only records where a joining field finds a match in both tables
airbnb_listings %>%
  inner_join(host_listings, by="listing_id")

Left Join
# Returns rows in the left table, with missing values for any columns from the right table where the joining field did not find a match
host_listings %>%
  left_join(airbnb_listings, by="listing_id")

Right Join
# Returns rows in the right table, with missing values for any columns from the left table where the joining field did not find a match
host_listings %>%
  right_join(airbnb_listings, by="listing_id")

Full Join
# Returns all records from both tables, irrespective of whether there is a match on the joining field
host_listings %>%
  full_join(airbnb_listings, by="listing_id")

Anti Join
# Returns records in the first table and excludes those with matching values in the second table
airbnb_listings %>%
  anti_join(host_listings, by="listing_id")
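
The snippets in this cheat sheet assume that airbnb_listings and host_listings already exist in the R session. As a minimal sketch, the two tables can be rebuilt by hand from the values shown in the cheat sheet and used to try a grouped summary and a join. The column types and the result names avg_rooms_by_city and listings_with_hosts are assumptions made for illustration, not part of the original sheet.

```r
library(dplyr)

# Example data copied from the tables shown in this cheat sheet
# (numeric and character column types are an assumption)
airbnb_listings <- tibble(
  listing_id = c(1, 2, 3),
  city = c("Paris", "Tokyo", "New York"),
  country = c("France", "Japan", "USA"),
  number_of_rooms = c(5, 2, 2),
  year_listed = c(2018, 2017, 2022)
)

host_listings <- tibble(
  host_id = c(1, 2, 3),
  name = c("Jen Bricker", "Richie Cotton", "Raven Todd Dasliva"),
  listing_id = c(1, 2, 3),
  number_of_reviews = c(34, 12, 55)
)

# Average number of rooms per city (one row per city)
avg_rooms_by_city <- airbnb_listings %>%
  group_by(city) %>%
  summarise(avg_rooms = mean(number_of_rooms))

# Keep only listings that have a matching host record
listings_with_hosts <- airbnb_listings %>%
  inner_join(host_listings, by = "listing_id")

print(avg_rooms_by_city)
print(listings_with_hosts)
```

Because every listing_id in this small example appears in both tables, the inner join keeps all three listings; the other join types only differ from it once one table has a listing_id that the other lacks.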
