Data Transformation with dplyr
dplyr functions work with pipes and expect tidy data. In tidy data:
Each variable is in its own column.
Each observation, or case, is in its own row.
Pipes: x %>% f(y) becomes f(x, y).

Summarise Cases

These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).

summarise(.data, …) Compute table of summaries.
summarise(mtcars, avg = mean(mpg))

count(x, ..., wt = NULL, sort = FALSE) Count number of rows in each group defined by the variables in … Also tally().
count(mtcars, cyl)

Group Cases

Use group_by(.data, ..., .add = FALSE) to create a "grouped" copy of a table grouped by columns in ... dplyr functions will manipulate each "group" separately and combine the results.
mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg))

Use rowwise(.data, ...) to group data into individual rows. dplyr functions will compute results for each row. Also used to apply functions to list-columns without purrr functions.
starwars %>%
  rowwise() %>%
  mutate(film_count = length(films))

Manipulate Cases

EXTRACT CASES
Row functions return a subset of rows as a new table.

filter(.data, …) Extract rows that meet logical criteria.
filter(mtcars, mpg > 20)

distinct(.data, ..., .keep_all = FALSE) Remove rows with duplicate values.
distinct(mtcars, gear)

slice(.data, …) Select rows by position.
slice(mtcars, 10:15)

slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE) Randomly select rows. Use n to select a number of rows and prop to select a fraction of rows.
slice_sample(mtcars, n = 5, replace = TRUE)

slice_min(.data, order_by, ..., n, prop, with_ties = TRUE) and slice_max() Select rows with the lowest and highest values.
slice_min(mtcars, mpg, prop = 0.25)

slice_head(.data, ..., n, prop) and slice_tail() Select the first or last rows.
slice_head(mtcars, n = 5)

Logical and boolean operators to use with filter()
<    <=    is.na()     %in%    |    xor()
>    >=    !is.na()    !       &
See ?base::Logic and ?Comparison for help.

ARRANGE CASES
arrange(.data, …) Order rows by values of a column or columns (low to high), use with desc() to order from high to low.
arrange(mtcars, mpg)
arrange(mtcars, desc(mpg))

ADD CASES

Manipulate Variables

EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.

pull(.data, var = -1) Extract column values as a vector. Choose by name or index.
pull(mtcars, wt)

select(.data, …) Extract columns as a table. Also select_if().
select(mtcars, mpg, wt)

relocate(.data, …, .before = NULL, .after = NULL) Move columns to new position.
relocate(mtcars, mpg, cyl, .after = last_col())

Use these helpers with select() and across(), e.g. select(mtcars, mpg:cyl)
contains(match)     num_range(prefix, range)    :, e.g. mpg:cyl
ends_with(match)    one_of(…)                   -, e.g. -gear
matches(match)      starts_with(match)          everything()

MANIPULATE MULTIPLE VARIABLES AT ONCE
across(.cols, .funs) Summarise or mutate multiple columns in the same way.
summarise(mtcars, across(everything(), mean))

c_across(.cols) Compute across columns in row-wise data.
transmute(rowwise(UKgas), n = sum(c_across(1:2)))

MAKE NEW VARIABLES
These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back).

mutate(.data, …, .before = NULL, .after = NULL) Compute new column(s). Also add_column(), add_count(), and add_tally().
mutate(mtcars, gpm = 1/mpg)

transmute(.data, …) Compute new column(s),

rename(cars, distance = dist)
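
The verbs on this page are meant to be chained with %>%. The following is a minimal sketch of one such chain, assuming dplyr is installed and using the built-in mtcars data. The object names by_cyl, efficient, and col_means are illustrative only and do not come from the cheat sheet.

```r
library(dplyr)

# Group by cylinder count, then summarise each group separately.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg), n = n())

# Extract rows, add a derived column, keep a few columns,
# and order the result from high to low mpg.
efficient <- mtcars %>%
  filter(mpg > 20) %>%
  mutate(gpm = 1 / mpg) %>%
  select(mpg, gpm, wt) %>%
  arrange(desc(mpg))

# Apply the same summary function to several columns at once.
col_means <- summarise(mtcars, across(c(mpg, wt), mean))
```

Each step takes a table and returns a table, which is why the calls compose without intermediate variables.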
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 1.0.6 • tibble 3.1.2 • Updated: 2021-06
Vectorized Functions

dplyr::near() - safe == for floating point numbers

MISC
dplyr::case_when() - multi-case if_else()
starwars %>% mutate(type = case_when(
  height > 200 | mass > 200 ~ "large",
  species == "Droid" ~ "robot",
  TRUE ~ "other"))
dplyr::coalesce() - first non-NA values by element across a set of vectors
dplyr::if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()
dplyr::recode() - Vectorized switch()
dplyr::recode_factor() - Vectorized switch() for factors

Summary Functions

sd() - standard deviation
var() - variance

Row Names

Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

rownames_to_column() Move row names into col.
a <- rownames_to_column(mtcars, var = "C")

column_to_rownames() Move col into row names.
column_to_rownames(a, var = "C")

Also has_rownames(), remove_rownames()

Use bind_cols() to paste tables beside each other as they are.

Example tables used in the joins:
x        y
A B C    A B D
a t 1    a t 3
b u 2    b u 2
c v 3    d w 1

Use by = c("col1", "col2", …) to specify one or more common columns to match on.
left_join(x, y, by = "A")
Result columns: A B.x C B.y D

Use a named vector, by = c("col1" = "col2"), to match on columns that have different names in each table.
left_join(x, y, by = c("C" = "D"))
Result columns: A.x B.x C A.y B.y

Use suffix to specify the suffix to give to unmatched columns that have the same name in both tables.
left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))
Result columns: A1 B1 C A2 B2

EXTRACT ROWS
Use a "Filtering Join" to filter one table against the rows of another.

semi_join(x, y, by = NULL, …) Return rows of x that have a match in y. USEFUL TO SEE WHAT WILL BE JOINED.

anti_join(x, y, by = NULL, …) Return rows of x that do not have a match in y. USEFUL TO SEE WHAT WILL NOT BE JOINED.
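
The join notes on this page can be tried on two small tables. This is a minimal sketch, assuming dplyr is installed. The tables x and y are rebuilt here from the A/B/C and A/B/D example values shown on the sheet, and the result names joined, renamed, matched, and unmatched are illustrative only.

```r
library(dplyr)

# Example tables: x has columns A, B, C; y has columns A, B, D.
x <- tibble(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3)
y <- tibble(A = c("a", "b", "d"), B = c("t", "u", "w"), D = 3:1)

# Match on one common column. B exists in both tables but is not a key,
# so it receives the default suffixes .x and .y.
joined <- left_join(x, y, by = "A")

# Match on columns with different names (C in x, D in y) and choose
# custom suffixes for the clashing non-key columns A and B.
renamed <- left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))

# Filtering joins return rows of x only; no columns from y are added.
matched   <- semi_join(x, y, by = "A")  # rows of x with a match in y
unmatched <- anti_join(x, y, by = "A")  # rows of x without a match in y
```

Running semi_join() and anti_join() before a mutating join is a cheap way to see which rows will and will not find a partner.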