Data Transformation with dplyr : : CHEAT SHEET

dplyr functions work with pipes and expect tidy data. In tidy data, each variable is in its own column and each observation, or case, is in its own row. With pipes, x %>% f(y) becomes f(x, y).

Summarise Cases

These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).

summarise(.data, …) Compute table of summaries.
summarise(mtcars, avg = mean(mpg))

count(x, ..., wt = NULL, sort = FALSE) Count number of rows in each group defined by the variables in … Also tally().
count(mtcars, cyl)

Manipulate Cases

EXTRACT CASES
Row functions return a subset of rows as a new table.

filter(.data, …) Extract rows that meet logical criteria.
filter(mtcars, mpg > 20)

distinct(.data, ..., .keep_all = FALSE) Remove rows with duplicate values.
distinct(mtcars, gear)

slice(.data, …) Select rows by position.
slice(mtcars, 10:15)

slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE) Randomly select rows. Use n to select a number of rows and prop to select a fraction of rows.
slice_sample(mtcars, n = 5, replace = TRUE)

slice_min(.data, order_by, ..., n, prop, with_ties = TRUE) and slice_max() Select rows with the lowest and highest values.
slice_min(mtcars, mpg, prop = 0.25)

Manipulate Variables

EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.

pull(.data, var = -1) Extract column values as a vector. Choose by name or index.
pull(mtcars, wt)

select(.data, …) Extract columns as a table. Also select_if().
select(mtcars, mpg, wt)

relocate(.data, …, .before = NULL, .after = NULL) Move columns to new position.
relocate(mtcars, mpg, cyl, .after = last_col())

Use these helpers with select() and across(), e.g. select(mtcars, mpg:cyl)
contains(match)
ends_with(match)
matches(match)
num_range(prefix, range)
one_of(…)
starts_with(match)
:, e.g. mpg:cyl
-, e.g. -gear
everything()
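
The row, column, and summary functions above can be chained with the pipe. The following is a minimal sketch using the built-in mtcars data; the object names (efficient, avg_tbl, cyl_counts, weights) are illustrative and not part of the cheat sheet.

```r
library(dplyr)

# Rows first, then columns: keep cars with mpg above 20,
# then keep only the mpg and wt columns.
efficient <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, wt)

# Summary functions collapse a column to one value per table (or group).
avg_tbl <- summarise(mtcars, avg = mean(mpg))
cyl_counts <- count(mtcars, cyl)

# pull() returns a plain vector rather than a table.
weights <- pull(mtcars, wt)
```

Because x %>% f(y) becomes f(x, y), the first pipeline is equivalent to select(filter(mtcars, mpg > 20), mpg, wt).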
Group Cases

Use group_by(.data, ..., .add = FALSE) to create a "grouped" copy of a table grouped by columns in ... dplyr functions will manipulate each "group" separately and combine the results.

mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(mpg))

Use rowwise(.data, ...) to group data into individual rows. dplyr functions will compute results for each row. Also used to apply functions to list-columns without purrr functions.

starwars %>%
rowwise() %>%
mutate(film_count = length(films))

ungroup(x, …) Returns ungrouped copy of table.
ungroup(g_mtcars)

EXTRACT CASES (continued)

slice_head(.data, ..., n, prop) and slice_tail() Select the first or last rows.
slice_head(mtcars, n = 5)

Logical and boolean operators to use with filter()
==  <  <=  is.na()   %in%  |  xor()
!=  >  >=  !is.na()  !     &
See ?base::Logic and ?Comparison for help.

ARRANGE CASES

arrange(.data, …) Order rows by values of a column or columns (low to high), use with desc() to order from high to low.
arrange(mtcars, mpg)
arrange(mtcars, desc(mpg))

ADD CASES

add_row(.data, ..., .before = NULL, .after = NULL) Add one or more rows to a table.
add_row(cars, speed = 1, dist = 1)

MANIPULATE MULTIPLE VARIABLES AT ONCE

across(.cols, .funs) Summarise or mutate multiple columns in the same way.
summarise(mtcars, across(everything(), mean))

c_across(.cols) Compute across columns in row-wise data.
transmute(rowwise(UKgas), n = sum(c_across(1:2)))

MAKE NEW VARIABLES

These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back).

mutate(.data, …, .before = NULL, .after = NULL) Compute new column(s). Also add_column(), add_count(), and add_tally().
mutate(mtcars, gpm = 1/mpg)

transmute(.data, …) Compute new column(s), drop others.
transmute(mtcars, gpm = 1/mpg)

rename(.data, …) Rename columns.
rename(cars, distance = dist)
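
As a rough illustration of grouping, making new variables, and working on several columns at once, the sketch below again uses mtcars. Two assumptions are worth flagging: g_mtcars is taken to be a grouped copy of mtcars (the sheet uses the name without defining it), and mtcars stands in for the data in the c_across() step so that the input is a data frame.

```r
library(dplyr)

# Grouped summary: one row per cyl, ordered from highest to lowest average.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg), n = n()) %>%
  arrange(desc(avg))

# Assumed definition of g_mtcars: a grouped copy; ungroup() drops the grouping.
g_mtcars <- group_by(mtcars, cyl)
plain <- ungroup(g_mtcars)

# mutate() adds a column; transmute() would keep only gpm.
with_gpm <- mutate(mtcars, gpm = 1 / mpg)

# across(): apply the same function to several columns in one call.
col_means <- summarise(mtcars, across(c(mpg, wt), mean))

# c_across(): combine values across columns within each row.
row_totals <- mtcars %>%
  rowwise() %>%
  transmute(total = sum(c_across(c(mpg, cyl)))) %>%
  ungroup()
```

The trailing ungroup() after the row-wise step returns an ordinary table, so later verbs operate on whole columns again rather than row by row.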

RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 1.0.6 • tibble 3.1.2 • Updated: 2021-06



Vectorized Functions

TO USE WITH MUTATE()
mutate() and transmute() apply vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

OFFSETS
dplyr::lag() - Offset elements by 1
dplyr::lead() - Offset elements by -1

CUMULATIVE AGGREGATES
dplyr::cumall() - Cumulative all()
dplyr::cumany() - Cumulative any()
cummax() - Cumulative max()
dplyr::cummean() - Cumulative mean()
cummin() - Cumulative min()
cumprod() - Cumulative prod()
cumsum() - Cumulative sum()

RANKINGS
dplyr::cume_dist() - Proportion of all values <=
dplyr::dense_rank() - rank w ties = min, no gaps
dplyr::min_rank() - rank with ties = min
dplyr::ntile() - bins into n bins
dplyr::percent_rank() - min_rank scaled to [0,1]
dplyr::row_number() - rank with ties = "first"

MATH
+, - , *, /, ^, %/%, %% - arithmetic ops
log(), log2(), log10() - logs
<, <=, >, >=, !=, == - logical comparisons
dplyr::between() - x >= left & x <= right
dplyr::near() - safe == for floating point numbers

MISC
dplyr::case_when() - multi-case if_else()
starwars %>% mutate(type = case_when(
height > 200 | mass > 200 ~ "large",
species == "Droid" ~ "robot",
TRUE ~ "other"))
dplyr::coalesce() - first non-NA values by element across a set of vectors
dplyr::if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()
dplyr::recode() - Vectorized switch()
dplyr::recode_factor() - Vectorized switch() for factors

Summary Functions

TO USE WITH SUMMARISE()
summarise() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.

COUNTS
dplyr::n() - number of values/rows
dplyr::n_distinct() - # of uniques
sum(!is.na()) - # of non-NA's

LOCATION
mean() - mean, also mean(!is.na())
median() - median

LOGICALS
mean() - Proportion of TRUE's
sum() - # of TRUE's

POSITION/ORDER
dplyr::first() - first value
dplyr::last() - last value
dplyr::nth() - value in nth location of vector

RANK
quantile() - nth quantile
min() - minimum value
max() - maximum value

SPREAD
IQR() - Inter-Quartile Range
mad() - median absolute deviation
sd() - standard deviation
var() - variance

Row Names

Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

rownames_to_column() Move row names into col.
a <- rownames_to_column(mtcars, var = "C")

column_to_rownames() Move col into row names.
column_to_rownames(a, var = "C")

Also has_rownames(), remove_rownames()

Combine Tables

COMBINE VARIABLES

Example tables:

x          y
A B C      A B D
a t 1      a t 3
b u 2      b u 2
c v 3      d w 1

Use bind_cols() to paste tables beside each other as they are.

bind_cols(…) Returns tables placed side by side as a single table. BE SURE THAT ROWS ALIGN.

Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join matching values from y to x.
A B C D
a t 1 3
b u 2 2
c v 3 NA

right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join matching values from x to y.
A B C  D
a t 1  3
b u 2  2
d w NA 1

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join data. Retain only rows with matches.
A B C D
a t 1 3
b u 2 2

full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join data. Retain all values, all rows.
A B C  D
a t 1  3
b u 2  2
c v 3  NA
d w NA 1

Use by = c("col1", "col2", …) to specify one or more common columns to match on.
left_join(x, y, by = "A")
A B.x C B.y D
a t   1 t   3
b u   2 u   2
c v   3 NA  NA

Use a named vector, by = c("col1" = "col2"), to match on columns that have different names in each table.
left_join(x, y, by = c("C" = "D"))
A.x B.x C A.y B.y
a   t   1 d   w
b   u   2 b   u
c   v   3 a   t

Use suffix to specify the suffix to give to unmatched columns that have the same name in both tables.
left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))
A1 B1 C A2 B2
a  t  1 d  w
b  u  2 b  u
c  v  3 a  t

COMBINE CASES

Example tables:

x          y
A B C      A B C
a t 1      c v 3
b u 2      d w 4
c v 3

Use bind_rows() to paste tables below each other as they are.

bind_rows(…, .id = NULL) Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names (as pictured).
DF A B C
x  a t 1
x  b u 2
x  c v 3
z  c v 3
z  d w 4

intersect(x, y, …) Rows that appear in both x and y.
A B C
c v 3

setdiff(x, y, …) Rows that appear in x but not y.
A B C
a t 1
b u 2

union(x, y, …) Rows that appear in x or y. (Duplicates removed). union_all() retains duplicates.
A B C
a t 1
b u 2
c v 3
d w 4

Use setequal() to test whether two data sets contain the exact same rows (in any order).

EXTRACT ROWS

Use a "Filtering Join" to filter one table against the rows of another. The results below use the x and y tables from COMBINE VARIABLES.

semi_join(x, y, by = NULL, …) Return rows of x that have a match in y. USEFUL TO SEE WHAT WILL BE JOINED.
A B C
a t 1
b u 2

anti_join(x, y, by = NULL, …) Return rows of x that do not have a match in y. USEFUL TO SEE WHAT WILL NOT BE JOINED.
A B C
c v 3
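
To make the difference between vectorized and summary functions concrete, here is a small sketch on a made-up five-row table. The table scores and its column x are hypothetical, chosen only so that every result can be checked by hand.

```r
library(dplyr)

# Hypothetical data with one missing value.
scores <- tibble(x = c(10, 20, 20, NA, 50))

# Vectorized functions: each new column has one value per input row.
out <- scores %>%
  mutate(
    prev     = lag(x),          # previous element, NA for the first row
    filled   = coalesce(x, 0),  # replace NA with 0
    running  = cumsum(filled),  # cumulative sum
    rank_min = min_rank(x),     # ties share the lowest rank; NA stays NA
    size     = case_when(
      is.na(x) ~ "missing",
      x >= 50  ~ "large",
      x >= 20  ~ "medium",
      TRUE     ~ "small"
    )
  )

# Summary functions: each expression collapses the column to a single value.
summ <- summarise(
  scores,
  n      = n(),
  non_na = sum(!is.na(x)),
  first  = first(x),
  last   = last(x),
  avg    = mean(x, na.rm = TRUE)
)
```

case_when() evaluates its conditions in order, which is why the is.na() case is listed first in this sketch.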

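
The join and bind functions from the Combine Tables section can be tried on small tables like the ones pictured there. This sketch rebuilds them by hand as an assumption about what the diagrams show; the table z and the result names are illustrative.

```r
library(dplyr)

# Hand-built versions of the pictured example tables.
x <- tibble(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3)
y <- tibble(A = c("a", "b", "d"), B = c("t", "u", "w"), D = 3:1)
z <- tibble(A = c("c", "d"), B = c("v", "w"), C = 3:4)

# Mutating joins: add columns from y to x, matching on A and B.
left  <- left_join(x, y, by = c("A", "B"))   # keeps every row of x
inner <- inner_join(x, y, by = c("A", "B"))  # keeps only matched rows
full  <- full_join(x, y, by = c("A", "B"))   # keeps all rows of both

# Matching on A alone leaves two B columns, disambiguated by suffix.
by_a <- left_join(x, y, by = "A")

# Filtering joins: keep or drop rows of x; no columns are added.
matched   <- semi_join(x, y, by = "A")
unmatched <- anti_join(x, y, by = "A")

# Stack tables and record which table each row came from.
stacked <- bind_rows(x = x, z = z, .id = "DF")
```

Running anti_join() before a join is a cheap way to see which rows would end up with NA values afterwards.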