Data Transformation

Uploaded by

Florence Cheang
Copyright
© © All Rights Reserved

Data transformation with dplyr : : CHEATSHEET

dplyr functions work with pipes and expect tidy data. In tidy data, each variable is in its own column and each observation, or case, is in its own row. With pipes, x |> f(y) becomes f(x, y).

Summarize Cases

Apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).

summarize(.data, …) Compute table of summaries.
mtcars |> summarize(avg = mean(mpg))

count(.data, …, wt = NULL, sort = FALSE, name = NULL) Count number of rows in each group defined by the variables in … Also tally(), add_count(), add_tally().
mtcars |> count(cyl)

Manipulate Cases

EXTRACT CASES
Row functions return a subset of rows as a new table.

filter(.data, …, .preserve = FALSE) Extract rows that meet logical criteria.
mtcars |> filter(mpg > 20)

distinct(.data, …, .keep_all = FALSE) Remove rows with duplicate values.
mtcars |> distinct(gear)

slice(.data, …, .preserve = FALSE) Select rows by position.
mtcars |> slice(10:15)

slice_sample(.data, …, n, prop, weight_by = NULL, replace = FALSE) Randomly select rows. Use n to select a number of rows and prop to select a fraction of rows.
mtcars |> slice_sample(n = 5, replace = TRUE)

slice_min(.data, order_by, …, n, prop, with_ties = TRUE) and slice_max() Select rows with the lowest and highest values.
mtcars |> slice_min(mpg, prop = 0.25)

slice_head(.data, …, n, prop) and slice_tail() Select the first or last rows.
mtcars |> slice_head(n = 5)

Logical and boolean operators to use with filter()
==   <   <=   is.na()    %in%   |   xor()
!=   >   >=   !is.na()   !      &
See ?base::Logic and ?Comparison for help.

Manipulate Variables

EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.

pull(.data, var = -1, name = NULL, …) Extract column values as a vector, by name or index.
mtcars |> pull(wt)

select(.data, …) Extract columns as a table.
mtcars |> select(mpg, wt)

relocate(.data, …, .before = NULL, .after = NULL) Move columns to new position.
mtcars |> relocate(mpg, cyl, .after = last_col())

Use these helpers with select() and across(), e.g. mtcars |> select(mpg:cyl)
contains(match)      num_range(prefix, range)       :, e.g., mpg:cyl
ends_with(match)     all_of(x)/any_of(x, …, vars)   !, e.g., !gear
starts_with(match)   matches(match)                 everything()

MANIPULATE MULTIPLE VARIABLES AT ONCE
df <- tibble(x_1 = c(1, 2), x_2 = c(3, 4), y = c(4, 5))

across(.cols, .funs, …, .names = NULL) Summarize or mutate multiple columns in the same way.
df |> summarize(across(everything(), mean))

c_across(.cols) Compute across columns in row-wise data.
df |>
  rowwise() |>
  mutate(x_total = sum(c_across(1:2)))

Group Cases

Use group_by(.data, …, .add = FALSE, .drop = TRUE) to create a "grouped" copy of a table grouped by columns in ... dplyr functions will manipulate each "group" separately and combine the results.
mtcars |>
  group_by(cyl) |>
  summarize(avg = mean(mpg))
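As a minimal sketch of the pipe and grouped-summary patterns described above, the following uses the built-in mtcars data. The output column names avg and n and the object names by_cyl, piped and nested are illustrative choices, not part of the cheatsheet.

```r
library(dplyr)

# The pipe passes its left-hand side as the first argument,
# so these two calls are equivalent.
piped  <- mtcars |> filter(mpg > 20)
nested <- filter(mtcars, mpg > 20)

# Group by cylinder count, then compute one summary row per group.
by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(avg = mean(mpg), n = n())

print(by_cyl)
```

Because summarize() returns one row per group, by_cyl has as many rows as there are distinct values of cyl.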
Use rowwise(.data, …) to group data into individual rows. dplyr functions will compute results for each row. Also apply functions to list-columns. See tidyr cheat sheet for list-column workflow.
starwars |>
  rowwise() |>
  mutate(film_count = length(films))

ungroup(x, …) Returns ungrouped copy of table.
g_mtcars <- mtcars |> group_by(cyl)
ungroup(g_mtcars)

ARRANGE CASES
arrange(.data, …, .by_group = FALSE) Order rows by values of a column or columns (low to high), use with desc() to order from high to low.
mtcars |> arrange(mpg)
mtcars |> arrange(desc(mpg))

ADD CASES
add_row(.data, …, .before = NULL, .after = NULL) Add one or more rows to a table.
cars |> add_row(speed = 1, dist = 1)

MAKE NEW VARIABLES
Apply vectorized functions to columns. Vectorized functions take vectors as input and return vectors of the same length as output (see back).

mutate(.data, …, .keep = "all", .before = NULL, .after = NULL) Compute new column(s). Also add_column().
mtcars |> mutate(gpm = 1 / mpg)
mtcars |> mutate(gpm = 1 / mpg, .keep = "none")

rename(.data, …) Rename columns. Use rename_with() to rename with a function.
mtcars |> rename(miles_per_gallon = mpg)
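The column-creating and reordering verbs above can be chained in one pipeline. This is a small sketch on the built-in mtcars data; the object names with_gpm and only_gpm are hypothetical.

```r
library(dplyr)

# mutate() computes a new column element-wise, rename() changes a
# column name, and arrange(desc()) sorts from high to low.
with_gpm <- mtcars |>
  mutate(gpm = 1 / mpg) |>
  rename(miles_per_gallon = mpg) |>
  arrange(desc(miles_per_gallon))

# With .keep = "none", only the newly created column is returned.
only_gpm <- mtcars |> mutate(gpm = 1 / mpg, .keep = "none")

head(with_gpm, 3)
```

The default .keep = "all" retains every existing column, which is why with_gpm has one more column than mtcars.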

CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at dplyr.tidyverse.org • HTML cheatsheets at pos.it/cheatsheets • dplyr 1.1.2 • Updated: 2023-07
Vectorized Functions

TO USE WITH MUTATE ()
mutate() applies vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

OFFSET
dplyr::lag() - offset elements by 1
dplyr::lead() - offset elements by -1

CUMULATIVE AGGREGATE
dplyr::cumall() - cumulative all()
dplyr::cumany() - cumulative any()
cummax() - cumulative max()
dplyr::cummean() - cumulative mean()
cummin() - cumulative min()
cumprod() - cumulative prod()
cumsum() - cumulative sum()

RANKING
dplyr::cume_dist() - proportion of all values <=
dplyr::dense_rank() - rank w ties = min, no gaps
dplyr::min_rank() - rank with ties = min
dplyr::ntile() - bins into n bins
dplyr::percent_rank() - min_rank scaled to [0,1]
dplyr::row_number() - rank with ties = "first"

MATH
+, - , *, /, ^, %/%, %% - arithmetic ops
log(), log2(), log10() - logs
<, <=, >, >=, !=, == - logical comparisons
dplyr::between() - x >= left & x <= right
dplyr::near() - safe == for floating point numbers

MISCELLANEOUS
dplyr::case_when() - multi-case if_else()
starwars |>
  mutate(type = case_when(
    height > 200 | mass > 200 ~ "large",
    species == "Droid" ~ "robot",
    TRUE ~ "other")
  )
dplyr::coalesce() - first non-NA values by element across a set of vectors
dplyr::if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()

Summary Functions

TO USE WITH SUMMARIZE ()
summarize() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.

COUNT
dplyr::n() - number of values/rows
dplyr::n_distinct() - # of uniques
sum(!is.na()) - # of non-NAs

POSITION
mean() - mean, also mean(!is.na())
median() - median

LOGICAL
mean() - proportion of TRUEs
sum() - # of TRUEs

ORDER
dplyr::first() - first value
dplyr::last() - last value
dplyr::nth() - value in nth location of vector

RANK
quantile() - nth quantile
min() - minimum value
max() - maximum value

SPREAD
IQR() - Inter-Quartile Range
mad() - median absolute deviation
sd() - standard deviation
var() - variance

Row Names

Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

tibble::rownames_to_column() Move row names into col.
a <- mtcars |> rownames_to_column(var = "C")

tibble::column_to_rownames() Move col into row names.
a |> column_to_rownames(var = "C")

Also tibble::has_rownames() and tibble::remove_rownames().

Combine Tables

COMBINE VARIABLES
bind_cols(…, .name_repair) Returns tables placed side by side as a single table. Column lengths must be equal. Columns will NOT be matched by id (to do that look at Relational Data below), so be sure to check that both tables are ordered the way you want before binding.

COMBINE CASES
bind_rows(…, .id = NULL) Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names. For example, with .id = "DF", binding x (rows a t 1 and b u 2) and y (rows c v 3 and d w 4) gives:
DF  A  B  C
x   a  t  1
x   b  u  2
y   c  v  3
y   d  w  4

RELATIONAL DATA
The join examples below use these two tables:
x:  A  B  C        y:  A  B  D
    a  t  1            a  t  3
    b  u  2            b  u  2
    c  v  3            d  w  1

Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join matching values from y to x.
A  B  C  D
a  t  1  3
b  u  2  2
c  v  3  NA

right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join matching values from x to y.
A  B  C   D
a  t  1   3
b  u  2   2
d  w  NA  1

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join data. Retain only rows with matches.
A  B  C  D
a  t  1  3
b  u  2  2

full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join data. Retain all values, all rows.
A  B  C   D
a  t  1   3
b  u  2   2
c  v  3   NA
d  w  NA  1

Use a "Filtering Join" to filter one table against the rows of another.

semi_join(x, y, by = NULL, copy = FALSE, …, na_matches = "na") Return rows of x that have a match in y. Use to see what will be included in a join.
A  B  C
a  t  1
b  u  2

anti_join(x, y, by = NULL, copy = FALSE, …, na_matches = "na") Return rows of x that do not have a match in y. Use to see what will not be included in a join.
A  B  C
c  v  3

Use a "Nest Join" to inner join one table to another into a nested data frame.

nest_join(x, y, by = NULL, copy = FALSE, keep = FALSE, name = NULL, …) Join data, nesting matches from y in a single new data frame column.
A  B  C  y
a  t  1  <tibble [1x2]>
b  u  2  <tibble [1x2]>
c  v  3  <tibble [1x2]>

COLUMN MATCHING FOR JOINS
Use by = c("col1", "col2", …) to specify one or more common columns to match on.
left_join(x, y, by = "A")
A  B.x  C  B.y  D
a  t    1  t    3
b  u    2  u    2
c  v    3  NA   NA

Use a named vector, by = c("col1" = "col2"), to match on columns that have different names in each table.
left_join(x, y, by = c("C" = "D"))
A.x  B.x  C  A.y  B.y
a    t    1  d    w
b    u    2  b    u
c    v    3  a    t

Use suffix to specify the suffix to give to unmatched columns that have the same name in both tables.
left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))
A1  B1  C  A2  B2
a   t   1  d   w
b   u   2  b   u
c   v   3  a   t

SET OPERATIONS
intersect(x, y, …) Rows that appear in both x and y.
setdiff(x, y, …) Rows that appear in x but not y.
union(x, y, …) Rows that appear in x or y, duplicates removed. union_all() retains duplicates.
Use setequal() to test whether two data sets contain the exact same rows (in any order).
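To make the difference between mutating and filtering joins concrete, here is a small sketch that rebuilds the x and y tables from the join diagrams (columns A, B, C and A, B, D) and joins them on the shared columns A and B. Passing by explicitly, and the object names joined, kept, dropped, is a choice made for this example.

```r
library(dplyr)

# Example tables matching the join diagrams.
x <- tibble(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3)
y <- tibble(A = c("a", "b", "d"), B = c("t", "u", "w"), D = c(3L, 2L, 1L))

# Mutating join: keeps every row of x and adds D where a match exists.
joined <- left_join(x, y, by = c("A", "B"))

# Filtering joins: return rows of x only, never columns from y.
kept    <- semi_join(x, y, by = c("A", "B"))  # rows of x with a match in y
dropped <- anti_join(x, y, by = c("A", "B"))  # rows of x without a match

print(joined)
```

Checking semi_join() and anti_join() before a mutating join is a quick way to see which rows will and will not find a match.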
