Data Transformation

Uploaded by

Florence Cheang


Data transformation with dplyr :: CHEATSHEET

dplyr functions work with pipes and expect tidy data. In tidy data, each variable is in its own column, and each observation, or case, is in its own row. With pipes, x |> f(y) becomes f(x, y).

Summarize Cases

Apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).

Manipulate Cases

EXTRACT CASES
Row functions return a subset of rows as a new table.

filter(.data, …, .preserve = FALSE) Extract rows that meet logical criteria.
  mtcars |> filter(mpg > 20)

distinct(.data, …, .keep_all = FALSE) Remove rows with duplicate values.
  mtcars |> distinct(gear)

Manipulate Variables

EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.

pull(.data, var = -1, name = NULL, …) Extract column values as a vector, by name or index.
  mtcars |> pull(wt)

select(.data, …) Extract columns as a table.
  mtcars |> select(mpg, wt)

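As a minimal sketch of how the pipe and the extract functions fit together (assuming dplyr is installed; the object names piped, nested, light_efficient and weights are illustrative and not part of the cheat sheet):

```r
library(dplyr)

# The pipe passes its left-hand side as the first argument of the call
# on its right, so these two expressions are the same call.
piped  <- mtcars |> filter(mpg > 20)
nested <- filter(mtcars, mpg > 20)
same_result <- identical(piped, nested)

# Row functions and column functions chain naturally.
light_efficient <- mtcars |>
  filter(mpg > 20) |>   # keep rows that meet the condition
  select(mpg, wt)       # keep two columns, still a table

# pull() returns a plain vector rather than a table.
weights <- mtcars |> pull(wt)
```

select() always returns a table, even for a single column, while pull() drops down to a vector, which is usually what a base R function expects.
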
relocate(.data, …, .before = NULL, .after = NULL) Move columns to new position.
  mtcars |> relocate(mpg, cyl, .after = last_col())

Use these helpers with select() and across(), e.g. mtcars |> select(mpg:cyl)
  contains(match)      num_range(prefix, range)       :, e.g., mpg:cyl
  ends_with(match)     all_of(x)/any_of(x, …, vars)   !, e.g., !gear
  starts_with(match)   matches(match)                 everything()

MANIPULATE MULTIPLE VARIABLES AT ONCE

  df <- tibble(x_1 = c(1, 2), x_2 = c(3, 4), y = c(4, 5))

across(.cols, .funs, …, .names = NULL) Summarize or mutate multiple columns in the same way.
  df |> summarize(across(everything(), mean))

c_across(.cols) Compute across columns in row-wise data.
  df |>
    rowwise() |>
    mutate(x_total = sum(c_across(1:2)))

EXTRACT CASES (continued)

slice(.data, …, .preserve = FALSE) Select rows by position.
  mtcars |> slice(10:15)

slice_sample(.data, …, n, prop, weight_by = NULL, replace = FALSE) Randomly select rows. Use n to select a number of rows and prop to select a fraction of rows.
  mtcars |> slice_sample(n = 5, replace = TRUE)

slice_min(.data, order_by, …, n, prop, with_ties = TRUE) and slice_max() Select rows with the lowest and highest values.
  mtcars |> slice_min(mpg, prop = 0.25)

slice_head(.data, …, n, prop) and slice_tail() Select the first or last rows.
  mtcars |> slice_head(n = 5)

Logical and boolean operators to use with filter()
  ==   <   <=   is.na()    %in%   |   xor()
  !=   >   >=   !is.na()   !      &
See ?base::Logic and ?Comparison for help.

Summarize Cases (continued)

summarize(.data, …) Compute table of summaries.
  mtcars |> summarize(avg = mean(mpg))

count(.data, …, wt = NULL, sort = FALSE, name = NULL) Count number of rows in each group defined by the variables in … Also tally(), add_count(), add_tally().
  mtcars |> count(cyl)

Group Cases

Use group_by(.data, …, .add = FALSE, .drop = TRUE) to create a "grouped" copy of a table grouped by columns in … dplyr functions will manipulate each "group" separately and combine the results.
  mtcars |>
    group_by(cyl) |>
    summarize(avg = mean(mpg))

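The grouping and multi-column tools above combine in one pipeline. The following is a sketch under the assumption that dplyr is installed; the result name by_cyl and the avg_ prefix are illustrative choices, not something the cheat sheet prescribes.

```r
library(dplyr)

# One summary row per cylinder count: the group size, plus the mean of
# two columns computed in the same way with across().
by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(
    n = n(),                                          # rows in each group
    across(c(mpg, wt), mean, .names = "avg_{.col}")   # avg_mpg, avg_wt
  )
```

The .names template controls the output column names; without it, across() would reuse the input names mpg and wt for the averages.
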
Use rowwise(.data, …) to group data into individual rows. dplyr functions will compute results for each row. Also apply functions to list-columns. See tidyr cheat sheet for list-column workflow.
  starwars |>
    rowwise() |>
    mutate(film_count = length(films))

ungroup(x, …) Returns ungrouped copy of table.
  g_mtcars <- mtcars |> group_by(cyl)
  ungroup(g_mtcars)

ARRANGE CASES

arrange(.data, …, .by_group = FALSE) Order rows by values of a column or columns (low to high), use with desc() to order from high to low.
  mtcars |> arrange(mpg)
  mtcars |> arrange(desc(mpg))

ADD CASES

add_row(.data, …, .before = NULL, .after = NULL) Add one or more rows to a table.
  cars |> add_row(speed = 1, dist = 1)

MAKE NEW VARIABLES

Apply vectorized functions to columns. Vectorized functions take vectors as input and return vectors of the same length as output (see back).

mutate(.data, …, .keep = "all", .before = NULL, .after = NULL) Compute new column(s). Also add_column().
  mtcars |> mutate(gpm = 1 / mpg)
  mtcars |> mutate(gpm = 1 / mpg, .keep = "none")

rename(.data, …) Rename columns. Use rename_with() to rename with a function.
  mtcars |> rename(miles_per_gallon = mpg)

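To show how grouping changes what mutate() computes, here is a small sketch (assuming dplyr is installed; the column name mpg_vs_group and the object name ranked are made up for illustration):

```r
library(dplyr)

# Inside a grouped table, mean(mpg) is evaluated once per group, so each
# car is compared with the average of cars that share its cylinder count.
ranked <- mtcars |>
  group_by(cyl) |>
  mutate(mpg_vs_group = mpg - mean(mpg)) |>
  ungroup() |>                        # drop grouping before reordering
  arrange(desc(mpg_vs_group)) |>      # largest positive difference first
  select(cyl, mpg, mpg_vs_group)
```

Calling ungroup() before arrange() is a deliberate choice here: it makes the later steps operate on the whole table and avoids carrying the grouping into code that does not expect it.
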

CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at dplyr.tidyverse.org • HTML cheatsheets at pos.it/cheatsheets • dplyr 1.1.2 • Updated: 2023-07
Vectorized Functions

TO USE WITH MUTATE()
mutate() applies vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

OFFSET
dplyr::lag() - offset elements by 1
dplyr::lead() - offset elements by -1

CUMULATIVE AGGREGATE
dplyr::cumall() - cumulative all()
dplyr::cumany() - cumulative any()
cummax() - cumulative max()
dplyr::cummean() - cumulative mean()
cummin() - cumulative min()
cumprod() - cumulative prod()
cumsum() - cumulative sum()

Summary Functions

TO USE WITH SUMMARIZE()
summarize() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.

COUNT
dplyr::n() - number of values/rows
dplyr::n_distinct() - # of uniques
sum(!is.na()) - # of non-NAs

POSITION
mean() - mean, also mean(!is.na())
median() - median

Combine Tables

COMBINE VARIABLES
bind_cols(…, .name_repair) Returns tables placed side by side as a single table. Column lengths must be equal. Columns will NOT be matched by id (to do that look at Relational Data below), so be sure to check that both tables are ordered the way you want before binding.

  x            y              bind_cols(x, y)
  A B C        E F G          A B C E F G
  a t 1        a t 3          a t 1 a t 3
  b u 2        b u 2          b u 2 b u 2
  c v 3        d w 1          c v 3 d w 1

COMBINE CASES
bind_rows(…, .id = NULL) Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names (as pictured).

  x            y              bind_rows(x, y, .id = "DF")
  A B C        A B C          DF A B C
  a t 1        c v 3          x  a t 1
  b u 2        d w 4          x  b u 2
                              y  c v 3
                              y  d w 4

RELATIONAL DATA
Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.

Use a "Filtering Join" to filter one table against the rows of another.

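The contrast between vectorized and summary functions, and the effect of .id in bind_rows(), can be seen in a short sketch. It assumes dplyr is installed and rebuilds the small x and y tables pictured for bind_rows(); the names stacked, with_offsets and totals are illustrative.

```r
library(dplyr)

x <- tibble(A = c("a", "b"), B = c("t", "u"), C = 1:2)
y <- tibble(A = c("c", "d"), B = c("v", "w"), C = 3:4)

# Naming the inputs lets .id record which table each row came from.
stacked <- bind_rows(x = x, y = y, .id = "DF")

# Vectorized functions return one value per input row.
with_offsets <- stacked |>
  mutate(prev_C = lag(C), running_C = cumsum(C))

# Summary functions collapse a column to a single value.
totals <- stacked |>
  summarize(rows = n(), tables = n_distinct(DF))
```

lag() pads the first position with NA because there is no earlier element to shift in, which is worth remembering before feeding the result into arithmetic.
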
The join diagrams use these two tables:

  x            y
  A B C        A B D
  a t 1        a t 3
  b u 2        b u 2
  c v 3        d w 1

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join matching values from y to x.
  A B C D
  a t 1 3
  b u 2 2
  c v 3 NA

right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join matching values from x to y.
  A B C  D
  a t 1  3
  b u 2  2
  d w NA 1

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join data. Retain only rows with matches.
  A B C D
  a t 1 3
  b u 2 2

full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join data. Retain all values, all rows.
  A B C  D
  a t 1  3
  b u 2  2
  c v 3  NA
  d w NA 1

semi_join(x, y, by = NULL, copy = FALSE, …, na_matches = "na") Return rows of x that have a match in y. Use to see what will be included in a join.
  A B C
  a t 1
  b u 2

anti_join(x, y, by = NULL, copy = FALSE, …, na_matches = "na") Return rows of x that do not have a match in y. Use to see what will not be included in a join.
  A B C
  c v 3

Use a "Nest Join" to inner join one table to another into a nested data frame.

nest_join(x, y, by = NULL, copy = FALSE, keep = FALSE, name = NULL, …) Join data, nesting matches from y in a single new data frame column.
  A B C y
  a t 1 <tibble [1x2]>
  b u 2 <tibble [1x2]>
  c v 3 <tibble [1x2]>

Summary Functions (continued)

LOGICAL
mean() - proportion of TRUEs
sum() - # of TRUEs

ORDER
dplyr::first() - first value
dplyr::last() - last value
dplyr::nth() - value in nth location of vector

RANK
quantile() - nth quantile
min() - minimum value
max() - maximum value

SPREAD
IQR() - Inter-Quartile Range
mad() - median absolute deviation
sd() - standard deviation
var() - variance

Vectorized Functions (continued)

RANKING
dplyr::cume_dist() - proportion of all values <=
dplyr::dense_rank() - rank w ties = min, no gaps
dplyr::min_rank() - rank with ties = min
dplyr::ntile() - bins into n bins
dplyr::percent_rank() - min_rank scaled to [0,1]
dplyr::row_number() - rank with ties = "first"

MATH
+, -, *, /, ^, %/%, %% - arithmetic ops
log(), log2(), log10() - logs
<, <=, >, >=, !=, == - logical comparisons
dplyr::between() - x >= left & x <= right
dplyr::near() - safe == for floating point numbers

MISCELLANEOUS
dplyr::case_when() - multi-case if_else()

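The join diagrams can be reproduced directly. This is a minimal sketch assuming dplyr is installed; it rebuilds the x and y tables from the diagrams and passes by explicitly (an added choice, to avoid the message dplyr prints when it guesses the join columns).

```r
library(dplyr)

x <- tibble(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3)
y <- tibble(A = c("a", "b", "d"), B = c("t", "u", "w"), D = c(3, 2, 1))

keys <- c("A", "B")   # the columns the two tables share

# Mutating joins add columns from y and differ in which rows survive.
left  <- left_join(x, y, by = keys)    # every row of x
inner <- inner_join(x, y, by = keys)   # only rows with a match
full  <- full_join(x, y, by = keys)    # every row of either table

# Filtering joins keep x's columns and only filter its rows.
semi <- semi_join(x, y, by = keys)     # rows of x with a match in y
anti <- anti_join(x, y, by = keys)     # rows of x without a match in y
```

Running anti_join() before a mutating join is a cheap way to see which rows would end up with NA or be dropped.
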
  starwars |>
    mutate(type = case_when(
      height > 200 | mass > 200 ~ "large",
      species == "Droid" ~ "robot",
      TRUE ~ "other")
    )

dplyr::coalesce() - first non-NA values by element across a set of vectors
dplyr::if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()

Row Names

Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

tibble::rownames_to_column() Move row names into col.
  a <- mtcars |> rownames_to_column(var = "C")

tibble::column_to_rownames() Move col into row names.
  a |> column_to_rownames(var = "C")

Also tibble::has_rownames() and tibble::remove_rownames().

COLUMN MATCHING FOR JOINS

Use by = c("col1", "col2", …) to specify one or more common columns to match on.
  left_join(x, y, by = "A")

  A B.x C B.y D
  a t   1 t   3
  b u   2 u   2
  c v   3 NA  NA

Use a named vector, by = c("col1" = "col2"), to match on columns that have different names in each table.
  left_join(x, y, by = c("C" = "D"))

  A.x B.x C A.y B.y
  a   t   1 d   w
  b   u   2 b   u
  c   v   3 a   t

Use suffix to specify the suffix to give to unmatched columns that have the same name in both tables.
  left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))

  A1 B1 C A2 B2
  a  t  1 d  w
  b  u  2 b  u
  c  v  3 a  t

SET OPERATIONS

The set operation diagrams use x with rows (a t 1), (b u 2), (c v 3) and y with rows (c v 3), (d w 4), both with columns A, B, C.

intersect(x, y, …) Rows that appear in both x and y.
  A B C
  c v 3

setdiff(x, y, …) Rows that appear in x but not y.
  A B C
  a t 1
  b u 2

union(x, y, …) Rows that appear in x or y (duplicates removed). union_all() retains duplicates.
  A B C
  a t 1
  b u 2
  c v 3
  d w 4

Use setequal() to test whether two data sets contain the exact same rows (in any order).
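
The set operations can be checked on the two small tables from the diagrams. This is a sketch assuming dplyr is installed and attached, so that intersect(), setdiff(), union() and setequal() resolve to the data frame versions dplyr provides; the result names are illustrative.

```r
library(dplyr)

x <- tibble(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3)
y <- tibble(A = c("c", "d"), B = c("v", "w"), C = 3:4)

both   <- intersect(x, y)   # rows present in both tables
only_x <- setdiff(x, y)     # rows of x that are absent from y
either <- union(x, y)       # rows of either table, duplicates removed

# Putting the two pieces of x back together gives the same set of rows.
rebuilt_matches <- setequal(x, bind_rows(only_x, both))
```

These functions compare whole rows, so both tables need the same columns; to match on a subset of columns, the filtering joins are the better fit.
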
