
Data Transformation with data.table: Cheat Sheet

Data.table is a package for efficiently transforming and manipulating data in R. It converts R's native data frames into data.tables with enhanced functionality. Data.tables allow users to subset rows, select and manipulate columns, summarize and group data using syntax like dt[i,j,by]. They provide fast operations for tasks like subsetting, grouping, joining, and updating data.

Uploaded by

KGAOGELO Moloko

Data Transformation with data.table :: CHEAT SHEET


Basics

data.table is an extremely fast and memory efficient package for transforming data in R. It works by converting R's native data frame objects into data.tables with new and enhanced functionality. The basics of working with data.tables are:

dt[i, j, by]

Take data.table dt, subset rows using i, and manipulate columns with j, grouped according to by.

data.tables are also data frames – functions that work with data frames therefore also work with data.tables.

COMMON GROUPED OPERATIONS
dt[, .(sum_b = sum(b)), by = .(a)] – summarize rows within groups.
dt[, c := sum(b), by = .(a)] – create a new column whose values are computed within groups.
dt[, .SD[1], by = .(a)] – extract first row of groups.
dt[, .SD[.N], by = .(a)] – extract last row of groups.

Chaining

dt[…][…] – perform a sequence of data.table operations by chaining multiple "[]".

Manipulate columns with j

EXTRACT
dt[, c(2)] – select column(s) by number. Prefix column numbers with "-" to deselect.
dt[, .(b, c)] – select column(s) by name.

SUMMARIZE
dt[, .(x = sum(a))] – create a data.table with new columns based on the summarized values of rows.
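The selection, summary, grouping and chaining forms listed above can be tried on a small table. This is a minimal sketch: the toy data and the result names (bc, total, sums, and so on) are assumptions made for illustration and are not part of the cheat sheet.

```r
library(data.table)

# Hypothetical toy data; column names a, b, c follow the cheat sheet.
dt <- data.table(a = c(1, 1, 2, 2),
                 b = c(10, 20, 30, 40),
                 c = c("w", "x", "y", "z"))

# EXTRACT: select columns by name; the result is still a data.table.
bc <- dt[, .(b, c)]

# SUMMARIZE: a one-row data.table holding the sum of b.
total <- dt[, .(x = sum(b))]

# Grouped summary: one row per distinct value of a.
sums <- dt[, .(sum_b = sum(b)), by = .(a)]

# First and last row of each group via .SD and .N.
first_rows <- dt[, .SD[1], by = .(a)]
last_rows  <- dt[, .SD[.N], by = .(a)]

# Chaining: summarize by group, then filter the summary in a second "[]".
big_groups <- dt[, .(sum_b = sum(b)), by = .(a)][sum_b > 50]
```

Here only the group with a equal to 2 survives the chained filter, because its summed b is 70 while the other group sums to 30.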

Summary functions such as mean(), median(), min(), max(), etc. may be used to summarize rows.

ADD COLUMN
dt[, c := 1 + 2] – create a new column based on an expression.

Create a data.table

data.table(a = c(1, 2), b = c("a", "b")) – create a data.table from scratch. Analogous to data.frame().
setDT(df)* or as.data.table(df) – convert a data frame or a list to a data.table.

Functions for data.tables

ARRANGE ROWS
setorder(dt, a, -b) – arrange the rows of a data.table. Prefix variable names with "-" for descending order.
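A short sketch of converting a data frame, adding a column and reordering rows, all by reference. The data frame df and its values are hypothetical; only setDT(), := and setorder() come from the cheat sheet.

```r
library(data.table)

# Hypothetical data frame used only for illustration.
df <- data.frame(a = c(2, 1, 2, 1), b = c(1, 2, 2, 1))

# setDT() converts df to a data.table in place; no "<-" is needed.
setDT(df)

# := adds a column by reference.
df[, c := a + b]

# setorder() sorts in place: a ascending, then b descending.
setorder(df, a, -b)
```

After these calls df is a data.table whose rows are ordered (1, 2), (1, 1), (2, 2), (2, 1) in columns a and b, and no copy of the data was assigned back with "<-".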

UNIQUE CASES
unique(dt, by = c("a", "b")) – extract a subset of the data based on a unique combination of values.
uniqueN(dt, by = c("a", "b")) – return the number of unique rows, based on columns specified in "by". Leave out "by" to use all columns.

* SET FUNCTIONS
data.table provides a collection of functions beginning with "set". They work without "<-" to alter data.tables in place. For instance, "setDT(dt)" works like "dt <- as.data.table(dt)" but without creating any copies in memory.

Subset rows using i

dt[1:2, ] – subset rows based on row numbers.
dt[a > 5, ] – subset rows based on the values in one or more columns.
dt[a == 1, c := 1 + 2] – create a new column based on an expression but only for a subset of rows (the other rows get NA).

LOGICAL OPERATORS TO USE IN i
<   <=   is.na()    %in%   |   %like%
>   >=   !is.na()   !      &   %between%

Group according to by

dt[, j, by = .(a)] – group rows by the values in one or more columns.
Use "keyby = .(a)" for grouping and simultaneously sorting according to group column(s).
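The row subsetting, keyby grouping and uniqueness helpers can be sketched together as follows. The table contents and the result names are assumptions for illustration only.

```r
library(data.table)

# Hypothetical toy data.
dt <- data.table(a = c(2, 1, 2, 1, 6),
                 b = c("x", "y", "x", "z", "y"))

# Subset rows in i by position and by condition.
first_two <- dt[1:2, ]
large     <- dt[a > 5, ]
picked    <- dt[b %in% c("y", "z") & a < 5, ]

# keyby groups and also sorts the result by the grouping column.
counts <- dt[, .(n = .N), keyby = .(a)]

# Distinct rows and their count, judged on columns a and b.
distinct   <- unique(dt, by = c("a", "b"))
n_distinct <- uniqueN(dt, by = c("a", "b"))

# Add a column only where a == 1; all other rows receive NA.
dt[a == 1, c := 1 + 2]
```

The grouped counts come back ordered by a (1, 2, 6) even though the input rows are not sorted, which is the difference between keyby and by.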

CC BY SA Erik Petrovski • [email protected] • www.petrovski.dk • Learn more with the data.table webpage or vignette • data.table version 1.11.4 • Updated: 2018-08
RENAME COLUMNS
setnames(dt, c("a", "b"), c("x", "y")) – rename multiple columns.

SET KEYS
setkey(dt, a, b) – set keys in a data.table to enable faster repeated lookups in specified columns using "dt[.(value), ]" or for merging without specifying merging columns "dt_a[dt_b]".

BIND
rbind(dt_a, dt_b) – combine rows of two data.tables.
cbind(dt_a, dt_b) – combine columns of two data.tables.

.SD

Refer to a Subset of the Data within a data.table with .SD.

MULTIPLE COLUMN TYPE CONVERSION
dt[, lapply(.SD, as.character), .SDcols = c("a", "b")] – convert designated columns to character.
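A minimal sketch of a keyed lookup, a .SD-based type conversion and a rename. The table and the names hit and chars are hypothetical; the calls themselves are the ones listed in the cheat sheet.

```r
library(data.table)

# Hypothetical toy data.
dt <- data.table(a = c(3, 1, 2), b = c(10, 20, 30), z = c("p", "q", "r"))

# setkey() sorts dt by a and marks a as the key, enabling dt[.(value)] lookups.
setkey(dt, a)
hit <- dt[.(2)]

# Apply as.character to the columns named in .SDcols only;
# the result contains just those columns.
chars <- dt[, lapply(.SD, as.character), .SDcols = c("a", "b")]

# setnames() renames columns in place.
setnames(dt, c("a", "b"), c("x", "y"))
```

Note that the conversion returns a new data.table and leaves dt itself unchanged, whereas setkey() and setnames() modify dt by reference.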
GROUP OPTIMA
dt[, .SD[which.max(a)], by = b] – select the row with the highest value of a within each group of b. Also works with which.min() and which(). Similar to .SD[.N] and .SD[1] on page 1.

Combine data.tables

JOIN
dt_a[dt_b, on = .(b = y)] – join two data.tables based on rows with equal values. setkey() can be used instead of "on".
dt_a[dt_b, on = .(b = y, c > z)] – join two data.tables based on rows with equal and unequal values.

Reshape a data.table

DCAST
dcast(dt, id ~ y, value.var = c("a", "b")) – reshape a data.table from long to wide format.

Long input:
id  y  a  b
A   X  1  3
A   Z  2  4
B   X  1  3
B   Z  2  4

Wide result:
id  a_X  a_Z  b_X  b_Z
A   1    2    3    4
B   1    2    3    4

dt – A data.table.
id ~ y – Formula with a LHS: id column(s) containing id(s) for multiple entries. And a RHS: column(s) with value(s) to spread in column headers.
value.var – Column(s) containing values to fill into cells.

fread & fwrite

fread & fwrite are data.table's fast and multithreaded functions for importing from and exporting to flat files – such as csv and tsv.

IMPORT
fread("file.csv") – read a flat file into R.
fread("file.csv", select = c("a", "b")) – read two columns named "a" and "b" from a file named "file.csv" in the working directory.
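A small round trip shows fwrite and fread working together. This is a sketch under assumptions: the table is made up, and a temporary file stands in for "file.csv" so the example does not touch the working directory.

```r
library(data.table)

# Hypothetical toy data.
dt <- data.table(a = 1:3, b = c("x", "y", "z"), c = c(0.5, 1.5, 2.5))

# Write to a temporary csv file.
path <- tempfile(fileext = ".csv")
fwrite(dt, path)

# Read the whole file back, then only columns a and b via select.
full <- fread(path)
ab   <- fread(path, select = c("a", "b"))

# Clean up the temporary file.
unlink(path)
```

Reading with select skips the unwanted columns while parsing, which is usually cheaper than reading everything and dropping columns afterwards.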
EXPORT
fwrite(dt, file = "") – write a flat file from R.

MULTITHREADING
setDTthreads() – set the number of threads that fread may use. Default is all available and appropriate for the task at hand.

MELT
melt(dt, id.vars = c("id"), measure = patterns("^a", "^b"), variable.name = "y", value.name = c("a", "b")) – reshape a data.table from wide to long format.

Wide input:
id  a_X  a_Z  b_X  b_Z
A   1    2    3    4
B   1    2    3    4

Long result:
id  y  a  b
A   1  1  3
B   1  1  3
A   2  2  4
B   2  2  4

dt – A data.table.
id.vars – Id column(s) with id(s) for multiple entries.
measure – Column(s) containing values to fill into cells (often in pattern form).
variable.name, value.name – Name(s) of new column(s) for variables and values derived from old headers.

ROLLING JOIN

By default, a rolling join matches rows, defined by an id variable, but only keeps the most recent preceding match with the left table, defined by a date variable.

dt_a:
a  id  date
1  A   01-01-2010
2  A   01-01-2012
3  A   01-01-2014
1  B   01-01-2010
2  B   01-01-2012

dt_b:
b  id  date
1  A   01-01-2013
1  B   01-01-2013

Result:
a  id  date        b
2  A   01-01-2013  1
2  B   01-01-2013  1

# first set keys
setkey(dt_a, id, date)
setkey(dt_b, id, date)
# then roll
dt_a[dt_b, roll = TRUE]

dt_a[dt_b, roll = -Inf] – reverse the direction of the rolling join.
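The two join styles can be sketched side by side: an equality join using the on argument, and a rolling join on keyed tables. The values mirror the small tables pictured in the cheat sheet, but the variable names roll_a, roll_b and the result names are assumptions, and the dates are written in ISO format so as.Date() can parse them.

```r
library(data.table)

# Equality join: match dt_a$b against dt_b$y.
dt_a <- data.table(a = c(1, 2, 3), b = c("c", "a", "b"))
dt_b <- data.table(x = c(3, 2, 1), y = c("b", "c", "a"))
joined <- dt_a[dt_b, on = .(b = y)]  # one row per row of dt_b

# Rolling join: id must match exactly, date is allowed to "roll".
roll_a <- data.table(
  a    = c(1, 2, 3, 1, 2),
  id   = c("A", "A", "A", "B", "B"),
  date = as.Date(c("2010-01-01", "2012-01-01", "2014-01-01",
                   "2010-01-01", "2012-01-01"))
)
roll_b <- data.table(
  b    = c(1, 1),
  id   = c("A", "B"),
  date = as.Date(c("2013-01-01", "2013-01-01"))
)

# First set keys; the last key column is the one that rolls.
setkey(roll_a, id, date)
setkey(roll_b, id, date)

# roll = TRUE: for each roll_b row, take the latest roll_a row
# with the same id whose date is on or before the roll_b date.
rolled_forward <- roll_a[roll_b, roll = TRUE]

# roll = -Inf: take the earliest roll_a row on or after the date instead.
rolled_back <- roll_a[roll_b, roll = -Inf]
```

With these values the forward roll picks the 2012 rows for both ids, while the backward roll finds the 2014 row for id A and nothing for id B, so that row holds NA in column a.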
