
Data Table

This cheat sheet provides a comprehensive overview of the data.table package in R, highlighting its efficiency and functionality for data manipulation. It covers key operations such as subsetting, grouping, summarizing, reshaping, and joining data.tables, along with examples of syntax. Additionally, it includes methods for reading from and writing to files, as well as converting data frames to data.tables.

Uploaded by

Cirill Mikhaliev

Data Transformation with data.table :: CHEAT SHEET


Basics

data.table is an extremely fast and memory-efficient package for transforming data in R. It works by converting R’s native data frame objects into data.tables with new and enhanced functionality. The basics of working with data.tables are:

dt[i, j, by]

Take data.table dt, subset rows using i and manipulate columns with j, grouped according to by.

data.tables are also data frames – functions that work with data frames therefore also work with data.tables.

Create a data.table

data.table(a = c(1, 2), b = c("a", "b")) – create a data.table from scratch. Analogous to data.frame().
setDT(df)* or as.data.table(df) – convert a data frame or a list to a data.table.

Subset rows using i

dt[1:2, ] – subset rows based on row numbers.
dt[a > 5, ] – subset rows based on values in one or more columns.

LOGICAL OPERATORS TO USE IN i
<   <=   is.na()    %in%   |   %like%
>   >=   !is.na()   !      &   %between%

Manipulate columns with j

EXTRACT
dt[, c(2)] – extract columns by number. Prefix column numbers with “-” to drop.
dt[, .(b, c)] – extract columns by name.

SUMMARIZE
dt[, .(x = sum(a))] – create a data.table with new columns based on the summarized values of rows.
Summary functions like mean(), median(), min(), max(), etc. can be used to summarize rows.

COMPUTE COLUMNS*
dt[, c := 1 + 2] – compute a column based on an expression.
dt[a == 1, c := 1 + 2] – compute a column based on an expression but only for a subset of rows.
dt[, `:=`(c = 1, d = 2)] – compute multiple columns based on separate expressions.

DELETE COLUMN
dt[, c := NULL] – delete a column.

CONVERT COLUMN TYPE
dt[, b := as.integer(b)] – convert the type of a column using as.integer(), as.numeric(), as.character(), as.Date(), etc.

Group according to by

dt[, j, by = .(a)] – group rows by values in specified columns.
dt[, j, keyby = .(a)] – group and simultaneously sort rows by values in specified columns.

COMMON GROUPED OPERATIONS
dt[, .(c = sum(b)), by = a] – summarize rows within groups.
dt[, c := sum(b), by = a] – create a new column and compute rows within groups.
dt[, .SD[1], by = a] – extract first row of groups.
dt[, .SD[.N], by = a] – extract last row of groups.

Chaining

dt[…][…] – perform a sequence of data.table operations by chaining multiple “[]”.

Functions for data.tables

REORDER
setorder(dt, a, -b) – reorder a data.table according to specified columns. Prefix column names with “-” for descending order.

* SET FUNCTIONS AND :=
data.table’s functions prefixed with “set” and the operator “:=” work without “<-” to alter data without making copies in memory. E.g., the more efficient “setDT(df)” is analogous to “df <- as.data.table(df)”.
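The dt[i, j, by] form, grouped operations, chaining and setorder() described above can be tried together in a small sketch. The data and the names total, group_sum, first_rows, last_rows and big_groups are made up for illustration and do not come from the cheat sheet.

```r
library(data.table)

# Hypothetical example data; a is the grouping column, b holds the values.
dt <- data.table(a = c(1, 1, 2, 2, 3), b = c(10, 20, 30, 40, 50))

# i: keep rows where b > 10; j: sum b; by: one result row per value of a.
totals <- dt[b > 10, .(total = sum(b)), by = a]

# := adds a column in place (no "<-" needed); here a per-group sum of b.
dt[, group_sum := sum(b), by = a]

# .SD[1] and .SD[.N] pick the first and last row of each group.
first_rows <- dt[, .SD[1], by = a]
last_rows  <- dt[, .SD[.N], by = a]

# Chaining: summarize in the first "[]", then filter the summary in the second.
big_groups <- dt[, .(total = sum(b)), by = a][total > 30]

# setorder() sorts dt in place: ascending a, then descending b.
setorder(dt, a, -b)
```

Note that the summaries are new data.tables, while the := and setorder() calls change dt itself.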

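The difference between the “set” functions and := on one side and ordinary assignment on the other is easiest to see side by side. The following is a minimal sketch with made-up data; the names dt_copy, flag and flag_values are illustrative assumptions.

```r
library(data.table)

# Hypothetical data frame used only for this illustration.
df <- data.frame(a = c(1, 2, 6), b = c(1.5, 2.6, 3.0))

# as.data.table() returns a converted copy, so the result must be assigned.
dt_copy <- as.data.table(df)

# setDT() converts df itself to a data.table; no "<-" is needed.
setDT(df)

# := changes df in place: convert the type of column b.
df[, b := as.integer(b)]

# Compute a column only for a subset of rows; other rows are filled with NA.
df[a > 5, flag := TRUE]
flag_values <- df$flag

# Delete the column again, also in place.
df[, flag := NULL]
```

The copy made with as.data.table() keeps the original double column b, because the later := calls only touch df.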
Created by Erik Petrovsky and Mara Destefanis – [email protected] • Learn more with the data.table homepage or vignette • data.table version 1.15.0 • Updated: 2024-01
UNIQUE ROWS
unique(dt, by = c("a", "b")) – extract unique rows based on columns specified in “by”. Leave out “by” to use all columns.
uniqueN(dt, by = c("a", "b")) – count the number of unique rows based on columns specified in “by”.

RENAME COLUMNS
setnames(dt, c("a", "b"), c("x", "y")) – rename columns.

SET KEYS
setkey(dt, a, b) – set keys to enable fast repeated lookup in specified columns using “dt[.(value), ]” or for merging without specifying merging columns using “dt_a[dt_b]”.

Combine data.tables

JOIN
dt_a[dt_b, on = .(b = y)] – join data.tables on rows with equal values.
dt_a[dt_b, on = .(b = y, c > z)] – join data.tables on rows with equal and unequal values.

ROLLING JOIN
dt_a[dt_b, on = .(id = id, date = date), roll = TRUE] – join data.tables on matching rows in id columns but only keep the most recent preceding match with the left data.table according to date columns. “roll = -Inf” reverses direction.

BIND
rbind(dt_a, dt_b) – combine rows of two data.tables.
cbind(dt_a, dt_b) – combine columns of two data.tables.

Reshape a data.table

RESHAPE TO WIDE FORMAT
dcast(dt, id ~ y, value.var = c("a", "b")) – reshape a data.table from long to wide format.
dt – a data.table.
id ~ y – formula with a LHS (ID columns containing IDs for multiple entries) and a RHS (columns with values to spread in column headers).
value.var – columns containing values to fill into cells.

RESHAPE TO LONG FORMAT
melt(dt, measure.vars = measure(value.name, y, sep = "_")) – reshape a data.table from wide to long format.
dt – a data.table.
measure.vars – columns containing values to fill into cells, often using measure() or patterns().
id.vars – character vector of ID column names (optional).
variable.name, value.name – names for output columns (optional).
measure(out_name1, out_name2, sep = "_", pattern = "([ab])_(.*)") – sep (separator) or pattern (regular expression) is used to specify columns to melt, and to parse input column names.
out_name1, out_name2 – names for output columns (creates a single value column), or value.name (creates a value column for each unique part of the melted column name).

Apply function to cols.

APPLY A FUNCTION TO MULTIPLE COLUMNS
dt[, lapply(.SD, mean), .SDcols = c("a", "b")] – apply a function (e.g. mean(), as.character(), which.max()) to columns specified in .SDcols with lapply() and the .SD symbol. Also works with groups.
cols <- c("a")
dt[, paste0(cols, "_m") := lapply(.SD, mean), .SDcols = cols] – apply a function to specified columns and assign the result with suffixed variable names to the original data.

Sequential rows

ROW IDS
dt[, c := 1:.N, by = b] – within groups, compute a column with sequential row IDs.

LAG & LEAD
dt[, c := shift(a, 1), by = b] – within groups, duplicate a column with rows lagged by specified amount.
dt[, c := shift(a, 1, type = "lead"), by = b] – within groups, duplicate a column with rows leading by specified amount.

read & write files

IMPORT
fread("file.csv") – read data from a flat file such as .csv or .tsv into R.
fread("file.csv", select = c("a", "b")) – read specified columns from a flat file into R.

EXPORT
fwrite(dt, "file.csv") – write data to a flat file from R.
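The JOIN and ROLLING JOIN entries can be checked against the small tables pictured on the original sheet. The sketch below rebuilds those tables by hand. The variable names with suffixes (dt_a2, rolled_back and so on) are illustrative, and the dates are assumed to mean 1 January of each year.

```r
library(data.table)

# Equi join: one result row per row of dt_b, in dt_b's order.
dt_a <- data.table(a = c(1, 2, 3), b = c("c", "a", "b"))
dt_b <- data.table(x = c(3, 2, 1), y = c("b", "c", "a"))
joined <- dt_a[dt_b, on = .(b = y)]

# Non-equi join: b must equal y and c must be greater than z.
# Rows of dt_b2 without a match get NA in the columns taken from dt_a2.
dt_a2 <- data.table(a = c(1, 2, 3), b = c("c", "a", "b"), c = c(7, 5, 6))
dt_b2 <- data.table(x = c(3, 2, 1), y = c("b", "c", "a"), z = c(4, 5, 8))
non_equi <- dt_a2[dt_b2, on = .(b = y, c > z)]

# Rolling join: match on id, then take the most recent preceding date.
dt_a3 <- data.table(
  a    = c(1, 2, 3, 1, 2),
  id   = c("A", "A", "A", "B", "B"),
  date = as.Date(c("2010-01-01", "2012-01-01", "2014-01-01",
                   "2010-01-01", "2012-01-01"))
)
dt_b3 <- data.table(
  b    = c(1, 1),
  id   = c("A", "B"),
  date = as.Date(c("2013-01-01", "2013-01-01"))
)
rolled <- dt_a3[dt_b3, on = .(id = id, date = date), roll = TRUE]

# roll = -Inf looks forward instead: the next following date, if any.
rolled_back <- dt_a3[dt_b3, on = .(id = id, date = date), roll = -Inf]
```

In the forward-looking variant, id "B" has no date after 2013 in dt_a3, so its a value is NA.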

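The two reshape entries are inverses of each other, which a round trip makes visible. This sketch rebuilds the long table pictured on the sheet. The names long, wide and back are illustrative. It assumes a data.table version that provides measure(), such as the 1.15.0 release the sheet refers to.

```r
library(data.table)

# Long format: one row per id and y, with value columns a and b.
long <- data.table(
  id = c("A", "A", "B", "B"),
  y  = c("x", "z", "x", "z"),
  a  = c(1, 2, 1, 2),
  b  = c(3, 4, 3, 4)
)

# Long to wide: values of y are spread into the column headers,
# giving columns a_x, a_z, b_x and b_z next to id.
wide <- dcast(long, id ~ y, value.var = c("a", "b"))

# Wide to long: measure() splits each column name at "_".
# The first part names the value column (value.name), the second part
# goes into a new column y.
back <- melt(wide, measure.vars = measure(value.name, y, sep = "_"))
```

The row order of back may differ from long, so sort both (for example with setorder()) before comparing them.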