Data Table
data.tables are also data frames – functions that work with data frames therefore also work with data.tables.

SUMMARIZE
dt[, .(x = sum(a))] – create a data.table with new columns based on the summarized values of rows.
dt[, c := sum(b), by = a] – create a new column and compute rows within groups.
dt[, .SD[1], by = a] – extract first row of groups.
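
The three summarizing idioms above can be sketched on a small table. This is a minimal illustration only: the data in dt (a grouping column a and a numeric column b) is invented for the example and is not from the cheat sheet.

```r
library(data.table)

# Hypothetical example data: 'a' is a grouping column, 'b' holds values.
dt <- data.table(a = c(1, 1, 2, 2), b = c(10, 20, 30, 40))

# Summarize all rows into a new one-row data.table.
total <- dt[, .(x = sum(a))]

# Add column 'c' by reference, holding the per-group sum of 'b'.
dt[, c := sum(b), by = a]

# Keep only the first row of each group defined by 'a'.
first_rows <- dt[, .SD[1], by = a]
```

Note that := modifies dt in place, so no reassignment is needed, while the other two calls return new data.tables and leave dt unchanged.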
Created by Erik Petrovsky and Mara Destefanis – [email protected] • Learn more with the data.table homepage or vignette • data.table version 1.15.0 • Updated: 2024-01
UNIQUE ROWS
unique(dt, by = c("a", "b")) – extract unique rows based on columns specified in “by”. Leave out “by” to use all columns.
uniqueN(dt, by = c("a", "b")) – count the number of unique rows based on columns specified in “by”.

RENAME COLUMNS
setnames(dt, c("a", "b"), c("x", "y")) – rename columns.

BIND
rbind(dt_a, dt_b) – combine rows of two data.tables.
cbind(dt_a, dt_b) – combine columns of two data.tables.

Apply function to cols.

APPLY A FUNCTION TO MULTIPLE COLUMNS
dt[, lapply(.SD, mean), .SDcols = c("a", "b")] – apply a function (e.g. mean(), as.character(), which.max()) to the columns specified in .SDcols, using lapply() and the .SD symbol. Also works with groups.
cols <- c("a")
dt[, paste0(cols, "_m") := lapply(.SD, mean), .SDcols = cols] – apply a function to specified columns and assign the result with suffixed variable names to the original data.
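
As a rough sketch of the two .SD idioms above, the following applies mean() first as a summary and then as an in-place assignment. The table contents are hypothetical and chosen only to keep the arithmetic easy to check.

```r
library(data.table)

# Hypothetical example data with two numeric columns.
dt <- data.table(a = c(1, 2, 3), b = c(4, 5, 6))

# Mean of every column listed in .SDcols, returned as a new one-row data.table.
means <- dt[, lapply(.SD, mean), .SDcols = c("a", "b")]

# Append the mean of each listed column under a suffixed name (here "a_m").
cols <- c("a")
dt[, paste0(cols, "_m") := lapply(.SD, mean), .SDcols = cols]
```

The first call leaves dt untouched; the second adds the column a_m to dt by reference, with the single mean recycled across all rows.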
Reshape a data.table

RESHAPE TO WIDE FORMAT
dcast(dt, id ~ y, value.var = c("a", "b")) – reshape a data.table from long to wide format.
  dt          A data.table.
  id ~ y      Formula with an LHS (ID columns containing IDs for multiple entries) and an RHS (columns whose values are spread into column headers).
  value.var   Columns containing values to fill into cells.

RESHAPE TO LONG FORMAT
melt(dt, measure.vars = measure(value.name, y, sep = "_")) – reshape a data.table from wide to long format.
  dt                          A data.table.
  measure.vars                Columns containing values to fill into cells, often using measure() or patterns().
  id.vars                     Character vector of ID column names (optional).
  variable.name, value.name   Names for output columns (optional).

measure(out_name1, out_name2, sep = "_", pattern = "([ab])_(.*)")
  sep (separator) or pattern (regular expression) is used to specify the columns to melt and to parse input column names.
  out_name1, out_name2: names for output columns (creates a single value column), or value.name (creates a value column for each unique part of the melted column name).

Sequential rows

ROW IDS
dt[, c := 1:.N, by = b] – within groups, compute a column with sequential row IDs.

LAG & LEAD
dt[, c := shift(a, 1), by = b] – within groups, duplicate a column with rows lagged by specified amount.
dt[, c := shift(a, 1, type = "lead"), by = b] – within groups, duplicate a column with rows leading by specified amount.

Read & write files

IMPORT
fread("file.csv") – read data from a flat file such as .csv or .tsv into R.
fread("file.csv", select = c("a", "b")) – read specified columns from a flat file into R.

EXPORT
fwrite(dt, "file.csv") – write data to a flat file from R.

SET KEYS
setkey(dt, a, b) – set keys to enable fast repeated lookup in specified columns using “dt[.(value), ]” or for merging without specifying merging columns using “dt_a[dt_b]”.

Combine data.tables

JOIN
dt_a[dt_b, on = .(b = y)] – join data.tables on rows with equal values.
dt_a[dt_b, on = .(b = y, c > z)] – join data.tables on rows with equal and unequal values.

ROLLING JOIN
dt_a[dt_b, on = .(id = id, date = date), roll = TRUE] – join data.tables on matching rows in id columns but only keep the most recent preceding match with the left data.table according to date columns. “roll = -Inf” reverses direction.
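
The JOIN and ROLLING JOIN entries in this sheet can be sketched as follows. This is an illustration under assumptions: all table contents are invented for the example, and the names readings and lookups are hypothetical stand-ins for the left and right tables of the rolling join.

```r
library(data.table)

# Equi join on hypothetical tables: match dt_a$b against dt_b$y.
dt_a <- data.table(a = c(1, 2, 3), b = c("c", "a", "b"))
dt_b <- data.table(x = c(3, 2, 1), y = c("b", "c", "a"))
joined <- dt_a[dt_b, on = .(b = y)]

# Rolling join on hypothetical dated readings: for each row of 'lookups',
# take the 'readings' row with the same id and the latest date on or
# before the lookup date.
readings <- data.table(
  a    = c(1, 2, 3, 1, 2),
  id   = c("A", "A", "A", "B", "B"),
  date = as.IDate(c("2010-01-01", "2012-01-01", "2014-01-01",
                    "2010-01-01", "2012-01-01"))
)
lookups <- data.table(
  b    = c(1, 1),
  id   = c("A", "B"),
  date = as.IDate(c("2013-01-01", "2013-01-01"))
)
rolled <- readings[lookups, on = .(id = id, date = date), roll = TRUE]
```

In both joins the result has one row per row of the table inside the brackets. With roll = TRUE each 2013 lookup picks up the 2012 reading for its id; roll = -Inf would instead look for the next reading on or after the lookup date, when one exists.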