Data - Table: Nesting and Unnesting Data
Tyson S. Barrett¹
¹ Utah State University

Abstract
The use of list-columns in data frames and tibbles in the R statistical environment is well
documented (e.g., Bryan, 2018), providing a cognitively efficient way to organize results of
complex data (e.g., several statistical models, groupings of text, data summaries, or even
graphics) with corresponding data. For example, one can store student information within
classrooms, player information within teams, or even analyses within groups. This allows the
data to be of variable sizes without overly complicating or adding redundancies to the
structure of the data. In turn, this can make it easier to analyze the data appropriately and
reliably.

Because of its efficiency and speed, being able to use data.table to work with list-columns
would be beneficial in many data contexts (e.g., to reduce memory usage in large data sets).
Herein, I demonstrate how one can create list-columns in a data table using the by argument
in data.table and purrr::map(). This is done using an example data set containing
information on professional basketball players in the United States. I compare the behavior
of the data.table approach to the dplyr::group_nest() function, one of the several powerful
tidyverse nesting functions. Results using bench::mark() show the speed and efficiency of
using data.table to work with list-columns.
Correspondence concerning this article should be addressed to Tyson S. Barrett, 2800 Old Main, Logan,
UT 84322. E-mail: [email protected]

LIST-COLUMNS IN DATA.TABLE

Introduction

The use of list-columns in data frames and tibbles in the R statistical environment (R Core
Team, 2018) provides a cognitively efficient way to organize complex data (e.g., several
statistical models, groupings of text, data summaries, or even graphics) with corresponding
data in a concise manner. It has become a common approach to wrangling data in the
tidyverse, with functions across dplyr and tidyr providing functionality to work with
list-columns (Bryan, 2018; Wickham et al., 2019; Wickham & Henry, 2019). This format is
often called "nested" data, where information is, in essence, nested within a column of data.

For example, list-columns can be used to nest data regarding students within classrooms,
players within teams, measures within individuals, and text within chapters. This allows the
user to do certain data manipulations within each group in a consistent, controlled manner.
This helps ensure that data from other groups are not accidentally included. Furthermore,
nesting can make it easier to analyze the data stored in the list-column appropriately. Using
functions like lapply() or purrr::map*() makes further analysis of the nested data more
intuitive and less error-prone.
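
As a minimal sketch of this idea, the following builds a small nested data table and then works within each group using a base R apply function. The toy data and the column names (classroom, student, score) are illustrative assumptions and are not part of the article's data set.

```r
library(data.table)

# Toy data (hypothetical values): three students in two classrooms
students <- data.table(
  classroom = c("A", "A", "B"),
  student   = c("Ann", "Bo", "Cy"),
  score     = c(90, 85, 70)
)

# One row per classroom; the `data` list-column holds each group's rows
nested <- students[, .(data = list(.SD)), by = classroom]

# Work within each group by iterating over the list-column
nested[, mean_score := sapply(data, function(d) mean(d$score))]
nested
```

Each element of the list-column is itself a small data table, so any function that accepts a data frame can be applied to one group at a time.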

Because of its efficiency and speed, being able to use data.table to work with list-columns
would be beneficial in many data contexts (e.g., to reduce memory usage in large data sets,
increase speed of calculations). Herein, I demonstrate how one can create list-columns in a
data table using the by argument in data.table (using a custom function) and the
purrr::map*() functions. I further highlight the dplyr::group_nest() function and show a
slightly more efficient approach when using a data table. Using bench::mark(), I assess the
speed and efficiency of using data.table to work with list-columns.

This article relies on several powerful R packages, including data.table, dplyr, bench,
tidyr, papaja, stringr, ggplot2, ggbeeswarm, ggrepel, performance, rvest, and lobstr (Aust &
Barth, 2018; Clarke & Sherrill-Mix, 2017; Dowle & Srinivasan, 2019; Hester, 2019; Lüdecke,
Makowski, & Waggoner, 2019; Slowikowski, 2018; Wickham, 2016, 2019c, 2019b, 2019a; Wickham
et al., 2019; Wickham & Henry, 2019).
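
Example Data

library(dplyr)       # mutate(), bind_rows(), rename(), %>%
library(janitor)     # assumption: clean_names() is janitor::clean_names()
library(data.table)  # data.table()

# extract_fun() and the raw inputs players_2018 / players_2019 are not
# shown in this section
player_2018 <-
  extract_fun(players_2018) %>%
  mutate(year = 2018,
         AGE = as.numeric(AGE))
player_2019 <-
  extract_fun(players_2019) %>%
  mutate(year = 2019)
# Stack both seasons, tidy the column names, and convert to a data table
players <-
  bind_rows(player_2018, player_2019) %>%
  clean_names() %>%
  rename(ppg = ppg_points_points_per_game,
         apg = apg_assists_assists_per_game) %>%
  data.table()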

Below is a subset of this imported data set, showing only four of the variables and the first
six rows.

Overall, dplyr::group_nest() is efficient and fast, but an approach that relies on data.table
can be somewhat faster. This will be shown using the following function, which relies solely
on the syntax of data.table, using the j and by arguments as shown below.
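group_nest_dt <- function(dt, ..., .key = "data"){
  stopifnot(is.data.table(dt))
  # Capture the grouping variables as an unevaluated call, e.g. list(team)
  by <- substitute(list(...))
  # The rest of the body is cut off in this copy; it is reconstructed here
  # from the description that follows (a list within a list holding .SD)
  dt <- dt[, list(list(.SD)), by = eval(by)]
  setnames(dt, old = "V1", new = .key)
  dt
}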

The first thing to note is that, in the data table, we create a list within a list containing
the .SD special object. This object holds, for each group, all the data in the data table
except for the variables that are in the by argument. The by argument is first captured as an
unevaluated list of bare variable names and is then evaluated within the data.table syntax.
In essence, this function takes a data table and creates a list-column holding one data table
per group specified in the by argument.
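
The two mechanisms described above can be inspected separately. The sketch below is illustrative only: capture_by() is a hypothetical helper that returns what the by argument looks like before evaluation, and the toy table is not the players data.

```r
library(data.table)

# Hypothetical helper: returns the unevaluated grouping expression
capture_by <- function(...) substitute(list(...))

by_expr <- capture_by(team, year)
by_expr            # the unevaluated call list(team, year)
class(by_expr)     # a "call" object, not yet evaluated
all.vars(by_expr)  # the bare variable names it refers to

# Toy data (hypothetical values)
toy <- data.table(team = c("Bos", "Bos", "Min"),
                  ppg  = c(20, 10, 15),
                  apg  = c(5, 3, 7))

# Within each group, .SD holds every column except the grouping variable
sd_cols <- toy[, .(cols = paste(names(.SD), collapse = ", "), n = .N), by = team]
sd_cols
```

Because the grouping expression is stored as a call, it can be passed on to the by argument and evaluated there against the columns of the data table.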

head(group_nest_dt(players, team))
##    team         data
## 1:  Min <data.table>
## 2:  Orl <data.table>
## 3:  Dal <data.table>
## 4:  Hou <data.table>
## 5:  Bos <data.table>
## 6:  Ind <data.table>

The syntax and output are nearly identical to those of the dplyr::group_nest() function, but
the list-column holds data tables instead of tibbles.

head(group_nest(players, team))
## # A tibble: 6 x 2
##   team  data
##   <chr> <list>
## 1 Atl   <tibble [44 x 30]>
## 2 Bos   <tibble [37 x 30]>
## 3 Bro   <tibble [41 x 30]>
## 4 Cha   <tibble [34 x 30]>
## 5 Chi   <tibble [43 x 30]>
## 6 Cle   <tibble [49 x 30]>

Figure 1. Diagram of one approach to creating a list-column in a data frame (i.e., creating a
nested data frame).
Figure 2. Speed comparisons for each nesting approach. Note the scale of the y-axis is log10.

Given that both perform very similar data manipulations, it is of interest to see if there
are differences in memory and speed performance. Figure 2 presents the timings from
bench::mark() across the two approaches, showing that group_nest_dt() is often faster,
although differences for this size of data set are not meaningful. The memory allocated is
also very similar, with group_nest_dt() allocating 463 KB and group_nest() allocating 335 KB.
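
A comparison of this kind could be set up roughly as follows. This is a sketch under stated assumptions: it uses simulated toy data rather than the players data, it inlines the data.table nesting expression rather than calling group_nest_dt(), and it assumes the dplyr and bench packages are installed. The two results differ in class (data tables versus tibbles), so the equality check in bench::mark() is turned off.

```r
library(data.table)
library(dplyr)
library(bench)

# Simulated toy data (hypothetical): 1,000 rows across five teams
set.seed(1)
toy <- data.table(team = sample(letters[1:5], 1000, replace = TRUE),
                  ppg  = rnorm(1000),
                  apg  = rnorm(1000))

res <- bench::mark(
  data.table = toy[, .(data = list(.SD)), by = team],
  dplyr      = dplyr::group_nest(toy, team),
  check      = FALSE,  # outputs are not identical objects, so skip the check
  iterations = 20
)

# Median time and memory allocated for each approach
res[, c("expression", "median", "mem_alloc")]
```

Timings and memory from a sketch like this depend on the machine, package versions, and data size, so they should not be expected to match the figures reported here.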

This nesting approach can be used with multiple grouping variables too. For example, a user
could nest by both team and year, as is done below.
head(group_nest_dt(players, team, year))

Often, the nested data can provide an intuitive format to run several analyses to understand
key features of the data within the groups. Below, the relationship between points-per-game
and assists-per-game for each team and year is modeled, and the R² of each model is then
extracted. Since performance::r2() provides two versions of R², I then grab only the first
of the two types.
players_nested <- group_nest_dt(players, team, year)
players_nested[, ppg_apg := purrr::map(data, ~lm(ppg ~ apg, data = .x))]
players_nested[, r2_list := purrr::map(ppg_apg, ~performance::r2(.x))]
players_nested[, r2_ppg_apg := purrr::map_dbl(r2_list, ~.x[[1]])]
head(players_nested)

This produces two list-columns (ppg_apg and r2_list) and a numeric vector (r2_ppg_apg), all
organized by team and year. This information is then readily available to plot. For example,
one can look at how related points-per-game and assists-per-game are by team and year, in
essence showing which teams have players who both score and assist. The example plot is
shown in Figure 3.

players_nested %>%
  dcast(team ~ year, value.var = "r2_ppg_apg") %>%
  ggplot(aes(`2018`, `2019`, group = team)) +
    geom_point() +
    geom_text_repel(aes(label = team)) +
    geom_abline(slope = 1) +
    coord_fixed(ylim = c(0, 1),
                xlim = c(0, 1))

After performing the manipulations or analyses within the nest, it is often necessary to
unnest the data to finalize analyses. As with group_nest_dt(), the unnest_dt() function below
relies solely on the syntax of data.table, using the j and by arguments.
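unnest_dt <- function(dt, col, id){
  stopifnot(is.data.table(dt))
  # Capture the id variables and the unnesting expression without evaluating them
  by <- substitute(id)
  col <- substitute(unlist(col, recursive = FALSE))
  # The closing lines are cut off in this copy; they are reconstructed here as
  # an assumption, evaluating both expressions in j and by
  dt[, eval(col), by = eval(by)]
}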

Figure 3. Example analysis performed using nested data to provide information for each
team and year.

This function can be used to unnest a data table, like the players_nested data table from
earlier, where the nested column can be a data table, data frame, or tibble. Below, the data
column in the table is unnested by team and year, and then a few of the variables are
selected for demonstration purposes.
players_unnested <- unnest_dt(players_nested,
col = data,
id = list(team, year))
players_unnested[, .(team, year, full_name, pos, age, gp, mpg)]
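
To see why unlisting one level is enough to unnest, the following sketch runs the same idea on a toy table without the wrapper function. The data and column names are illustrative assumptions.

```r
library(data.table)

# Toy data (hypothetical values)
toy <- data.table(team = c("Bos", "Bos", "Min"),
                  year = c(2018, 2019, 2018),
                  ppg  = c(20, 10, 15))

# Nest by team: one row per team, with a list-column of data tables
nested <- toy[, .(data = list(.SD)), by = team]

# Within a group, `data` is a length-one list holding that group's table.
# Unlisting a single level returns the table's columns as a named list,
# which data.table then binds back into regular columns.
unnested <- nested[, unlist(data, recursive = FALSE), by = team]
unnested
```

In this toy case the unnested result has the same rows and columns as the table before nesting.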

Figure 4. Speed comparisons for each unnesting approach. Note the scale of the y-axis is
log10.

A slight variation of this function can be used for list-columns with atomic vectors
instead of data tables. A function like the following works well.
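unnest_vec_dt <- function(dt, cols, id, name){
  stopifnot(is.data.table(dt))
  by <- substitute(id)
  cols <- substitute(unlist(cols,
                            recursive = FALSE))
  # The remaining lines are cut off in this copy; they are reconstructed here
  # as an assumption, following unnest_dt() and using `name` for the new columns
  dt <- dt[, eval(cols), by = eval(by)]
  setnames(dt, old = paste0("V", seq_along(name)), new = name)
  dt
}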

In players_nested, the r2_list column is a list of numeric vectors. This can be unnested as
shown below, providing the two measures of R² per team per year.
unnest_vec_dt(players_nested,
cols = list(r2_list),
id = list(team, year),
name = "r2")
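
For a plain list-column of atomic vectors, the same result can be sketched directly with unlist() in j. The toy table below is an illustrative assumption and does not use the wrapper function above.

```r
library(data.table)

# Toy data (hypothetical): each row holds a numeric vector of varying length
dt <- data.table(id   = c("a", "b"),
                 vals = list(c(1, 2, 3), c(10, 20)))

# Unlist the vector within each id, giving one row per value
long <- dt[, .(val = unlist(vals)), by = id]
long
```

The id values are repeated as many times as there are elements in each vector, so vectors of different lengths are handled without padding.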

The last item to demonstrate herein is the computer memory usage of different formats of
data tables holding the same data. We can use the following large data sets in wide format,
nested wide format, long format, and nested long format to make brief comparisons.

# Wide: one row per id
wide_format <- data.table(id = 1:1e6,
                          x1 = rnorm(1e6),
                          x2 = rnorm(1e6),
                          y1 = rnorm(1e6),
                          y2 = rnorm(1e6),
                          group = rbinom(1e6, 1, .5))
nested_wide_format <- group_nest_dt(wide_format, group)
# Long: one row per id and measure; melt() dispatches to the data.table method
long_format <- melt(wide_format,
                    id.vars = c("id", "group"),
                    measure.vars = c("x1", "x2", "y1", "y2"))
nested_long_format <- group_nest_dt(long_format, group)

I use the lobstr package to assess the object size of each format of the same data, shown in
Table 1. Not surprisingly, the memory usage of nested data is lower than that of the
corresponding non-nested data. This is directly related to the reduction in redundancies
otherwise present in the data. That is, the nested data has far fewer rows containing the
group variable. That alone, in data this large, saves memory. For example, the size of a
single column of the group variable in wide format is 4 MB, and in long format it is 16 MB.
By reducing a single variable in this case, we save several megabytes of memory.
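
The 4 MB and 16 MB figures follow from the storage type: rbinom() returns an integer vector, and each integer takes 4 bytes. The sketch below checks this with base R's object.size(), used here as a stand-in for lobstr (an assumption made to keep the example free of extra packages); size_mb() is a hypothetical helper.

```r
set.seed(1)
n <- 1e6

# One group value per id, as in the wide format
group_wide <- rbinom(n, 1, 0.5)
typeof(group_wide)  # integer storage, 4 bytes per value

# Melting four measure columns repeats each id variable four times
group_long <- rep(group_wide, 4)

# Hypothetical helper: object size in megabytes (1 MB = 1e6 bytes)
size_mb <- function(x) as.numeric(object.size(x)) / 1e6

size_mb(group_wide)  # roughly 4
size_mb(group_long)  # roughly 16
```

After nesting by group, the group variable is stored only once per group, which accounts for the differences between the nested and non-nested rows of Table 1.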

Table 1
Memory usage for each format of the same data

Format                Memory (MB)
Wide Format                  40.0
Nested Wide Format           36.0
Long Format                  80.0
Nested Long Format           64.0

Discussion

List-columns are a useful approach to structuring data into a format that can be safely
cleaned, manipulated, and analyzed by groups. They also provide a more cognitively efficient
way for a user to understand their data, allowing large data to be represented more concisely
within groups.

The tidyverse provides several functions to work with nested data, which are relatively
quick and efficient. For most data situations, these functions will do all that a user will
need. However, in some situations, data.table can perform needed manipulations and analyses
that cannot otherwise be done or that would take too long to complete. In these situations,
and for users who prefer to use data.table, this tutorial can help provide direction in using
list-columns.

Furthermore, as expected, the memory usage of nested data is lower than that of the
corresponding non-nested data. This is due to the reduction in the redundancies present in
wide and long format. This suggests that the cognitive benefits to the user are not the only
thing that makes this format more efficient.

Limitations

There are some notable limitations to list-columns in general, and in data.table
specifically. First, the three custom functions built on data.table presented herein are not
well-tested and are certainly not expected to work in each case where dplyr::group_nest(),
tidyr::unnest(), and other tidy functions would work. Rather, they were presented to show how
a user can leverage the speed and efficiency of data.table to create, and work with,
list-columns.

Second, it is important to realize that nested data can remove the ability to use
vectorization across groups. Depending on the analyses being conducted, this may slow down
the computation to the point that nested data actually is a hindrance to performance.
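
As an illustration of this trade-off, the sketch below computes the same group means in two ways on simulated data (an assumption, not the article's data). The first expression lets data.table handle the grouped calculation in a single call, while the second nests the data and then calls mean() once per group.

```r
library(data.table)

# Simulated toy data (hypothetical): 100,000 rows in about 1,000 groups
set.seed(1)
dt <- data.table(g = sample(1:1000, 1e5, replace = TRUE),
                 x = rnorm(1e5))

# Grouped calculation on the flat table
direct <- dt[, .(m = mean(x)), keyby = g]

# Same calculation after nesting: one function call per group
nested <- dt[, .(data = list(.SD)), keyby = g]
nested[, m := vapply(data, function(d) mean(d$x), numeric(1))]

# Both approaches give the same means
all.equal(direct$m, nested$m)
```

With many small groups, the per-group function calls in the nested version are typically the slower route; timing both with bench::mark() on the data at hand is the safest way to decide.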

Finally, when using list-columns in tibbles, the print method provides the dimensions of each
corresponding nested tibble. This method is helpful in understanding the nested data without
any need to extract it. This could be a minor, but valuable, update to the print method in
data.table.

Conclusions

The use of list-columns in data.table is very similar to that in the tidyverse. It provides
speed and efficiency in both nesting and unnesting the data, and can be used with the
purrr::map*() and other powerful functions.

References

Aust, F., & Barth, M. (2018). papaja: Create APA manuscripts with R Markdown. Retrieved
from https://github.com/crsh/papaja
Bryan, J. (2018). List columns (as part of "purrr tutorial"). Retrieved from
https://jennybc.github.io/purrr-tutorial/ls13_list-columns.html
Clarke, E., & Sherrill-Mix, S. (2017). ggbeeswarm: Categorical scatter (violin point) plots.
Retrieved from https://github.com/eclarke/ggbeeswarm
Dowle, M., & Srinivasan, A. (2019). data.table: Extension of 'data.frame'. Retrieved from
https://CRAN.R-project.org/package=data.table
Hester, J. (2019). bench: High precision timing of R expressions. Retrieved from
https://CRAN.R-project.org/package=bench
Lüdecke, D., Makowski, D., & Waggoner, P. (2019). performance: Assessment of regression
models performance. Retrieved from https://easystats.github.io/performance/
R Core Team. (2018). R: A language and environment for statistical computing. Vienna,
Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
Slowikowski, K. (2018). ggrepel: Automatically position non-overlapping text labels with
'ggplot2'. Retrieved from https://CRAN.R-project.org/package=ggrepel
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New
York. Retrieved from https://ggplot2.tidyverse.org
Wickham, H. (2019a). lobstr: Visualize R data structures with trees. Retrieved from
https://CRAN.R-project.org/package=lobstr
Wickham, H. (2019b). rvest: Easily harvest (scrape) web pages. Retrieved from
https://CRAN.R-project.org/package=rvest
Wickham, H. (2019c). stringr: Simple, consistent wrappers for common string operations.
Retrieved from https://CRAN.R-project.org/package=stringr
Wickham, H., François, R., Henry, L., & Müller, K. (2019). dplyr: A grammar of data
manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr
Wickham, H., & Henry, L. (2019). tidyr: Tidy messy data. Retrieved from
https://CRAN.R-project.org/package=tidyr