
List-columns in data.table: Nesting and unnesting data tables and vectors

Tyson S. Barrett¹

¹ Utah State University

Abstract

The use of list-columns in data frames and tibbles in the R statistical environment is well documented (e.g. Bryan, 2018), providing a cognitively efficient way to organize results of complex data (e.g. several statistical models, groupings of text, data summaries, or even graphics) with corresponding data. For example, one can store student information within classrooms, player information within teams, or even analyses within groups. This allows the data to be of variable sizes without overly complicating or adding redundancies to the structure of the data. In turn, this can make it easier to analyze the data appropriately and reliably.

Because of its efficiency and speed, being able to use data.table to work with list-columns would be beneficial in many data contexts (e.g. to reduce memory usage in large data sets). Herein, I demonstrate how one can create list-columns in a data table using the by argument in data.table and purrr::map(). This is done using an example data set containing information on professional basketball players in the United States. I compare the behavior of the data.table approach to the dplyr::group_nest() function, one of the several powerful tidyverse nesting functions. Results using bench::mark() show the speed and efficiency of using data.table to work with list-columns.

Keywords: data.table, dplyr, list-columns, nesting

Correspondence concerning this article should be addressed to Tyson S. Barrett, 2800 Old Main, Logan, UT 84322. E-mail: [email protected]

Introduction

The use of list-columns in data frames and tibbles in the R statistical environment (R Core Team, 2018) provides a cognitively efficient way to organize complex data (e.g. several statistical models, groupings of text, data summaries, or even graphics) with corresponding data in a concise manner. It has become a common approach to wrangling data in the tidyverse, with functions across dplyr and tidyr providing functionality to work with list-columns (Bryan, 2018; Wickham et al., 2019; Wickham & Henry, 2019). This format is often called “nested” data, where information is, in essence, nested within a column of data.
For example, list-columns can be used to nest data regarding students within classrooms, players within teams, measures within individuals, and text within chapters. This allows the user to do certain data manipulations within each group in a consistent, controlled manner, which helps prevent data from other groups being included by accident. Furthermore, nesting can make it easier to analyze the data stored in the list-column appropriately. Using functions like lapply() or purrr::map*() makes further analysis of the nested data more intuitive and less error-prone.
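
As a small illustration of working within groups, the sketch below builds a nested data frame by hand and applies base R's lapply() and vapply() to its list-column. The classroom names and scores are made up for illustration and are not part of the example data used in this paper.

```r
# Toy nested data frame: one row per classroom, with each classroom's
# student scores stored in a list-column (all values are hypothetical)
classrooms <- data.frame(classroom = c("A", "B"))
classrooms$scores <- list(c(90, 85, 77), c(60, 72))

# lapply() returns a list, so its result can be stored as another list-column;
# each classroom is centered on its own mean, never on another classroom's data
classrooms$centered <- lapply(classrooms$scores, function(s) s - mean(s))

# vapply() returns an atomic vector: one summary value per classroom
classrooms$mean_score <- vapply(classrooms$scores, mean, numeric(1))
classrooms$mean_score
```

Because each function call only ever sees one element of the list-column, the computation for one classroom cannot pick up rows from another.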
Because of its efficiency and speed, being able to use data.table to work with list-columns would be beneficial in many data contexts (e.g. to reduce memory usage in large data sets, increase speed of calculations). Herein, I demonstrate how one can create list-columns in a data table using the by argument in data.table (using a custom function) and the purrr::map*() functions. I further highlight the dplyr::group_nest() function and show a slightly more efficient approach when using a data table. Using bench::mark(), I assess the speed and efficiency of using data.table to work with list-columns.

This article relies on several powerful R packages, including data.table, dplyr, bench, tidyr, papaja, stringr, ggplot2, ggbeeswarm, ggrepel, performance, rvest, and lobstr (Aust & Barth, 2018; Clarke & Sherrill-Mix, 2017; Dowle & Srinivasan, 2019; Hester, 2019; Lüdecke, Makowski, & Waggoner, 2019; Slowikowski, 2018; Wickham, 2016, 2019c, 2019b, 2019a; Wickham et al., 2019; Wickham & Henry, 2019).

Example Data

Throughout much of this paper, I demonstrate the use of list-columns in data.table using data from NBA Stuffer. These data, downloaded from the site, provide information on players from the 2017-2018 and 2018-2019 seasons. To obtain them, I first read in the HTML data, then extract the tables with player data by year (using a short custom function), add indicators, and then combine each into a single data table. Each step is shown in the code below.
library(rvest)       # read_html(), html_nodes(), html_table()
library(dplyr)       # mutate(), bind_rows(), rename(), and the %>% pipe
library(janitor)     # clean_names(); this function comes from the janitor package
library(data.table)

url_2018 <- "https://www.nbastuffer.com/2017-2018-nba-player-stats/"
url_2019 <- "https://www.nbastuffer.com/2018-2019-nba-player-stats/"
players_2018 <- read_html(url_2018)
players_2019 <- read_html(url_2019)

# Pull the second <table> element from the page and return it as a data frame
extract_fun <- function(html){
  tabs <- html_nodes(html, "table")[2] %>%
    html_table(fill = TRUE)
  tabs[[1]]
}

# Add a season indicator to each table
player_2018 <-
  extract_fun(players_2018) %>%
  mutate(year = 2018,
         AGE = as.numeric(AGE))
player_2019 <-
  extract_fun(players_2019) %>%
  mutate(year = 2019)

# Stack both seasons, tidy the column names, and convert to a data table
players <-
  bind_rows(player_2018, player_2019) %>%
  clean_names() %>%
  rename(ppg = ppg_points_points_per_game,
         apg = apg_assists_assists_per_game) %>%
  data.table()

Below is a subset of this imported data set, showing only six of the variables and the first six rows.

##         full_name team year  mpg  ppg apg
## 1:   Aaron Brooks  Min 2018  5.9  2.3 0.6
## 2:   Aaron Gordon  Orl 2018 32.9 17.6 2.3
## 3: Aaron Harrison  Dal 2018 25.9  6.7 1.2
## 4:  Aaron Jackson  Hou 2018 34.5  8.0 1.0
## 5:    Abdel Nader  Bos 2018 10.9  3.0 0.5
## 6:  Adreian Payne  Orl 2018  8.5  4.2 0.0

Nesting with data.table

In dplyr, the group_nest() function is valuable when creating list-columns based on a grouping variable. It takes the data by group and puts it all in a list-column. Figure 1 highlights the process of taking a data frame and creating a nested data frame with a list-column. That is, all data from variables x, y, and z relating to each group is split into a distinct data frame and stored within the data column.

Overall, this function is efficient and fast, but an approach that relies on data.table can be somewhat faster. This will be shown using the following function, which relies solely on the syntax of data.table, using the j and by arguments as shown below.
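group_nest_dt <- function(dt, ..., .key = "data"){
  stopifnot(is.data.table(dt))

  # Capture the bare grouping variable names as an unevaluated list(...) call
  by <- substitute(list(...))

  # Store each group's .SD (all non-grouping columns) in a list-column
  dt <- dt[, list(list(.SD)), by = eval(by)]

  # The unnamed list-column is called "V1" by default; rename it to .key
  setnames(dt, old = "V1", new = .key)
  dt
}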

The first thing to note is that, within the data table, we create a list within a list containing the .SD special object. This object holds all the data in the data table except for the variables that are in the by argument. The by argument first becomes an unevaluated list of bare variable names and is then evaluated within the data.table syntax. In essence, this function takes a data table and creates a list-column holding one data table per group specified in the by argument.
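
To make the role of .SD concrete, here is a minimal sketch of the same nesting step written directly in data.table syntax. The toy data mirror the x, y, and z variables of Figure 1; the values and the object names toy and toy_nested are hypothetical.

```r
library(data.table)

# Toy data (hypothetical values): three variables measured in two groups
toy <- data.table(group = c("a", "a", "b", "b", "b"),
                  x = 1:5,
                  y = c(2.5, 3.1, 0.4, 1.8, 2.2),
                  z = letters[1:5])

# .SD holds every column except `group`; wrapping it in list() twice
# stores one data table per group in a single list-column named `data`
toy_nested <- toy[, list(data = list(.SD)), by = group]

# The second element holds the three rows that belong to group "b"
toy_nested$data[[2]]
```

Naming the list-column inside j, as done here, is an alternative to renaming V1 afterwards with setnames().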
head(group_nest_dt(players, team))

##    team         data
## 1:  Min <data.table>
## 2:  Orl <data.table>
## 3:  Dal <data.table>
## 4:  Hou <data.table>
## 5:  Bos <data.table>
## 6:  Ind <data.table>

The syntax and output are nearly identical to those of the dplyr::group_nest() function, but the list-column holds data tables instead of tibbles.
head(group_nest(players, team))

## # A tibble: 6 x 2
##   team  data
##   <chr> <list>
## 1 Atl   <tibble [44 x 30]>
## 2 Bos   <tibble [37 x 30]>
## 3 Bro   <tibble [41 x 30]>
## 4 Cha   <tibble [34 x 30]>
## 5 Chi   <tibble [43 x 30]>
## 6 Cle   <tibble [49 x 30]>

Figure 1. Diagram of one approach to creating a list-column in a data frame (i.e. creating a nested data frame).

Figure 2. Speed comparisons for each nesting approach. Note the scale of the y-axis is log10.

Given both perform very similar data manipulations, it is of interest to see if there are differences in memory and speed performance. Figure 2 presents the timings from bench::mark() across the two approaches, showing group_nest_dt() is often faster, although differences for this size of data set are not meaningful. The memory allocated is also very similar, with group_nest_dt() allocating 463KB and group_nest() allocating 335KB.
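
A comparison of this kind could be set up roughly as follows. This is only a sketch on simulated data, not the benchmark behind Figure 2: the toy table, the number of iterations, and the conversion to a tibble before calling group_nest() are all assumptions made for illustration, and timings will vary by machine.

```r
library(data.table)
library(dplyr)
library(bench)

# Simulated stand-in for the players data (hypothetical values)
set.seed(1)
toy <- data.table(team = sample(letters[1:5], 1000, replace = TRUE),
                  ppg = rnorm(1000),
                  apg = rnorm(1000))
toy_tbl <- as_tibble(toy)

# check = FALSE because the two results differ in class
# (data tables vs. tibbles) even though they hold the same information
timings <- bench::mark(
  data_table = toy[, list(data = list(.SD)), by = team],
  dplyr      = group_nest(toy_tbl, team),
  check = FALSE,
  iterations = 20
)

timings[, c("expression", "median", "mem_alloc")]
```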
This nesting approach can be used with multiple grouping variables too. For example, I show how a user could nest by both team and year, as is done below.
head(group_nest_dt(players, team, year))

##    team year         data
## 1:  Min 2018 <data.table>
## 2:  Orl 2018 <data.table>
## 3:  Dal 2018 <data.table>
## 4:  Hou 2018 <data.table>
## 5:  Bos 2018 <data.table>
## 6:  Ind 2018 <data.table>

Analyses within the Nested Data

Often, the nested data can provide an intuitive format to run several analyses to understand key features of the data within the groups. Below, the relationship between points-per-game and assists-per-game for each team and year is modeled and then the R² of each model is extracted. Since performance::r2() provides two versions of R², I then grab only the first of the two types.
players_nested <- group_nest_dt(players, team, year)
players_nested[, ppg_apg := purrr::map(data, ~lm(ppg ~ apg, data = .x))]
players_nested[, r2_list := purrr::map(ppg_apg, ~performance::r2(.x))]
players_nested[, r2_ppg_apg := purrr::map_dbl(r2_list, ~.x[[1]])]
head(players_nested)

##    team year         data ppg_apg      r2_list r2_ppg_apg
## 1:  Min 2018 <data.table>    <lm> <r2_generic>  0.4662060
## 2:  Orl 2018 <data.table>    <lm> <r2_generic>  0.4357684
## 3:  Dal 2018 <data.table>    <lm> <r2_generic>  0.4305347
## 4:  Hou 2018 <data.table>    <lm> <r2_generic>  0.6967150
## 5:  Bos 2018 <data.table>    <lm> <r2_generic>  0.6043402
## 6:  Ind 2018 <data.table>    <lm> <r2_generic>  0.6060465

This produces two list-columns (ppg_apg and r2_list) and a numeric vector (r2_ppg_apg), all organized by team and year. This information is then readily available to plot. For example, one can look at how related points-per-game and assists-per-game are by team and year, which in essence shows which teams have players who both score and assist. The example plot is shown in Figure 3.
players_nested %>%
  dcast(team ~ year, value.var = "r2_ppg_apg") %>%
  ggplot(aes(`2018`, `2019`, group = team)) +
    geom_point() +
    geom_text_repel(aes(label = team)) +
    geom_abline(slope = 1) +
    coord_fixed(ylim = c(0, 1),
                xlim = c(0, 1))

Unnesting with data.table

After performing the manipulations or analyses within the nest, it is often necessary to unnest the data to finalize analyses. As with group_nest_dt(), the unnest_dt() function below relies solely on the syntax of data.table, using the j and by arguments.
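unnest_dt <- function(dt, col, id){
  stopifnot(is.data.table(dt))

  # Capture the grouping variables and build the unevaluated unlist() call
  by <- substitute(id)
  col <- substitute(unlist(col, recursive = FALSE))

  # Evaluate both within the data table, group by group
  dt[, eval(col), by = eval(by)]
}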

Figure 3. Example analysis performed using nested data to provide information for each team and year.

This function can be used to unnest a data table, like the players_nested data table from earlier, where the nested column can be a data table, data frame, or tibble. Below, the data column in the table is unnested by team and year and then a few of the variables are selected for demonstration purposes.
players_unnested <- unnest_dt(players_nested,
                              col = data,
                              id = list(team, year))
players_unnested[, .(team, year, full_name, pos, age, gp, mpg)]

##       team year          full_name pos   age gp  mpg
##    1:  Min 2018       Aaron Brooks  PG 33.00 32  5.9
##    2:  Min 2018     Andrew Wiggins  SF 22.00 82 36.3
##    3:  Min 2018      Anthony Brown  SF 25.00  1  3.7
##    4:  Min 2018       Cole Aldrich   C 29.00 21  2.3
##    5:  Min 2018       Derrick Rose  PG 29.00  9 12.4
##   ---
## 1223:  Det 2019     Svi Mykhailiuk   G 21.84  3  6.6
## 1224:  Det 2019      Zaza Pachulia   C 35.16 68 12.9
## 1225:  Det 2019 Glenn Robinson III G-F 25.26 47 13.0
## 1226:  Det 2019          Ish Smith   G 30.77 56 22.3
## 1227:  Det 2019       Khyri Thomas   G 22.92 26  7.5

Again, this function is quick and efficient. Figure 4 presents the timings from bench::mark() across the two unnesting approaches, showing the data.table approach is much faster. The memory allocated is about half for the data.table approach here, with unnest_dt() allocating 912KB and tidyr::unnest() allocating 1.83MB.
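
The unnesting step can also be traced on a small, self-contained input. The sketch below nests and then unnests a toy data table directly in data.table syntax; the values and the object names are hypothetical.

```r
library(data.table)

# Toy data (hypothetical values), nested by group
toy <- data.table(group = c("a", "a", "b"),
                  x = 1:3,
                  y = c(0.5, 1.5, 2.5))
toy_nested <- toy[, list(data = list(.SD)), by = group]

# Within each group, `data` is a list of length one holding a data table.
# unlist(recursive = FALSE) strips that outer list, leaving the list of columns
# that data.table then binds back together next to the grouping variable.
toy_unnested <- toy_nested[, unlist(data, recursive = FALSE), by = group]
toy_unnested
```

The round trip returns the original columns (group, x, and y) with one row per original observation.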

Figure 4. Speed comparisons for each unnesting approach. Note the scale of the y-axis is log10.

Unnesting Vectors with data.table

A slight variation of this function can be used for list-columns with atomic vectors instead of data tables. A function like the following works well.
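unnest_vec_dt <- function(dt, cols, id, name){
  stopifnot(is.data.table(dt))

  # Capture the grouping variables and build the unevaluated unlist() call
  by <- substitute(id)
  cols <- substitute(unlist(cols,
                            recursive = FALSE))

  dt <- dt[, eval(cols), by = eval(by)]

  # Unnamed result columns are called V1, V2, ...; rename them to `name`
  setnames(dt, old = paste0("V", seq_along(name)), new = name)
  dt
}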

In players_nested, the r2_list column is a list of numeric vectors. This can be unnested as shown below, providing the two measures of R² per team per year.
unnest_vec_dt(players_nested,
cols = list(r2_list),
id = list(team, year),
name = "r2")

##      team year        r2
##   1:  Min 2018  0.466206
##   2:  Min 2018 0.4280779
##   3:  Orl 2018 0.4357684
##   4:  Orl 2018 0.4025783
##   5:  Dal 2018 0.4305347
##  ---
## 116:  Lac 2019 0.4808586
## 117:  Phi 2019 0.4342685
## 118:  Phi 2019 0.4106964
## 119:  Det 2019 0.5740963
## 120:  Det 2019  0.550435
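
For a plain list-column of numeric vectors, the same idea can be sketched without the helper function. The identifiers and scores below are hypothetical, and the result column is named inside j rather than renamed afterwards.

```r
library(data.table)

# Toy data table with a list-column of numeric vectors of unequal length
# (hypothetical values)
dt <- data.table(id = c("p1", "p2"),
                 scores = list(c(0.47, 0.43), c(0.44, 0.40, 0.39)))

# unlist() flattens each group's vector; the grouping variable is repeated
# once for every value that the vector contained
long <- dt[, list(score = unlist(scores)), by = id]
long
```

The result has one row per value, so the two vectors of length two and three yield five rows in total.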

Memory Usage of List-Columns

The last item to demonstrate is the computer memory usage of different formats of data tables holding the same data. We can use the following large data sets in wide format, nested wide format, long format, and nested long format to make brief comparisons.
# Wide
wide_format <- data.table(id = 1:1e6,
                          x1 = rnorm(1e6),
                          x2 = rnorm(1e6),
                          y1 = rnorm(1e6),
                          y2 = rnorm(1e6),
                          group = rbinom(1e6, 1, .5))
nested_wide_format <- group_nest_dt(wide_format, group)

# Long
long_format <- melt.data.table(wide_format,
                               id.vars = c("id", "group"),
                               measure.vars = c("x1", "x2", "y1", "y2"))
nested_long_format <- group_nest_dt(long_format, group)

I use the lobstr package to assess the object size of each format of the same data, shown in Table 1. Not surprisingly, the memory usage of nested data is lower than that of the corresponding non-nested data. This follows directly from removing redundancies that are otherwise present in the data. That is, the nested data has far fewer rows containing the group variable. In data this large, that alone saves memory. For example, the size of a single column of the group variable in wide format is 4 MB, and in long format it is 16 MB. By reducing a single variable in this case, we save several megabytes of memory.
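
The 4 MB and 16 MB figures follow from simple arithmetic: R stores each integer in 4 bytes, so one million integers take about 4 MB, and melting four measure columns repeats the group variable four times. The sketch below checks this with base R's object.size(), used here in place of lobstr so that no extra package is needed; the helper size_mb() is a hypothetical convenience function.

```r
n <- 1e6

# rbinom() returns an integer vector, like the group column in wide_format
group_wide <- rbinom(n, 1, 0.5)

# Melting four measure columns repeats the group variable four times
group_long <- rep(group_wide, times = 4)

# Hypothetical helper: object size in megabytes (1 MB taken as 1e6 bytes)
size_mb <- function(x) as.numeric(object.size(x)) / 1e6

size_mb(group_wide)   # roughly 4 MB: 1e6 integers at 4 bytes each
size_mb(group_long)   # roughly 16 MB: 4e6 integers at 4 bytes each
```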

Table 1
Memory usage for each format of the same data

Format               Memory (MB)
Wide Format                 40.0
Nested Wide Format          36.0
Long Format                 80.0
Nested Long Format          64.0

Discussion

List-columns are a useful approach to structuring data into a format that can be safely cleaned, manipulated, and analyzed by groups. They also provide a more cognitively efficient way for users to understand their data, allowing large data to be represented more concisely within groups.
The tidyverse provides several functions to work with nested data, which are relatively quick and efficient. For most data situations, these functions will do all that a user will need. However, in some situations, data.table can perform needed manipulations and analyses that cannot otherwise be done or that would take too long to complete. In these situations, and for users who prefer to use data.table, this tutorial can help provide direction in using list-columns.
Furthermore, as expected, the memory usage of nested data is lower than that of the corresponding non-nested data. This is due to the reduction in the redundancies present in wide and long format. This suggests that the cognitive benefits to the user are not the only thing that makes this format more efficient.

Limitations

There are some notable limitations to list-columns in general, and in data.table specifically. First, the three custom functions built on data.table presented herein are not well-tested and are certainly not expected to work in each case where dplyr::group_nest(), tidyr::unnest(), and other tidy functions would work. Rather, they were presented to show how a user can leverage the speed and efficiency of data.table to create, and work with, list-columns.
Second, it is important to realize that nested data can remove the ability to use vectorization across groups. Depending on the analyses being conducted, this may slow down the computation to the point that nested data actually is a hindrance to performance.
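
The second limitation can be illustrated with a simple summary. In the sketch below, which uses hypothetical toy data, the unnested table computes every group's mean in a single grouped call, whereas the nested table needs one function call per group to reach the same result.

```r
library(data.table)

# Toy data (hypothetical values): three groups of four observations
set.seed(42)
toy <- data.table(group = rep(c("a", "b", "c"), each = 4),
                  x = rnorm(12))

# Unnested: a single grouped call computes every group's mean
flat_means <- toy[, list(mean_x = mean(x)), by = group]

# Nested: the same summary requires a separate function call per group
toy_nested <- toy[, list(data = list(.SD)), by = group]
toy_nested[, mean_x := vapply(data, function(d) mean(d$x), numeric(1))]

# Both routes give the same values
all.equal(flat_means$mean_x, toy_nested$mean_x)
```

With only three groups the difference is negligible, but with many small groups the per-group function calls may come to dominate the run time.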
Finally, when using list-columns in tibbles, the print method provides the dimensions of each corresponding nested tibble. This method is helpful in understanding the nested data without any need to extract it. This could be a minor, but valuable, update to the print method in data.table.
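
Until then, one possible workaround is to store the dimensions in an ordinary character column. This is only a sketch, not a change to the data.table print method; the toy data and the column name dims are hypothetical.

```r
library(data.table)

# Toy nested data table (hypothetical values)
toy <- data.table(group = c("a", "a", "b"),
                  x = 1:3,
                  y = c(0.5, 1.5, 2.5))
toy_nested <- toy[, list(data = list(.SD)), by = group]

# Record "rows x columns" for each nested data table
toy_nested[, dims := vapply(data,
                            function(d) paste(dim(d), collapse = " x "),
                            character(1))]

toy_nested[, list(group, dims)]
```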

Conclusions

The use of list-columns in data.table is very similar to that in the tidyverse. It provides speed and efficiency in both nesting and unnesting the data, and can be used with the purrr::map*() and other powerful functions.

References

Aust, F., & Barth, M. (2018). papaja: Create APA manuscripts with R Markdown. Retrieved from https://github.com/crsh/papaja
Bryan, J. (2018). List columns (as part of "purrr tutorial"). Retrieved from https://jennybc.github.io/purrr-tutorial/ls13_list-columns.html
Clarke, E., & Sherrill-Mix, S. (2017). Ggbeeswarm: Categorical scatter (violin point) plots. Retrieved from https://github.com/eclarke/ggbeeswarm
Dowle, M., & Srinivasan, A. (2019). Data.table: Extension of 'data.frame'. Retrieved from https://CRAN.R-project.org/package=data.table
Hester, J. (2019). Bench: High precision timing of R expressions. Retrieved from https://CRAN.R-project.org/package=bench
Lüdecke, D., Makowski, D., & Waggoner, P. (2019). Performance: Assessment of regression models performance. Retrieved from https://easystats.github.io/performance/
R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
Slowikowski, K. (2018). Ggrepel: Automatically position non-overlapping text labels with 'ggplot2'. Retrieved from https://CRAN.R-project.org/package=ggrepel
Wickham, H. (2016). Ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. Retrieved from https://ggplot2.tidyverse.org
Wickham, H. (2019a). Lobstr: Visualize R data structures with trees. Retrieved from https://CRAN.R-project.org/package=lobstr
Wickham, H. (2019b). Rvest: Easily harvest (scrape) web pages. Retrieved from https://CRAN.R-project.org/package=rvest
Wickham, H. (2019c). Stringr: Simple, consistent wrappers for common string operations. Retrieved from https://CRAN.R-project.org/package=stringr
Wickham, H., François, R., Henry, L., & Müller, K. (2019). Dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr
Wickham, H., & Henry, L. (2019). Tidyr: Tidy messy data. Retrieved from https://CRAN.R-project.org/package=tidyr
