There are two calls to model.frame() where only the terms attribute is needed: one in dummy.R (prep.step_dummy) and one in misc.R (get_rhs_vars). My understanding is that when only the terms information is needed, you can call model.frame() on the first row alone and get equivalent results. This should reduce processing time and memory consumption, particularly for large datasets. The following example shows that the terms information is the same (bench::mark() checks by default that both expressions return equal results), while processing time and memory use are greatly reduced. This will likely improve the performance of almost all recipes.
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#>     step
library(bench)
n<-5e6
nsub<-5
set.seed(1)
dat<- tibble(
num_col= rnorm(n),
int_col= sample.int(50, n, replace=TRUE),
char_col= rep(paste('a', 1:nsub), each=n/nsub),
fac_col=factor(rep(paste('a', 1:nsub), each=n/nsub)),
dt_col= as.POSIXct('2010-01-01', tz='UTC') +1:n)
form<- as.formula(num_col~.)
bench::mark(
attr(model.frame(form, dat), 'terms'),
attr(model.frame(form, dat[1,]), 'terms')
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                                     min  median `itr/sec` mem_alloc
#>   <bch:expr>                                 <bch:t> <bch:t>     <dbl> <bch:byt>
#> 1 attr(model.frame(form, dat), "terms")        616ms   616ms      1.62     903MB
#> 2 attr(model.frame(form, dat[1, ]), "terms")   357µs   370µs   2049.       126KB
#> # … with 1 more variable: gc/sec <dbl>
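To make the proposal concrete, here is a minimal sketch of what the change could look like. The helper name `terms_from_first_row()` is hypothetical and not part of recipes; the small data frame is made up for illustration. It only compares the `term.labels` and `dataClasses` attributes, which I assume are the pieces the two call sites read.

```r
# Hypothetical helper (not part of recipes): build the terms object from
# the first row only, so model.frame() never copies the full data.
terms_from_first_row <- function(formula, data) {
  # drop = FALSE keeps a one-column data.frame from collapsing to a vector
  first_row <- data[1, , drop = FALSE]
  attr(model.frame(formula, first_row), "terms")
}

# Small made-up data set with the same mix of column types as above
small <- data.frame(
  y = c(1.5, 2.5, 3.5, 4.5),
  x = c(1L, 2L, 3L, 4L),
  f = factor(c("a", "b", "a", "b")),
  stringsAsFactors = FALSE
)
form <- y ~ .

full  <- attr(model.frame(form, small), "terms")
first <- terms_from_first_row(form, small)

# Compare the attributes that downstream code is assumed to use
same_labels  <- identical(attr(full, "term.labels"), attr(first, "term.labels"))
same_classes <- identical(attr(full, "dataClasses"), attr(first, "dataClasses"))
```

One thing I have not handled here is zero-row input: `data[1, , drop = FALSE]` on an empty data frame returns a row of NA values, so a real implementation would probably need a guard for that case.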
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.
Created on 2021-06-14 by the reprex package (v2.0.0)