Speed improvement for model.frame terms attributes #726

Closed
jkennel opened this issue Jun 15, 2021 · 4 comments
Labels
feature a feature request or enhancement

Comments

@jkennel
Contributor

jkennel commented Jun 15, 2021

Feature

There are two calls to model.frame where only the terms attribute is needed: one in dummy.R (prep.step_dummy) and one in misc.R (get_rhs_vars). My understanding is that when only the term info is needed, you can call model.frame on the first row and get equivalent results. This should reduce processing time and memory consumption, particularly for large datasets. The following example shows that the term info is the same, while processing time and memory use are greatly reduced. This will likely improve the performance of almost all recipes.

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(bench)

n <- 5e6
nsub <- 5
set.seed(1)

dat <- tibble(
  num_col  = rnorm(n), 
  int_col  = sample.int(50, n, replace = TRUE),
  char_col = rep(paste('a', 1:nsub), each = n / nsub),
  fac_col  = factor(rep(paste('a', 1:nsub), each = n / nsub)),
  dt_col   = as.POSIXct('2010-01-01', tz = 'UTC') + 1:n)

form <- as.formula(num_col~.)

bench::mark(
  attr(model.frame(form, dat), 'terms'),
  attr(model.frame(form, dat[1,]), 'terms')
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                                     min  median `itr/sec` mem_alloc
#>   <bch:expr>                                 <bch:t> <bch:t>     <dbl> <bch:byt>
#> 1 attr(model.frame(form, dat), "terms")        616ms   616ms      1.62     903MB
#> 2 attr(model.frame(form, dat[1, ]), "terms")   357µs   370µs   2049.       126KB
#> # … with 1 more variable: gc/sec <dbl>

Created on 2021-06-14 by the reprex package (v2.0.0)
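
For a more explicit check than the benchmark, here is a minimal base R sketch that compares the pieces of the terms object directly on a small data frame. The helper name terms_from_first_row is hypothetical and is not part of recipes; it only illustrates the idea of calling model.frame on a one-row slice.

```r
# Hypothetical helper (not part of recipes): build the model frame from a
# one-row slice and return only its terms attribute.
terms_from_first_row <- function(formula, data) {
  attr(model.frame(formula, data[1, , drop = FALSE]), "terms")
}

set.seed(1)
n <- 1000
nsub <- 5

# Small stand-in for the large tibble in the benchmark above.
dat <- data.frame(
  num_col  = rnorm(n),
  int_col  = sample.int(50, n, replace = TRUE),
  char_col = rep(paste("a", 1:nsub), each = n / nsub),
  fac_col  = factor(rep(paste("a", 1:nsub), each = n / nsub)),
  stringsAsFactors = FALSE
)

form <- num_col ~ .

full  <- attr(model.frame(form, dat), "terms")
first <- terms_from_first_row(form, dat)

# Compare the parts of the terms object that describe the predictors.
identical(attr(full, "term.labels"), attr(first, "term.labels"))
identical(attr(full, "dataClasses"), attr(first, "dataClasses"))
```

Both comparisons should return TRUE here, since the expanded formula and the column classes do not depend on how many rows are passed. The shortcut only applies when nothing other than the terms attribute is read from the model frame.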

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.1.0 (2021-05-18)
#>  os       Pop!_OS 20.04 LTS           
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language en_CA:en                    
#>  collate  en_CA.UTF-8                 
#>  ctype    en_CA.UTF-8                 
#>  tz       America/Toronto             
#>  date     2021-06-14                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version     date       lib source        
#>  assertthat    0.2.1       2019-03-21 [1] CRAN (R 4.1.0)
#>  backports     1.2.1       2020-12-09 [1] CRAN (R 4.1.0)
#>  bench       * 1.1.1       2020-01-13 [1] CRAN (R 4.1.0)
#>  class         7.3-19      2021-05-03 [4] CRAN (R 4.0.5)
#>  cli           2.5.0       2021-04-26 [1] CRAN (R 4.1.0)
#>  crayon        1.4.1       2021-02-08 [1] CRAN (R 4.1.0)
#>  DBI           1.1.1       2021-01-15 [1] CRAN (R 4.1.0)
#>  digest        0.6.27      2020-10-24 [1] CRAN (R 4.1.0)
#>  dplyr       * 1.0.6       2021-05-05 [1] CRAN (R 4.1.0)
#>  ellipsis      0.3.2       2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi         0.5.0       2021-05-25 [1] CRAN (R 4.1.0)
#>  fs            1.5.0       2020-07-31 [1] CRAN (R 4.1.0)
#>  generics      0.1.0       2020-10-31 [1] CRAN (R 4.1.0)
#>  glue          1.4.2       2020-08-27 [1] CRAN (R 4.1.0)
#>  gower         0.2.2       2020-06-23 [1] CRAN (R 4.1.0)
#>  highr         0.9         2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools     0.5.1.1     2021-01-22 [1] CRAN (R 4.1.0)
#>  ipred         0.9-11      2021-03-12 [1] CRAN (R 4.1.0)
#>  knitr         1.33        2021-04-24 [1] CRAN (R 4.1.0)
#>  lattice       0.20-44     2021-05-02 [4] CRAN (R 4.1.0)
#>  lava          1.6.9       2021-03-11 [1] CRAN (R 4.1.0)
#>  lifecycle     1.0.0       2021-02-15 [1] CRAN (R 4.1.0)
#>  lubridate     1.7.10      2021-02-26 [1] CRAN (R 4.1.0)
#>  magrittr      2.0.1       2020-11-17 [1] CRAN (R 4.1.0)
#>  MASS          7.3-54      2021-05-03 [4] CRAN (R 4.0.5)
#>  Matrix        1.3-4       2021-06-01 [4] CRAN (R 4.1.0)
#>  nnet          7.3-16      2021-05-03 [4] CRAN (R 4.0.5)
#>  pillar        1.6.1       2021-05-16 [1] CRAN (R 4.1.0)
#>  pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.1.0)
#>  prodlim       2019.11.13  2019-11-17 [1] CRAN (R 4.1.0)
#>  profmem       0.6.0       2020-12-13 [1] CRAN (R 4.1.0)
#>  purrr         0.3.4       2020-04-17 [1] CRAN (R 4.1.0)
#>  R6            2.5.0       2020-10-28 [1] CRAN (R 4.1.0)
#>  Rcpp          1.0.6       2021-01-15 [1] CRAN (R 4.1.0)
#>  recipes     * 0.1.16.9000 2021-06-15 [1] local         
#>  reprex        2.0.0       2021-04-02 [1] CRAN (R 4.1.0)
#>  rlang         0.4.11      2021-04-30 [1] CRAN (R 4.1.0)
#>  rmarkdown     2.8         2021-05-07 [1] CRAN (R 4.1.0)
#>  rpart         4.1-15      2019-04-12 [4] CRAN (R 4.0.0)
#>  rstudioapi    0.13        2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 4.1.0)
#>  stringi       1.6.2       2021-05-17 [1] CRAN (R 4.1.0)
#>  stringr       1.4.0       2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.4.1       2021-03-30 [1] CRAN (R 4.1.0)
#>  survival      3.2-11      2021-04-26 [4] CRAN (R 4.0.5)
#>  tibble        3.1.2       2021-05-16 [1] CRAN (R 4.1.0)
#>  tidyselect    1.1.1       2021-04-30 [1] CRAN (R 4.1.0)
#>  timeDate      3043.102    2018-02-21 [1] CRAN (R 4.1.0)
#>  utf8          1.2.1       2021-03-12 [1] CRAN (R 4.1.0)
#>  vctrs         0.3.8       2021-04-29 [1] CRAN (R 4.1.0)
#>  withr         2.4.2       2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun          0.23        2021-05-15 [1] CRAN (R 4.1.0)
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 4.1.0)
#> 
#> [1] /home/jonathankennel/R/x86_64-pc-linux-gnu-library/4.1
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
@topepo topepo added the feature a feature request or enhancement label Jun 15, 2021
@topepo
Member

topepo commented Jun 15, 2021

It's a great observation. Thanks. Would you be interested in making a PR?

@jkennel
Contributor Author

jkennel commented Jun 15, 2021

Sure. I'll do this today.

@juliasilge
Member

Closed in #727

@github-actions

github-actions bot commented Jul 4, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 4, 2021