The Joy of Sweave: A Beginner's Guide To Reproducible Research With Sweave
Mario Pineda-Krch
Centre for Mathematical Biology, University of Alberta
http://pineda-krch.com
January 17, 2011
The Joy of Sweave: A Beginner's Guide to Reproducible Research with Sweave by Mario Pineda-Krch is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Reproducible Research
"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures." (Jon Claerbout)
- Reproducibility is one of the cornerstones of the modern scientific method.
- Traditionally, papers publishing experimental and mathematical results contain sufficient information to reproduce the results, e.g. empirical methods or mathematical proofs.
- Reproducible research in the computational sciences is about reproducing computational results, e.g. simulation and analysis results (not reproducing experimental results), using the same methods (algorithms, seed) as in the paper.
- The majority of computational research is not easily reproducible because:
  - Algorithms are typically not described in published papers.
  - Journals do not require computer code to be deposited in a repository (see e.g. Dryad (http://datadryad.org/), KNB (http://knb.ecoinformatics.org/), and GenBank (http://www.ncbi.nlm.nih.gov/genbank/)).
  - The code/documentation mismatch fallacy.
- Sweave provides a solution to this problem.
What is LaTeX?
"Is LaTeX hard to use? It's easy to use if you're one of the 2% of the population who thinks logically and can read an instructional manual. The other 98% of the population would find it very hard or impossible to use." (Leslie Lamport)
- A document preparation system for high-quality typesetting.
- First developed in 1985 by Leslie Lamport and based on Donald E. Knuth's TeX typesetting language.
- Designed for the production of technical and scientific documentation.
- Based on the idea that it is better to leave document design to document designers, and to let authors get on with writing documents.
- Automatic generation of bibliographies and indexes.
- Pronounced "Lah-tech" or "Lay-tech".
- The Comprehensive TeX Archive Network, aka CTAN: the authoritative collection of materials related to the TeX typesetting system. (http://www.ctan.org/)
- The Not So Short Introduction to LaTeX2e: the unofficial manual. (http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf)
- LaTeX at Wikibooks: wiki guide to the LaTeX markup language. (http://en.wikibooks.org/wiki/LaTeX)
- The comprehensive LaTeX symbol list: lists 2826 symbols and the corresponding LaTeX commands and packages necessary to produce them. (http://statweb.calpoly.edu/jdoi/web/reference/symbols-a4.pdf)
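The non-dimensional version of the Hastings & Powell (1991) model is given by,
\begin{equation}
\begin{aligned}
\frac{dx}{dt} &= x(1 - x) - \frac{a_1 x}{1 + b_1 x}\, y \\
\frac{dy}{dt} &= \frac{a_1 x}{1 + b_1 x}\, y - \frac{a_2 y}{1 + b_2 y}\, z - d_1 y \\
\frac{dz}{dt} &= \frac{a_2 y}{1 + b_2 y}\, z - d_2 z
\end{aligned}
\end{equation}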
What is R?
R is a language and environment for statistical computing, computational research, and graphics. (Freely adapted from http://www.r-project.org/about.html)
- Highly extensible via user-developed packages; 2751 packages (at CRAN alone) and counting.
- Much of R is written in R.
- Command line interface and scriptable.
- Easily integrates with low-level languages (e.g. C/C++ and Fortran).
- Source code freely available, allowing for algorithm transparency and modification.
- Free as in freedom and priceless.
- Compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
- R-Forge (https://r-forge.r-project.org/): central platform for the development of R packages, R-related software and further projects.
- Stack Overflow (http://stackoverflow.com/questions/tagged/r): collaboratively built and maintained programming Q&A site.
- Journal of Statistical Software (http://www.jstatsoft.org/): publishes articles, book reviews, code snippets, and software. Lots of R stuff. Open source (i.e. free access). See in particular the Special Volume on Ecology and Ecological Modelling in R (http://www.jstatsoft.org/v22).
- The R Journal (http://journal.r-project.org/): peer-reviewed journal focusing on introductions to and reviews of packages, R programming tips and tricks, etc.
- GillespieSSA (http://pineda-krch.com/gillespiessa/): a package providing an interface to several stochastic simulation algorithms for generating simulated trajectories of finite population continuous-time models.
R + LaTeX = Sweave
"Sweave provides a flexible framework for mixing text and S code for automatic document generation. A single source file contains both documentation text and S code, which are then woven into a final document containing the documentation text together with the S code and/or the output of the code (text, graphs)." (Friedrich Leisch)
- A function in R.
- Allows for integration of code (R) with prose (LaTeX).
- A Sweave source is a simple text file consisting of a sequence of code and documentation segments, aka chunks.
- Extremely simple syntax: once you know R and LaTeX, learning Sweave is trivial.
- Enables the creation of dynamic documents: R code is executed and the results (output, graphics) are incorporated when the document is generated.
- Easy to re-run the code and regenerate the documentation if the input changes.
- Can make computational research more transparent and reproducible to others and to oneself.
- Enclose R code chunks between <<>>= (on a line of its own) and @ (on a line of its own).
- To produce the documentation, weave the source like so:
  - R CMD Sweave foo.Rnw from the shell
  - Sweave("foo.Rnw") from within R.
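As a minimal sketch of the weaving step described above, the following R snippet writes a tiny Sweave file and weaves it from within R. The file contents and the temporary directory are illustrative assumptions, not the foo.Rnw used in these slides; only Sweave() from the utils package that ships with R is called.

```r
# Create a scratch directory so the demo does not touch real project files.
workdir <- tempfile("sweave-weave-")
dir.create(workdir)

# A hypothetical, minimal foo.Rnw: LaTeX text plus one R code chunk.
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of the first ten integers is",
  "<<>>=",
  "mean(1:10)",
  "@",
  "\\end{document}"
), file.path(workdir, "foo.Rnw"))

# Weave: run the chunk and write foo.tex into the working directory.
oldwd <- setwd(workdir)
utils::Sweave("foo.Rnw")
setwd(oldwd)

# The woven LaTeX contains both the echoed code and its printed result.
tex <- readLines(file.path(workdir, "foo.tex"))
print(tex)
```

Turning foo.tex into a PDF is a separate pdflatex step and needs the Sweave.sty style file that is distributed with R.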
Define the nondimensional system in R,
<<>>=
hp = function(time, population, parms) {
  x <- population[1]
  y <- population[2]
  z <- population[3]
  with(as.list(parms), {
    dx = x*(1 - x) - (a1*x)/(1 + b1*x)*y
    dy = (a1*x)/(1 + b1*x)*y - (a2*y)/(1 + b2*y)*z - d1*y
    dz = (a2*y)/(1 + b2*y)*z - d2*z
    out = c(dx, dy, dz)
    list(out)
  })
}
@
I eyeballed the initial conditions from Figure 2 in HP91.
<<>>=
x0 <- c(x=0.75, y=0.15, z=10)
@
Declare the time vector,
<<>>=
time <- seq(0, 5000)
@
Here I am using the same parameters as in Figure 2 in HP91 (which is identical to Figure 2 in KH94),
<<>>=
parms <- c(a1=5.0, b1=2.5, a2=0.1, b2=2.0, d1=0.4, d2=0.01)
@
Solve the system numerically,
<<>>=
require(deSolve)
out <- as.data.frame(ode(x0, time, hp, parms))
@
Look at the structure of the result object,
<<>>=
str(out)
@
and the beginning of the time series,
<<>>=
head(out)
@
\end{document}
Weaving
The non-dimensional version of the Hastings & Powell (1991) model is given by,
\begin{equation}
\begin{aligned}
\frac{dx}{dt} &= x(1 - x) - \frac{a_1 x}{1 + b_1 x}\, y \\
\frac{dy}{dt} &= \frac{a_1 x}{1 + b_1 x}\, y - \frac{a_2 y}{1 + b_2 y}\, z - d_1 y \\
\frac{dz}{dt} &= \frac{a_2 y}{1 + b_2 y}\, z - d_2 z
\end{aligned}
\end{equation}
Define the nondimensional system in R,
> hp = function(time, population, parms) {
+     x <- population[1]
+     y <- population[2]
+     z <- population[3]
+     with(as.list(parms), {
+         dx = x * (1 - x) - (a1 * x)/(1 + b1 * x) * y
+         dy = (a1 * x)/(1 + b1 * x) * y - (a2 * y)/(1 + b2 * y) *
+             z - d1 * y
+         dz = (a2 * y)/(1 + b2 * y) * z - d2 * z
+         out = c(dx, dy, dz)
+         list(out)
+     })
+ }
I eyeballed the initial conditions from Figure 2 in HP91.
> x0 <- c(x = 0.75, y = 0.15, z = 10)
Declare the time vector,
> time <- seq(0, 5000)
Here I am using the same parameters as in Figure 2 in HP91 (which is identical to Figure 2 in KH94),
> parms <- c(a1 = 5, b1 = 2.5, a2 = 0.1, b2 = 2, d1 = 0.4, d2 = 0.01)
Solve the system numerically,
> require(deSolve)
> out <- as.data.frame(ode(x0, time, hp, parms))
Look at the structure of the result object,
> str(out)
'data.frame':   5001 obs. of  4 variables:
 $ time: num  0 1 2 3 4 5 6 7 8 9 ...
 $ x   : num  0.75 0.732 0.696 0.644 0.578 ...
 $ y   : num  0.15 0.173 0.201 0.234 0.267 ...
 $ z   : num  10 10 10.1 10.1 10.2 ...
and the beginning of the time series,
> head(out)
  time         x         y        z
1    0 0.7500000 0.1500000 10.00000
2    1 0.7318531 0.1729883 10.02179
3    2 0.6959543 0.2013125 10.05784
4    3 0.6441029 0.2339504 10.10971
5    4 0.5779474 0.2674447 10.17767
6    5 0.5021714 0.2942321 10.25963
Tangling
Stangle("foo.Rnw") extracts the R code from the chunks and writes it to foo.R:
###################################################
### chunk number 1:
###################################################
#line 19 "listing3.Rnw"
hp = function(time, population, parms) {
  x <- population[1]
  y <- population[2]
  z <- population[3]
  with(as.list(parms), {
    dx = x*(1 - x) - (a1*x)/(1 + b1*x)*y
    dy = (a1*x)/(1 + b1*x)*y - (a2*y)/(1 + b2*y)*z - d1*y
    dz = (a2*y)/(1 + b2*y)*z - d2*z
    out = c(dx, dy, dz)
    list(out)
  })
}

###################################################
### chunk number 2:
###################################################
#line 35 "listing3.Rnw"
x0 <- c(x=0.75, y=0.15, z=10)

###################################################
### chunk number 3:
###################################################
#line 40 "listing3.Rnw"
time <- seq(0, 5000)

###################################################
### chunk number 4:
###################################################
#line 45 "listing3.Rnw"
parms <- c(a1=5.0, b1=2.5, a2=0.1, b2=2.0, d1=0.4, d2=0.01)

###################################################
### chunk number 5:
###################################################
#line 50 "listing3.Rnw"
require(deSolve)
out <- as.data.frame(ode(x0, time, hp, parms))

###################################################
### chunk number 6:
###################################################
#line 56 "listing3.Rnw"
str(out)

###################################################
### chunk number 7:
###################################################
#line 61 "listing3.Rnw"
head(out)
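The tangling step can be tried in isolation with the sketch below. It assumes a hypothetical one-chunk foo.Rnw written to a temporary directory (the chunk reuses the time vector from the listing), tangles it with Stangle() from the utils package, and then runs the extracted script in a fresh environment.

```r
# Scratch directory for the demo.
workdir <- tempfile("sweave-tangle-")
dir.create(workdir)

# A hypothetical foo.Rnw with a single code chunk.
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Declare the time vector,",
  "<<>>=",
  "time <- seq(0, 5000)",
  "@",
  "\\end{document}"
), file.path(workdir, "foo.Rnw"))

# Tangle: extract only the R code and write it to foo.R.
oldwd <- setwd(workdir)
utils::Stangle("foo.Rnw")
setwd(oldwd)

# Inspect the tangled script.
code <- readLines(file.path(workdir, "foo.R"))
print(code)

# Run the tangled script in its own environment, independently of the document.
env <- new.env()
source(file.path(workdir, "foo.R"), local = env)
length(env$time)
```

Because the tangled file is plain R, it can be handed to a collaborator or a reviewer who wants to re-run the analysis without installing LaTeX.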
Adding figures
\begin{figure}[!h]
\begin{center}
<<fig=TRUE, width=7, height=7>>=
plot(y ~ x, data=out, cex=.5, pch=19,
     xlab=expression(italic(x)),
     ylab=expression(italic(y)))
@
\end{center}
\caption{Phase plane plot of the resource ($x$) and the predator $y$.}
\end{figure}
> plot(y ~ x, data = out, cex = 0.5, pch = 19, xlab = expression(italic(x)),
+     ylab = expression(italic(y)))
[Plot of y against x]
Figure 1: Phase plane plot of the resource (x) and the predator y.
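A self-contained way to see what fig=TRUE does is sketched below. The chunk plots the logistic term x(1 - x) instead of the simulated trajectory, so that it runs without deSolve; the file contents, caption, and temporary directory are assumptions made for the illustration.

```r
# Scratch directory for the demo.
workdir <- tempfile("sweave-fig-")
dir.create(workdir)

# A hypothetical foo.Rnw whose only chunk produces a figure.
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "\\begin{figure}[!h]",
  "\\begin{center}",
  "<<fig=TRUE, width=7, height=7>>=",
  "x <- seq(0, 1, by = 0.01)",
  "plot(x, x * (1 - x), type = \"l\")",
  "@",
  "\\end{center}",
  "\\caption{The logistic term $x(1-x)$.}",
  "\\end{figure}",
  "\\end{document}"
), file.path(workdir, "foo.Rnw"))

# Weave: the plot is written to a graphics file and included in foo.tex.
oldwd <- setwd(workdir)
utils::Sweave("foo.Rnw")
setwd(oldwd)

tex  <- readLines(file.path(workdir, "foo.tex"))
figs <- list.files(workdir, pattern = "\\.pdf$")

# The chunk is replaced by an \includegraphics line pointing at the new file.
print(grep("includegraphics", tex, value = TRUE))
print(figs)
```

The width and height options are given in inches and only set the size of the graphics file; the LaTeX figure environment around the chunk still controls placement and caption.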
Congratulations, you have now reproduced the result in Hastings & Powell (1991).
Eh? What's going on, Alan?
Try setting z = 0 (left as an exercise for the reader).
Inline evaluations
<<>>=
a <- a
@
We can print parameter values like this, $a=\Sexpr{a}$.
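To see the substitution happen, the sketch below weaves a hypothetical three-line document in which a is assigned the value 5 (an assumed value chosen for the illustration) and then checks that \Sexpr{a} has been replaced in the woven LaTeX.

```r
# Scratch directory for the demo.
workdir <- tempfile("sweave-sexpr-")
dir.create(workdir)

# A hypothetical foo.Rnw: one chunk defines a, one sentence uses \Sexpr{a}.
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<>>=",
  "a <- 5",
  "@",
  "We can print parameter values like this, $a=\\Sexpr{a}$.",
  "\\end{document}"
), file.path(workdir, "foo.Rnw"))

# Weave the document.
oldwd <- setwd(workdir)
utils::Sweave("foo.Rnw")
setwd(oldwd)

# The inline expression is evaluated and its value pasted into the sentence.
tex <- readLines(file.path(workdir, "foo.tex"))
print(grep("parameter values", tex, value = TRUE))
```

Since the value is converted to a character string as is, rounding (for example with round() or format()) has to be done inside the \Sexpr{} expression when fewer digits are wanted.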
Caveat: do not forget to remove foo.data if you intend to recreate it (e.g. if the time-consuming algorithm has changed).
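The caveat applies to any scheme that stores the result of a slow computation on disk and reloads it on later runs. One possible form of such a scheme is sketched below; slow_algorithm(), run_cached(), and the location of foo.data are hypothetical names used only for this illustration, not code from the slides.

```r
# Hypothetical stand-in for a time-consuming algorithm.
slow_algorithm <- function() {
  Sys.sleep(0.1)
  cumsum(1:10)
}

# Cache file; the name follows the caveat above, the location is an assumption.
cache_file <- file.path(tempdir(), "foo.data")
if (file.exists(cache_file)) file.remove(cache_file)  # start from a clean state

# Load the cached result if it exists, otherwise compute and store it.
run_cached <- function() {
  if (file.exists(cache_file)) {
    load(cache_file)                 # restores the object `result`
  } else {
    result <- slow_algorithm()
    save(result, file = cache_file)
  }
  result
}

first  <- run_cached()  # computes the result and writes foo.data
second <- run_cached()  # reads foo.data instead of recomputing

# Deleting the cache file forces the next run to recompute.
file.remove(cache_file)
```

The weakness is the one the caveat names: the cache knows nothing about the code that produced it, so a stale foo.data is silently reused until it is deleted.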
- Avoid monolithic Sweave files: split them into smaller logical components.
- Use a Makefile to build a project that consists of multiple Sweave files (http://www.stat.auckland.ac.nz/~stat782/downloads/make-tutorial.pdf).
Makefile
An embarrassingly trivial Makefile,
foo.pdf: foo.tex
	pdflatex foo.tex; pdflatex foo.tex

foo.tex: foo.Rnw
	R CMD Sweave foo.Rnw

foo.R: foo.Rnw
	R CMD Stangle foo.Rnw

clean:
	rm -rf *.eps *.pdf *.tex *.log *.aux
make examples
To generate documentation run,
make
To weave (generate LaTeX source) run,
make foo.tex
Session information
<<results=tex>>=
toLatex(sessionInfo())
@
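What this chunk contributes to the woven document can be previewed at the R prompt. The sketch below only calls sessionInfo() and toLatex() from the utils package; the variable name info is arbitrary.

```r
# Collect the R version, platform, and attached packages of this session.
info <- toLatex(sessionInfo())

# The result is a character vector of LaTeX source lines (an itemize list).
class(info)
print(info)
```

With results=tex, Sweave pastes these lines into the document as raw LaTeX instead of wrapping them in a verbatim output block, so the session details are typeset as an ordinary bulleted list.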
Acknowledgements
- Alan Hastings for pointing out the virtues of z = 0.
- The Lewis Research Group for feedback and discussion.
Thank you for your time!
These slides, foo.Rnw and the associated Makefile are available at http://pineda-krch.com/2011/01/17/the-joy-of-sweave/