Active Documents With Org-Mode
By Eric Schulte and Dan Davison
Org-mode is a simple, plain-text markup language for hierarchical documents that allows the intermingling
of data, code, and prose.
Org-mode is implemented as part of the Emacs text editor [1]. It was initially developed as a simple outlining tool intended for note taking and brainstorming. It was later augmented with task management tools (letting researchers transform notes into tasks with deadlines and priorities) and with syntax for the inclusion of tables, data blocks, and active code blocks. Users new to Org-mode often start with its simple plain-text note-taking system, then move on to increasingly sophisticated features as their comfort level permits.

In reproducible research (RR), researchers publish scientific results along with the software environment and data required to reproduce all computational analyses in the publication [2]. Reproducibility is essential to peer-reviewed research, but scientific publications often lack the information required for reviewers to reproduce the analysis described in the work. As Jonathan Buckheit and David L. Donoho noted [3],

“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and complete set of instructions, which generated the figures.”

Org-mode supports RR with syntax for including inline data and code, mechanisms for evaluating embedded code, and publishing functionality that might be used to automate the computational analysis and generation of figures. Here, we focus on the Org-mode features that support the practice of RR; information on other aspects of Org-mode can be found in the manual (http://orgmode.org/manual) [4] and in the community wiki (http://orgmode.org/worg).

The plain text Org-mode source of this article is available for download at https://github.com/eschulte/CiSE/raw/master/org-mode-active-doc.org (see the sidebar “How to Download this Document” for more information). Readers with the requisite open-source software can execute the source code examples, which analyze a dataset and create graphics, as well as export the complete paper to one of several output formats.

Syntax
Org-mode documents are plain text files organized using a hierarchical outline defined by a number of simple syntactical rules.

Outlines
The outline can be folded and expanded, hiding or exposing as much of the document as wanted. Using this facility, even very large documents can be comfortably navigated in a manner similar to that of a file system. Headlines are indicated by leading *’s, as in the folded view of this article in Figure 1. The ellipses at the end of each line indicate that the heading’s content is hidden from view. Notice that the heading beginning with the keyword COMMENT is not included in the exported document. Org-mode uses many such keywords for associating information with headlines.

Code and Data
Using a simple block syntax, both code and data can be embedded in Org-mode documents as follows:

First a data block.
#+begin_example
raw textual data
#+end_example

Second a code block.
#+begin_src sh
echo "shell script code"
#+end_src

Code and data blocks can be named, allowing their contents to be referenced from elsewhere in the Org-mode file. Figure 2 shows an example, in which the shell script references the data block’s content.

Cross references between an Org-mode file’s code and data elements turn Org-mode into a powerful, multilingual programming environment in which data and code expressed in many different programming languages can interact.

Evaluation
Code and data references make chained evaluation strings possible.
#+source: configuration
#+begin_src emacs-lisp :results silent
;; first it is necessary to ensure that Org-mode loads support for the
;; languages used by code blocks in this article
(org-babel-do-load-languages
 'org-babel-load-languages
 '((sh . t)
   (org . t)
   (emacs-lisp . t)
   (python . t)
   (R . t)
   (gnuplot . t)))
;; then we'll remove the need to confirm evaluation of each code
;; block, NOTE: if you are concerned about execution of malicious code
;; through code blocks, then comment out the following line
(setq org-confirm-babel-evaluate nil)
;; finally we'll customize the default behavior of Org-mode code blocks
;; so that they can be used to display examples of Org-mode syntax
(setf org-babel-default-header-args:org '((:exports . "code")))
#+end_src
Figure A. The emacs-lisp code block to configure Org-mode to export this article.
Figure 3 shows the series of actions that result when the analyze code block is evaluated interactively or during export.

1. The analyze code block is evaluated. The :var data=data header argument causes Org-mode to evaluate the data reference.
2. To resolve this reference, the data code block is located in the Org-mode file and is evaluated.
3. The :var raw=raw header argument causes Org-mode to resolve the raw reference.
4. The raw code block is evaluated, causing the :var url=http://data.org header argument to be evaluated as a literal value that’s assigned to the url variable and passed to the shell script. The shell script then downloads data from the external url and makes these data available to Org-mode.
5. The results of the shell script are assigned to the raw variable, which is passed to the Python
May/June 2011 3
* Introduction...
* Syntax...
** Outlines...
** Code and Data...
* Evaluation...
* Example Application...
** Download External Data...
** Parsing...
** Analysis...
** Display...
* Conclusion...
* COMMENT How to Export this Document...
* Footnotes...
Figure 1. The folded view of this article. Headlines are indicated by leading *’s.

Major League Baseball (MLB) games in the 2010 season. We hypothesize what every baseball fan wants to believe: that large crowds spur the home team to superior performance levels. We found and report on the offensive statistic that has the largest correlation with high attendance.

Export
Figure 3. Active Org-mode document. Variables of the analyze code block reference the results of previous code blocks (shown on the right). In resolving these references, the referenced code blocks are evaluated, and their results are passed back to the analyze code block (on the left side).
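
The reference resolution described for Figure 3 can be illustrated with a minimal Python sketch. It is only a model of the idea, not Org-mode's implementation: the `blocks` table, the `resolve` function, and the result cache are assumptions introduced here, while the block names (analyze, data, raw) and the literal url value come from the text above. The sketch does not guard against circular references.

```python
# Hypothetical model: each named block maps to
# (variables -> referenced block name or literal value, function producing a result).
blocks = {
    "raw": ({"url": "http://data.org"}, lambda url: f"downloaded from {url}"),
    "data": ({"raw": "raw"}, lambda raw: f"parsed({raw})"),
    "analyze": ({"data": "data"}, lambda data: f"analysis of {data}"),
}


def resolve(name, blocks, cache=None):
    """Evaluate block `name`, first evaluating every block it references."""
    if cache is None:
        cache = {}
    if name in cache:
        return cache[name]
    variables, body = blocks[name]
    args = {}
    for var, ref in variables.items():
        # A value naming another block is evaluated recursively;
        # anything else is passed through as a literal.
        args[var] = resolve(ref, blocks, cache) if ref in blocks else ref
    cache[name] = body(**args)
    return cache[name]


result = resolve("analyze", blocks)
print(result)
```

Evaluating analyze triggers data, which triggers raw, and each result is handed back up the chain as a variable value, mirroring the numbered steps.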
#+source: url
#+begin_src sh :var season=season :exports none
echo "http://www.retrosheet.org/gamelogs/gl$season.zip"
#+end_src
Figure 5. The URL code block. This block translates the numerical 2010 season into the URL for the website that collects
Major League Baseball statistics.
#+source: raw-data
#+headers: :exports none
#+begin_src sh :cache yes :var url=url :file 2010.csv
wget $url && \
unzip -p gl2010.zip > 2010.csv && \
rm gl2010.zip
#+end_src
Figure 6. The raw-data shell code block. The zip file of statistics located at the specified url is downloaded and its contents
are unpacked into a local text file named 2010.csv.
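
For readers without wget and unzip, the unpacking step of Figure 6 could be approximated in Python. This is a sketch under assumptions: `unpack_all` is a hypothetical helper, the download itself is left out so the example runs offline, and the archive built below is a stand-in rather than the real game-log file.

```python
import io
import zipfile


def unpack_all(zip_bytes):
    """Concatenate the contents of every archive member, like `unzip -p`."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return b"".join(archive.read(name) for name in archive.namelist())


# Build a small stand-in archive in memory (hypothetical file names and contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("part1.txt", "one\n")
    archive.writestr("part2.txt", "two\n")

unpacked = unpack_all(buffer.getvalue())
print(unpacked)
```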
#+source: stat-headers
#+headers: :exports none
#+begin_src python :results list :cache yes :return fields
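# NOTE: urllib2 exists only in Python 2
import urllib2
url = 'http://www.retrosheet.org/gamelogs/glfields.txt'
fp = urllib2.urlopen(url)
fields = []
for line in fp:
    # find the start of the visiting team's offensive statistics
    if line.find('Visiting team offensive statistics') != -1:
        line = fp.readline()
        # collect field names until the pitching statistics begin
        while line.find('Visiting team pitching statistics') == -1:
            # keep only lines with a non-space character at index 13
            if line[13] != ' ':
                fields.append(line.strip().split('.')[0].split('(')[0])
            line = fp.readline()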
#+end_src
#+results[97fdb2368b66e48faa6afb8b6eff34e00f05633b]: stat-headers
- at-bats
- hits
- doubles
- triples
- homeruns
- RBI
- sacrifice hits
- sacrifice flies
- hit-by-pitch
- walks
- intentional walks
- strikeouts
- stolen bases
- caught stealing
- grounded into double plays
- awarded first on catcher's interference
- left on base
Figure 7. The stat-headers Python code block. This block returns a list of the names of the offensive statistics to test
for correlation with attendance.
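
Because urllib2 is not available in Python 3, the same parsing logic can be tried offline on a list of lines. The following is a sketch only: `offensive_fields` is a hypothetical function, and the sample lines are invented to resemble the layout the original block appears to expect (field names starting at index 13), not copied from the real glfields.txt.

```python
def offensive_fields(lines):
    """Collect field names listed between the offensive and pitching headers."""
    fields = []
    collecting = False
    for line in lines:
        if 'Visiting team pitching statistics' in line:
            break
        # Keep lines whose character at index 13 is not a space,
        # mirroring the test in the stat-headers block.
        if collecting and len(line) > 13 and line[13] != ' ':
            fields.append(line.strip().split('.')[0].split('(')[0])
        if 'Visiting team offensive statistics' in line:
            collecting = True
    return fields


PAD = ' ' * 13  # assumed indentation of field-name lines
sample = [
    '  22-38   Visiting team offensive statistics (in order):',
    PAD + 'at-bats',
    PAD + 'hits',
    PAD + '    a more deeply indented note that is skipped',
    PAD + 'sacrifice hits.  An explanatory remark',
    '  39-43   Visiting team pitching statistics (in order):',
    PAD + 'pitchers used',
]

parsed = offensive_fields(sample)
print(parsed)
```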
#+source: attendance
#+headers: :exports none
#+begin_src sh :var file=raw-data
awk '{ print $18 }' FS="," < $file
#+end_src
Figure 8. The offensive-stats and attendance shell code blocks. These blocks collect the offensive statistics and
attendance from the raw data file produced by the raw-data code block (see Figure 6).
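
The column extraction performed by awk in Figure 8 can also be sketched with Python's csv module. `column` is a hypothetical helper and the sample rows are made up. One difference worth noting: unlike awk with FS=",", the csv reader strips the surrounding quotes and tolerates commas inside quoted fields.

```python
import csv
import io


def column(csv_text, index):
    """Return the 1-based column `index` of every row, like awk's $index."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row[index - 1] for row in reader]


# Hypothetical three-column sample; the real game logs have many more fields.
sample_csv = '"ATL","Smith, J.",30000\n"SLN","Jones",25000\n'
third = column(sample_csv, 3)
print(third)
```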
#+source: analysis
#+headers: :var headers=stat-headers :var stats=offensive-stats
#+begin_src R :var attendance=attendance :exports none
# apply the headers to the list
colnames(stats) <- headers
## The following lines are required because parsing bugs are causing
## corrupt data in these two rows.
badrows <- c(141, 674)
stats <- stats[-badrows,]
attendance <- attendance[-badrows,]
attendance <- as.integer(attendance)
Figure 9. The analysis code block. This block uses the R statistical programming language to calculate correlations between the outputs of the offensive-stats and attendance code blocks (see Figure 8), whose values are saved into the stats and attendance variables, respectively.
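
The correlation step itself is not shown in the listing above, so the following Python sketch only illustrates the general idea of ranking statistics by their correlation with attendance. It assumes a Pearson correlation, which the source does not state; `pearson` and `strongest` are hypothetical names, and the numbers are invented sample data, not results from the article.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest(stats, attendance):
    """Name of the statistic with the largest absolute correlation."""
    return max(stats, key=lambda name: abs(pearson(stats[name], attendance)))


# Invented sample data for four games.
attendance = [10000, 20000, 30000, 40000]
stats = {"hits": [8, 7, 9, 8], "walks": [1, 2, 3, 4]}

best = strongest(stats, attendance)
print(best)
```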
gle place. This practice benefits readers, who can reproduce the calculations performed in the work and also extend the analysis, possibly within Org-mode itself. For example, […] season by simply changing the value of the season code block above and re-exporting the file.

Figure 10. Forced walks and attendance for the top five games by forced walks. Results indicate that the visiting team shares the fans’ belief in the effects of a large crowd.

tool, and others alleviate common burdens of practicing RR. Of the essential properties, arguably the most important is that, as part of Emacs, the Org-mode copyright is owned […] mode is now and always will be free and open source software. This directly relates to two RR goals. First, Org-mode
#+source: top-8
#+begin_src sh :var data=raw-data :exports none
cat $data|awk '{print $60,$18,$7"-"$4}' \
FS=","|sed 's/"//g'|sort -rn |head -5
#+end_src
#+source: figure
#+begin_src gnuplot :var data=top-8 :file plot.png :exports results
# set term tikz
# set output 'plot.tex'
set yrange [0:6]
set y2range [0:50000]
set key above
set y2tics border
set ylabel 'forced walks'
set y2label 'attendance'
set style fill pattern
set style data histogram
set style histogram clustered
set auto x
set xtic rotate by -45 scale 0
plot data using 1:xtic(3) title 'forced walks', \
     data using 2 axes x1y2 title 'attendance'
#+end_src

#+label: fig:top-5
#+attr_latex: width=0.8\textwidth
#+Caption: Top 5 games by forced walks, with forced walks and attendance shown.
#+results: figure
[[file:plot.png]]

Figure 11. The code for the number of forced walks and the attendance for the five games with the most forced walks.

is available free of charge to install by any user on any system, which ensures access to the software environment required for reproduction. Second, the source code specifying Org-mode’s inner workings is open to inspection, ensuring that the mechanisms through which Org-mode generates scientific results are open to review and verification.

In addition to its open source pedigree, Org-mode benefits in other ways from its Emacs relationship. Emacs is one of the world’s most widely ported pieces of software, with versions that run on all major operating systems. This ensures that Org-mode documents can be incorporated into almost any computer work environment. Emacs is also widely used by the scientific community for editing both prose documents and source code. By leveraging existing Emacs editing support, Org-mode can offer its users a comfortable and familiar editing environment for all content types. Finally, given Org-mode’s implementation in the Emacs extension language, Emacs Lisp [6], users can customize Org-mode’s behavior to their particular needs and support arbitrary new programming languages; Org-mode currently supports more than 30 programming languages.

Org-mode addresses many common problems in RR practice. Given that a single Org-mode document can be used for every stage of a research project (from brainstorming, software development, and experimentation to publication), Org-mode largely relieves authors of the burden of tracking resources required for reproducing their work. Although this information volume can result in extremely large files, Org-mode documents’ hierarchical folding lets users comfortably read and edit such files. The files themselves are encoded in plain text, which enhances their portability and makes them easy to integrate with version control systems, allowing for revision tracking and collaboration [7].

Org-mode documents run the gamut from simple collections of plain-text notes, to complex laboratories housing data and analysis mechanisms, to publishing desks with facilities for displaying and exporting scientific results. There’s a friendly community of Org-mode users and developers who communicate on the Org-mode mailing list (http://lists.gnu.org/mailman/listinfo/emacs-orgmode). By answering questions and helping each other master Org-mode’s many features, this community helps to solve one of the largest hurdles posed by any RR tool: learning how to use it.

References
1. R.M. Stallman, “Emacs the Extensible, Customizable Self-Documenting Display Editor,” ACM Sigplan Notices, vol. 16, no. 6, 1981, pp. 147–156.
2. S. Fomel and J.F. Claerbout, “Reproducible Research,” Computing in Science & Eng., vol. 11, no. 1, 2009, pp. 5–7.
3. J.B. Buckheit and D.L. Donoho, “WaveLab and Reproducible Research,” Wavelets and Statistics, Springer-Verlag, 1995.