
Comput Stat (2009) 24:217–223

DOI 10.1007/s00180-008-0117-9

ORIGINAL PAPER

Code analysis and parallelizing vector operations in R

Luke Tierney
Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242, USA
e-mail: [email protected]

Received: 5 March 2007 / Accepted: 19 April 2008 / Published online: 8 May 2008
© Springer-Verlag 2008

Abstract This paper presents some current work and preliminary thoughts on two
seemingly unrelated areas. The first is the development of code analysis tools to help
identify possible errors in R code. Current versions of these tools have been useful in
finding bugs in R’s code as well as code in packages submitted to CRAN. The second
area, where work is just beginning, is the development of mechanisms to allow R’s
internal vectorized operations, as well as vectorized operations defined in packages,
to take advantage of multiple processors. These two areas are related through their
connections to ongoing efforts to develop a byte code compiler for R.

Keywords Static analysis · Compilation · Software testing · Parallel computation

1 Introduction

R (R Development Core Team 2006) is a language for interactive data analysis and
graphics closely related to the S language (Becker et al. 1988; Chambers 1998). In
addition to being an interactive language, R is also a powerful high-level programming language that is well suited for expressing complex statistical computations. There can be a tension between language features that support interactive use and features that support the development of reliable, high-performance code. Some of the features designed to make interactive use easier, such as named arguments, partial matching of arguments, and lazy evaluation of arguments together with the capture of argument expressions (used, for example, for default plot annotation), can make identifying errors harder and can incur performance penalties. This paper describes two lines of work intended to improve the reliability and performance of R code while preserving the flexibility for interactive use. The first line of work is the development of code analysis tools for R;
the second is preliminary work on adding support for parallelized computation to the
R vectorized arithmetic engine. On the surface these two areas seem quite unrelated,
but there is in fact a strong synergy between the two.

2 Code analysis for R

The R package system provides an infrastructure for testing: example code and any test code included in a package’s tests directory are run by R CMD check (a minimal test file is sketched after this paragraph). Unit testing frameworks, e.g. RUnit (Burger et al. 2006), have also been developed. Testing is
an essential component of software quality assurance, but relying on testing alone has
drawbacks: Most tests need to be created manually, and complete coverage cannot be
guaranteed. Static code analysis is a useful supplement to testing.
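As a sketch, a file in a package’s tests directory is plain R code, and any error it signals causes R CMD check to fail (the file name below is hypothetical):

## tests/test-substring.R: run automatically by R CMD check
stopifnot(identical(substring("abcdef", 2, 4), "bcd"))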

2.1 Static code analysis

Static code analysis examines source code without executing it. Static analysis can
look at individual expressions, larger patterns of expressions, or relationships among
functions and modules. Nielson et al. (1998) provide an in-depth introduction to the
area.
For C programs, for example, the compiler carries out basic code analysis and
reports errors during the compilation process; if an error is encountered it is typical
for a C compiler to stop generating code but continue the code analysis phase in order
to provide the programmer with as much feedback as possible. More sophisticated
tools that analyze C code for usage patterns that lead to errors, such as null pointer dereferencing or buffer overflows, have recently been developed. Some of these tools have been used very successfully in finding previously unidentified bugs in the Linux kernel.
Most code analysis involves approximations. Many of the questions to which one
would ideally like answers are provably undecidable. False positives, and of course
false negatives, are unavoidable. For a code analysis framework to be effective, it is essential to be able to tune its specificity and sensitivity. Being able to rank potential issues so that those most likely to be of concern come first, analogous to the ordering of matches in search engines, can also be very helpful. Kremenek and Engler (2003) and Kremenek et al. (2004) report some early results on statistical ranking approaches for static analysis of C code.

2.2 Code analysis for R

The R language presents some unusual challenges to code analysis. Whether a variable
is global or local may depend on the value of arguments. Functions can create new
variables in their callers; this is in fact used in the function glm.fit where evaluating
family$initialize expressions can create new variables. Functions could also
remove variables from their callers, though this is fortunately very rare in practice. In addition, some functions use nonstandard evaluation of some of their arguments; the functions library and curve, and the standard link functions, are examples. Finally, in a call to a function like with, the evaluation environment is not statically available.
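The caller-modification issue can be illustrated with a small sketch in the spirit of glm.fit; the names init and fit are hypothetical, and only the eval pattern matters:

init <- quote(n <- length(y))   # an initialize-style expression
fit <- function(y) {
    eval(init)    # creates ‘n’ in fit’s frame only at run time
    n             # this binding is invisible to static analysis
}
fit(1:5)   # returns 5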
While these issues impose limitations on static analysis, it is nevertheless possible
to make significant progress. The codetools package represents a step in this direction. codetools analyzes expressions in the context of visible definitions. Issues codetools can detect include calls not consistent with visible function definitions, badly formed assignment expressions, use of apparently undefined functions or variables, calls with no visible function definition, local variables assigned but not used, and parameters changed by assignment. Not all of these are errors, but many suggest that a closer look at the code may be warranted.
The two main functions in the codetools package are checkUsage for checking individual R functions, and checkUsagePackage for checking all the definitions in a package. These functions take a range of optional arguments for selecting the issues to report or suppress. This approach to managing sensitivity and specificity is workable but less than ideal; future work will explore alternative approaches.
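For instance, a sketch using argument names from the codetools documentation (the argument all turns on checks that are suppressed by default):

library(codetools)

## check one function, enabling all issue types, including ones
## (such as unused-parameter checks) that are off by default
f <- function(x, unused) x + 1
checkUsage(f, name = "f", all = TRUE)

## check every function defined in an installed package
checkUsagePackage("stats", all = TRUE)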
As a simple example, the following artificial function contains a number of possible
errors:
g <- function(x, exp = TRUE) {
    nexp <- !exp
    if (exp)
        exp(x + 3) + ext(z - 3)
    else
        log(x, bace = 2)
}
The result of passing this function to checkUsage is
> checkUsage(g, name = "g")
g: no visible global function definition for ‘ext’
g: no visible binding for global variable ‘z’
g: possible error in log(x, bace = 2): unused argument(s) (bace = 2)
g: local variable ‘nexp’ assigned but may not be used
The function ext might be a misspelling of exp; z should perhaps have been x; and
the non-existent named argument bace in the call to log probably should have been
base.
Defining a local variable that is not used, like nexp in the previous example, is
a potential inefficiency, but it may also be indicative of a more serious problem. For
example, running checkUsagePackage on the base package in R 2.4.1 produces
> checkUsagePackage("base")
...
substring: local variable ‘x’ assigned but may not be used
...


The definition of substring is

substring <- function (text, first, last = 1e+06)
{
    if (!is.character(text))
        x <- as.character(text)
    n <- max(lt <- length(text), length(first), length(last))
    if (lt && lt < n)
        text <- rep(text, length.out = n)
    substr(text, first, last)
}
Careful reading of the code, encouraged by the warning message that the local variable
x is not used, shows that the intent was to coerce the value of text to a character
object. This will be corrected in a future R release.
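The presumable fix is to assign the coerced value back to text rather than to an unused local:

substring <- function (text, first, last = 1e+06)
{
    if (!is.character(text))
        text <- as.character(text)   # assign back to text, not to x
    n <- max(lt <- length(text), length(first), length(last))
    if (lt && lt < n)
        text <- rep(text, length.out = n)
    substr(text, first, last)
}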
Currently codetools is being used by a number of programmers for checking their packages. It is also being used routinely for screening of CRAN submissions. Another example of its use is provided by the weaver package of Falcon and Gentleman (2007). As of R version 2.5, codetools has been included as one of the recommended packages bundled with binary releases. Eventually the facilities provided by the codetools package may be integrated into the tools package.
Future work on codetools will include a framework to allow package writers to provide rules for handling functions that use non-standard evaluation. The possibility of allowing optional declarations within functions to inform codetools about whether a particular warning is appropriate will also be investigated. As a simple example, codetools can be asked to warn about unused arguments, but this check is currently turned off by default because it would lead to too many false positives. With a mechanism for declaring that a particular argument, for example a ... argument in a method, is intentionally ignored, this option can be turned on and may then identify some genuine issues. Future work will also explore the possibility of detecting usage patterns of one or more expressions that are suggestive of possible bugs or inefficiencies. Another direction to explore is how to provide mechanisms for supporting and integrating with editors and graphical user interfaces. With recent changes in the way source attributes are stored it may be possible to get more precise information on the location of a possible problem, and to allow a click on a warning message to bring up the relevant line of code in an editor window. The codetools framework can also be used to construct a static call graph that can then be displayed, again with possible editor integration.

3 Parallelizing vector operations

Multi-core processors are becoming increasingly common: many laptops have dual-core processors, and quad-core workstations are becoming available at affordable prices. In principle this allows speedups by a factor of 2 or 4. This is attractive if it can be achieved with little or no user effort, but may not be enough to justify the effort of explicit parallel programming.

Fig. 1 Parallelizing a vectorized function using two threads: an idealized view (left) and a more realistic view that accounts for synchronization overhead (right)

Automatic parallelization, from a user perspective, is
already possible by using a version of the BLAS that uses threads to parallelize some
BLAS routines on multi-processor hardware. This can provide significant speedups
for linear algebra computations that rely on the BLAS.
Research is now beginning to explore whether there are other aspects of the R
framework that can be parallelized similarly in a user-transparent way. The most
immediate candidate for automatic parallelization is the basic vectorized arithmetic
framework provided in R. Other possibilities include simple uses of the apply family
and the sweep function.
The basic idea for computing the value of a vectorized function f(x[1:n]) on data of length n in parallel on two processors is to divide the work between two worker threads, computing results for n/2 data values in each thread. The idealized view, shown on the left of Fig. 1, is that this will cut the time needed for the computation in half. A more realistic view, illustrated on the right of Fig. 1, is that there is synchronization overhead involved in initiating the computations on the threads and in coordinating the completion of the computations.
An implementation based on the POSIX threads standard would use a mutex and a condition variable for these synchronization tasks (Butenhof 1997). As a result, parallelization will only pay off if n is large enough.
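The division of work can be sketched at the R level; this only illustrates the idea (split_apply is a hypothetical name), since the real implementation would split the work inside R’s C-level vectorized loops:

## evaluate a vectorized f on two halves of x and recombine;
## in the real design each half would run on its own processor
split_apply <- function(f, x) {
    half <- seq_len(length(x) %/% 2)
    c(f(x[half]), f(x[-half]))
}
x <- runif(1000)
stopifnot(all.equal(split_apply(qnorm, x), qnorm(x)))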
How large n needs to be for parallel computation to be useful depends on a number of
factors. For example, preliminary experiments on a particular two-processor machine,
with a particular operating system, and a particular implementation strategy found that
for a computationally intensive function like qbeta, n ≈ 4 appears to be sufficient to warrant parallel computation; functions such as qnorm require n ≈ 400; and for basic arithmetic operations n ≈ 30,000 may be needed. Careful tuning strategies
that account for differences among functions as well as variations among hardware,
operating systems, and system loads will be needed to ensure that parallel vector
computations are successful.
The two projects discussed in this paper, code analysis and parallel vectorized arithmetic operations, seem on the surface quite unrelated, but there is a synergy: both are related to efforts on compilation of R. Developing a byte code compiler for R is an ongoing project (Tierney 2001). The current codetools implementation is a by-product of these efforts. The connection to vectorized operations is that many of these operations occur in compound expressions, like exp(-0.5*x^2). This example can be viewed, simplifying a bit, as the three operations of squaring, scaling, and exponentiating applied to a data vector. A parallelized interpreter will need to synchronize at the beginning and end of each of these elementary operations. With compilation, we can arrange for synchronization to occur only at the beginning and end of the full compound operation, with each thread running the three elementary operations for its share of the data to completion without intermediate synchronization. This is illustrated in Fig. 2.

Fig. 2 An illustration of the reduced synchronization in compound vector operations made possible by compilation: sequential, interpreted, and compiled (fused) execution of the SQUARE, SCALE, and EXP steps
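The effect of fusion can be mimicked at the R level (a sketch only; in the actual design the byte code interpreter would drive fused loops in compiled code):

x  <- rnorm(1e6)
y1 <- exp(-0.5 * x^2)   # interpreted: three separate synchronized passes

## fused view: each thread's chunk runs squaring, scaling, and
## exponentiating to completion, so synchronization happens only
## once for the whole compound expression
chunks <- split(x, rep(1:2, each = length(x) / 2))
y2 <- unlist(lapply(chunks, function(ci) exp(-0.5 * ci^2)),
             use.names = FALSE)
stopifnot(all.equal(y1, y2))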
As a result, parallel computation may be beneficial for much smaller data sizes. In a
similar vein, compilation may make it possible to identify uses of the apply functions
or the sweep function where computations can safely be carried out in parallel.
Work on parallel vectorized arithmetic is just beginning and many issues remain to be studied. As already mentioned, developing tuning strategies that address possible hardware and operating system differences as well as competing usage is an important step. In addition, for some functions, in particular ones with iterative implementations, performance may vary with the input values. Load balancing strategies may be important in these cases. Other issues include developing a mechanism for handling user interrupts in potentially long-running routines in a parallel context, and developing an infrastructure so that packages can provide vectorized routines that use the parallelization strategy. For compiled code this will require developing an extension mechanism for the byte code itself.

4 Conclusions

The work described in this paper is motivated by the desire to improve the correctness and the performance of R code. There is a strong synergy in that code analysis tools developed to identify possible errors can also be used to identify opportunities for performance improvement, either in an automated way through compilation or by suggesting more efficient rewritings. Substantial progress has been made on code analysis, but much more can be done. Parallelization work is just starting, but early results seem very promising. Hopefully there will be significant progress in the near future.

Acknowledgments Work on these projects was supported in part by National Science Foundation grant
DMS-0604593 and NIH grant HG002708. Some of the computations for this paper were performed on
equipment funded by National Science Foundation grant DMS-0618883.


References

Becker RA, Chambers JM, Wilks AR (1988) The new S language: a programming environment for data
analysis and graphics. Wadsworth, Belmont
Burger M, Juenemann K, Koenig T (2006) RUnit: R Unit test framework. R package version 0.4.14
Butenhof DR (1997) Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc.,
Boston
Chambers JM (1998) Programming with data: a guide to the S language. Springer, Heidelberg
Falcon S, Gentleman R (2007) The weaver package: tools and extensions for processing Sweave documents.
In: Proceedings of DSC 2007
Kremenek T, Engler DR (2003) Z-ranking: using statistical analysis to counter the impact of static analysis
approximations. In: Cousot R (ed) SAS. Lecture Notes in Computer Science, vol 2694. Springer,
Heidelberg, pp 295–315
Kremenek T, Ashcraft K, Yang J, Engler D (2004) Correlation exploitation in error ranking. SIGSOFT
Softw Eng Notes 29(6): 83–93
Nielson F, Nielson HR, Hankin C (1998) Principles of program analysis. Springer, Heidelberg
R Development Core Team (2006) R: a language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna. ISBN 3-900051-07-0
Tierney L (2001) Compiling R: a preliminary report. In: Proceedings of the 2nd international workshop on distributed statistical computing
