
Optimizing R VM: Allocation Removal and Path

Length Reduction via Interpreter-level Specialization


Haichuan Wang
University of Illinois at
Urbana-Champaign
[email protected]
Peng Wu
IBM T.J. Watson Research Center
[email protected]
David Padua
University of Illinois at
Urbana-Champaign
[email protected]
Abstract
The performance of R, a popular data analysis language, was never properly understood. Some claimed their R codes ran as efficiently as any native code; others quoted orders of magnitude slowdown of R codes with respect to equivalent C implementations. We found both claims to be true depending on how an R code is written. This paper introduces a first classification of R programming styles into Type I (looping over data), Type II (vector programming), and Type III (glue codes). The most serious overheads of R manifest mostly in Type I R codes, whereas many Type III R codes can be quite fast.
This paper focuses on improving the performance of Type I R codes. We propose the ORBIT VM, an extension of the GNU R VM, to perform aggressive removal of allocated objects and reduction of instruction path lengths in the GNU R VM via profile-driven specialization techniques. The ORBIT VM is fully compatible with the R language and is purely based on interpreted execution. It is a specialization JIT and runtime focusing on data representation specialization and operation specialization. For our benchmarks of Type I R codes, ORBIT is able to achieve an average of 3.5X speedup over the current release of the GNU R VM and outperforms most other R optimization projects that are currently available.
Categories and Subject Descriptors D.3.4 [Processors]:
Compilers, Interpreters, Run-time environments
General Terms Languages, Performance
Keywords R, Specialization, Dynamic Scripting Language
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CGO '14, February 15-19, 2014, Orlando, FL, USA.
Copyright © 2014 ACM 978-1-4503-2670-4/14/02 ... $15.00.
http://dx.doi.org/10.1145/2544137.2544153
1. Introduction
In the age of big data, R is a tremendously important language and considered the lingua franca for data analysis [18, 28]. There are more than two million users of R today and the user base is rapidly expanding. The popularity of R is mainly due to the productivity benefits it brings to data analysis. R contributes to programmer productivity in several ways, including the following two: the availability of extensive data analysis packages that can be easily incorporated into an R script, and the interpreted environment that allows for interactive programming and easy debugging.
Like many other interpreted and dynamically typed languages, R suffers from a critical limitation: it is very slow. Figure 1 compares the performance of a set of algorithms [3] implemented in different languages and shows that the GNU R VM, the most widely used R implementation today, is more than two orders of magnitude slower than C and twenty times slower than CPython, the Python interpreter.¹
Figure 1. Slowdown of R on the shootout benchmarks
relative to C and CPython.
1.1 Three R Programming Styles
To better understand the landscape of existing R optimiza-
tion projects, one has to consider the different ways in which
a programmer may use the language. Figure 2 shows three
different R programming styles.
¹ For pidigits, the steep performance gap also comes from algorithm differences. For instance, the Python version uses the built-in big number support whereas the R version uses vectors to represent big numbers.
Listing 1. Type I: Looping Over Data
for (i in 1:10000000) {
  a[i] <- b[i] + c[i];
}
Listing 2. Type II: Vector programming
# a, b, and c are vectors
a = b + c
Listing 3. Type III: Glue codes
# rnorm and fft are native functions
a <- rnorm(2400000);
b <- fft(a)
Figure 2. Three different R programming styles.
[Figure 3 is a two-axis chart. The x-axis is the target program type: Type I (Loop), Type II (Vector), Type III (Library). The y-axis is compatibility with the GNU R VM: non-compatible, compatible, and compatible with the reference implementation. The plotted projects, each marked as JIT-based or no-JIT, are ORBIT, Revolution R, Rapydo (PyPy), the R byte-code interpreter, Riposte, FastR (Java), Renjin (Java), pqR, LLVM R, TERR, and TruffleR (Java).]

Figure 3. Landscape of R optimization projects.
• Type I: Looping over data. As shown in Listing 1, Type I is the most natural programming style to most users but is also the most costly in terms of performance due to the overhead of operating on individual vector elements in R.

• Type II: Vector programming. Like in Matlab, one can write R codes using vector operations, as shown in Listing 2. Since basic vector operations are implemented natively in the R VM, Type II codes are more efficient than Type I, especially when applied to long vectors.

• Type III: Glue codes. In this case R is used as a glue to call different natively implemented libraries. Listing 3 shows such an example, where fft and rnorm are native library routines.
Among the three programming styles, Type I has by far the most serious performance problem. The several orders of magnitude of performance gap shown in Figure 1 refer to Type I R codes. Type II codes may exhibit a large performance gap when applied to short vectors. Type II codes whose execution is dominated by long vectors perform much better, but may suffer from memory subsystem problems due to the loss of temporal locality in long vector codes. The performance of Type III codes depends largely on the quality of the native library implementations.
Many real R codes use a mixture of all three programming styles. In practice, due to the overhead of Type I and II codes, programmers try to follow the Type III style for production R codes. However, according to [19], a significant portion of R scripts still spend most of their execution time in Type I and II codes.
1.2 R Optimization Project Landscape
Improving the performance of R is the focus of this and
many other research projects. Figure 3 summarizes all the
major projects on improving R performance through JIT
compilation or VM-level optimizations. We classify the
projects according to their target R programming styles (x-
axis) and the compatibility with the GNU R VM (y-axis).
The focus of many commercial implementations of R, such as Revolution Analytics [7] and TERR [15], is on facilitating the implementation of Type III codes. Typically, these products do not modify the internals of the Virtual Machine (VM) and are fully compatible with the GNU R VM.
Many projects targeting Type I and II codes build a brand new R VM [5, 14, 16, 17, 24, 26, 30]. This is the most effective way to address some of the most significant performance problems of R, many of which directly result from the inefficiencies in the implementation of the GNU R VM. However, these new implementations are often incompatible with the GNU R VM, creating a significant barrier to adoption by the R user community. This is because the thousands of packages available from public R repositories, such as CRAN [22] and Bioconductor [2], are the most valuable assets of R, and many of these packages depend on internals of the GNU R VM.
The third approach to improving R performance, which is the least commonly used, is to target Type I or II codes by extending the GNU R VM without sacrificing compatibility. Examples are pqR [20] and the byte-code compiler and interpreter [27] incorporated into the GNU R VM since R-2.14.0. Our work, the ORBIT VM, also falls into this category.
1.3 Our Approach: the ORBIT VM
In this work, we aim at improving the performance of Type I codes while maintaining full compatibility with the GNU R VM. There have been many attempts in the past to improve the performance of other scripting languages, such as Python and Ruby, while maintaining compatibility. These attempts have had limited success [8]. We offer a new approach to tackling the problem that combines JIT compilation and runtime techniques in a unique way:
[Figure 4 shows the GNU R VM: the parser turns R source code into an R AST, which either feeds the default interpreter directly or is translated by the byte-code compiler into R byte-code (also loaded from pre-compiled packages) for the byte-code interpreter; both interpreters sit on the shared R runtime environment (local frames, global frames, built-in function objects).]

Figure 4. The GNU R VM.
• Object allocation removal. We found that excessive memory allocation is the root cause of many performance problems of Type I R codes. As reported in [19], R allocates several orders of magnitude more data than C does. This results both in heavy computation overhead to allocate, access, and reclaim data and in excessively large memory footprints. For certain types of programs, such as those that loop over vector elements, more than 95% of the allocated objects can be removed with optimizations. In order to significantly bridge the performance gap between R and other languages, we need to remove most of the object allocations in the GNU R VM, not just some of them.
• Profile-directed specialization. Specialization is the process of converting generic operations and representations into more efficient forms based on context-sensitive information, such as the data type of a variable at a program point.
When dealing with overhead in a generic language runtime such as R's, we believe profile-directed specialization is a more effective technique than using traditional data-flow based approaches. This is because program properties obtained by the latter are often too imprecise to warrant the application of an optimization. For instance, traditional data-flow based allocation removal techniques, such as escape analysis, often cannot achieve our target of removing most allocation operations. Instead, our framework relies heavily on specialization and shifts many tasks of a static compiler to the runtime.
• Interpretation of optimized codes. Our framework operates entirely within interpreted execution. The JIT compiles original type-generic byte-codes into new type-specific byte-codes, and the R byte-code interpreter is extended to interpret these new specialized byte-codes.
We focus on interpretation for two reasons. First, not having to generate native codes in the initial prototype greatly simplifies the implementation without affecting our ability to focus on our objectives: allocation removal and specialization. Second, we would like to explore a new solution space for scripting language runtimes that delivers good performance while preserving the simplicity, portability, and interactivity of an interpreted environment.
We have implemented our framework, ORBIT (Optimized R Byte-code InterpreTer), as an extension to the GNU R VM. On Type I codes, ORBIT achieves an average speedup of 3.5 over the GNU R byte-code interpreter and 13 over the GNU R default interpreter, all without any native code generation. As a proof of concept, the current implementation includes only a few specializations and focuses primarily on scalar, vector, and matrix operations. This leaves much room for improvement when other types of specialization are supported.
2. Background
2.1 GNU R VM
Figure 4 depicts the execution model of the GNU R VM since version R-2.14.0, which includes a parser, two interpreters, and a runtime system that implements the object model, basic operations, and memory management of R.
2.1.1 A Tale of Two Interpreters
The R default interpreter operates on the Abstract Syntax Tree (AST) form of the input R program and is quite slow. Since R version 2.14.0, a stack-based R byte-code interpreter [27] has been available as a better performing alternative to the default interpreter. To enable the byte-code interpreter, the user has to explicitly invoke an interface to compile a region of R code into byte-codes for execution, or set an environment variable to achieve the same goal. Since the two interpreters use the same object model and share most of the R runtime, it is possible to switch between the two interpreters at well-defined boundaries such as function calls.
The byte-code interpreter has a simple ahead-of-time (AOT) compiler that translates ASTs generated by the parser to byte-codes. Figure 5 shows an example byte-code sequence and the corresponding symbol table. The AOT compiler also performs simple peephole optimizations, inlining of internal functions, faster local variable lookup, and specialization of scalar math expressions. For Type I codes, the byte-code interpreter is several times faster than the default interpreter. Starting with R-2.14.0, many basic R packages
are compiled into byte-code during installation and executed
in the byte-code interpreter.
Source code:

for(i in 1:10000000){
  r <- r+i;
}

Symbol table:

Idx  Value
1    :
2    1
3    10000000
4    1:10000000
5    i
6    for(i in ){}
7    r
8    r+i

Byte-code sequence:

PC  Instruction
1   GETBUILTIN, 1
3   PUSHCONSTARG, 2
5   PUSHCONSTARG, 3
7   CALLBUILTIN, 4
9   STARTFOR, 6, 5, (22)
13  GETVAR, 7
15  GETVAR, 5
17  ADD, 8
19  SETVAR, 7
21  POP
22  STEPFOR, (13)
24  ENDFOR

Figure 5. R byte-code and symbol table representation.
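To make the dispatch concrete, here is a minimal Python sketch of a stack-based byte-code loop in the style of Figure 5. The opcode set and operand encoding are illustrative simplifications, not GNU R's actual instruction set.

```python
# Minimal sketch of a stack-based byte-code dispatch loop in the style of
# the R byte-code interpreter. Opcodes and encoding are illustrative.
def interpret(code, symtab, frame):
    stack, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if op == "GETVAR":            # push value bound to a symbol-table name
            stack.append(frame[symtab[code[pc + 1]]]); pc += 2
        elif op == "LDCONST":         # push a constant from the symbol table
            stack.append(symtab[code[pc + 1]]); pc += 2
        elif op == "ADD":             # generic add: pop two operands, push result
            b, a = stack.pop(), stack.pop(); stack.append(a + b); pc += 2
        elif op == "SETVAR":          # bind top of stack in the local frame
            frame[symtab[code[pc + 1]]] = stack[-1]; pc += 2
        elif op == "POP":
            stack.pop(); pc += 1
        else:
            raise ValueError(op)
    return frame

# One iteration of r <- r + i, with r = 1 and i = 2
frame = interpret(["GETVAR", 0, "GETVAR", 1, "ADD", 2, "SETVAR", 0, "POP"],
                  {0: "r", 1: "i", 2: "r+i"}, {"r": 1, "i": 2})
```

As in the real interpreter, SETVAR leaves its value on the stack, so the loop body ends with an explicit POP.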
2.1.2 R Object Model
The GNU R VM defines two basic object representations: SEXPREC and VECTOR_SEXPREC (henceforth referred to as VECTOR for short). As shown in Figure 6, an object has an object header (SEXPREC_HEADER) and a body. The object header is the same for SEXPREC and VECTOR and contains three pieces of information:
• sxpinfo, which encodes the metadata of an object such as data type and object reference count
• attrib, which records object attributes as a linked list
• prev_node and next_node, which link all R objects for the garbage collector
The VECTOR data structure is used to represent vector
and matrix objects in R. The body of the VECTOR records
vector length information and the data stored in the vector.
Scalar values are represented as vectors of length one.
The SEXPREC data structure is used to represent all R data types not represented by VECTOR, such as linked lists, and internal R VM data structures, such as the local frame. The body of SEXPREC contains three pointers to SEXPREC objects: CAR, CDR, and TAG. For instance, a local frame is represented as a linked list of entries where each entry contains pointers to a local variable name, the object assigned to the local variable, and the next entry in the frame.
2.1.3 Memory Management
Memory Allocator  The memory allocator pre-allocates pages of SEXPREC nodes. A request is satisfied simply by taking one free node from a page. The memory allocator also pre-allocates some small VECTOR objects in pages of different sizes to satisfy requests for small vectors. A large vector allocation request is performed through the system malloc.
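The paged allocation scheme can be sketched as follows. The class name, page size, and node representation are hypothetical, chosen only to illustrate how requests are served from a free list of pre-allocated page slots.

```python
# Sketch of a paged small-object allocator in the spirit of the description
# above: nodes are pre-allocated in pages and handed out from a free list.
class PagedAllocator:
    PAGE_SIZE = 64                      # nodes per page (illustrative)

    def __init__(self):
        self.free = []                  # free list of node slots
        self.pages = 0

    def _new_page(self):
        self.pages += 1
        self.free.extend({"page": self.pages, "slot": i}
                         for i in range(self.PAGE_SIZE))

    def alloc(self):
        if not self.free:               # satisfy the request from a fresh page
            self._new_page()
        return self.free.pop()

    def release(self, node):            # the collector returns dead nodes
        self.free.append(node)

alloc = PagedAllocator()
nodes = [alloc.alloc() for _ in range(3)]
```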
Garbage Collector  The R VM performs automatic garbage collection (GC) with a stop-the-world, multi-generational collector. The mark phase traverses all the objects through the link pointers of the object headers. Dead objects are then compacted to free pages. Dead large vectors are freed and returned to the system.

SEXPREC:                          VECTOR_SEXPREC:
  sxpinfo_struct sxpinfo  \        sxpinfo_struct sxpinfo  \
  SEXPREC* attrib          |       SEXPREC* attrib          |  SEXPREC_HEADER
  SEXPREC* prev_node       |       SEXPREC* prev_node       |
  SEXPREC* next_node      /        SEXPREC* next_node      /
  SEXPREC* CAR                     R_len_t length
  SEXPREC* CDR                     R_len_t truelength
  SEXPREC* TAG                     <vector raw data>

Figure 6. Internal representation of R objects.
Copy-on-write  Every named object in R is a value object (i.e., immutable). If we assign a variable to another variable, the behavior specified by the semantics of R is that the value of one variable is copied and this copy is used as the value of the other variable. R implements copy-on-write to reduce the number of copy operations. There is a named tag in the object header with three possible values: 0, 1, and 2. Values 0 and 1 mean that only one variable points to the object (value 1 is used to handle a special intermediate state²). By default the named value is 0. When the variable is assigned to another variable, which means more than one variable points to the same underlying object, the object's named tag is changed to 2.
When an object is to be modified, the named tag is consulted. If the value is 2, the runtime first copies the object, and then modifies the newly copied object. Because the runtime cannot distinguish whether more than one variable points to the object, named remains 2 in the original object.
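The named-tag rule can be sketched in a few lines of Python. RObject, assign, and set_element are hypothetical names standing in for the VM's internal operations.

```python
# Sketch of the named-tag copy-on-write rule described above. A named tag of
# 2 means the object may be shared, so it is copied before mutation; the
# original keeps named == 2 because sharing cannot be disproved.
class RObject:
    def __init__(self, data, named=0):
        self.data = list(data)
        self.named = named

def assign(obj):
    """Bind an existing object to a second variable: mark it shared."""
    obj.named = 2
    return obj

def set_element(obj, i, value):
    """Modify obj[i], copying first if the object may be shared."""
    if obj.named == 2:
        obj = RObject(obj.data)        # fresh, privately owned copy
    obj.data[i] = value
    return obj

a = RObject([1, 2, 3])
b = assign(a)                          # b <- a: both names share one object
b2 = set_element(b, 0, 99)             # first write through b triggers the copy
```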
2.2 Performance of Type I Codes
Type I R programs suffer from many performance problems that they have in common with other interpreted dynamic scripting languages, including:
• Dynamic type checking and dispatch
• Generic data representation
• Expensive name-based variable lookup
• Generic calling convention (e.g., heap-allocated variable-length argument lists)
On the other hand, R also introduces some unique performance issues that are specific to its semantics, such as:
• Missing value (NA) support. A special routine is required in all the mathematics operations.
• Out-of-bound handling. There is no out-of-bound error. Accessing an out-of-bound value just returns an NA value, and assigning an out-of-bound value expands the vector, filling the missing values with NA.
² http://cran.r-project.org/doc/manuals/R-ints.html
Function source:

foo <- function(a) {
  b <- a + 1
}

Symbol table:

Idx  Value
1    a
2    1
3    a+1
4    b

Generic byte-code (original data representation: the stack holds SEXPREC pointers to VECTOR objects):

PC  Instruction
1   GETVAR, 1
3   LDCONST, 2
5   ADD, 3
7   SETVAR, 4
9   INVISIBLE
10  RETURN

If a is a real scalar, the R Opt Engine emits specialized byte-code (specialized data representation: unboxed real scalars live directly on the stack):

PC  Instruction
1   GETREALUNBOX, 1
3   LDCONSTREAL, 2
5   REALADD
6   SETUNBOXVAR, 4

Figure 7. An example of ORBIT specialization.
• No references; assignment copies, and function calls are pass-by-value. This feature is optimized by the copy-on-write support introduced in the previous section.
• Lazy evaluation. In order to force a promise argument, a new interpreting context is created.
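The NA and out-of-bound rules above can be sketched as follows, with NA modeled as Python's None and 1-based indexing as in R; the helper names are illustrative.

```python
# Sketch of two R-specific behaviors: out-of-bound reads yield NA, and
# out-of-bound writes grow the vector, filling the gap with NA.
NA = None

def vec_get(v, i):
    """1-based read; indices past the end return NA rather than erroring."""
    return v[i - 1] if 1 <= i <= len(v) else NA

def vec_set(v, i, value):
    """1-based write; writing past the end extends the vector, NA-filled."""
    if i > len(v):
        v.extend([NA] * (i - len(v)))
    v[i - 1] = value
    return v

v = [1.0, 2.0]
```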
2.2.1 A Motivating Example
For Type I R codes, the performance problems ultimately manifest in the form of long instruction path lengths and excessive memory consumption compared to other languages.
Metrics               Default Interpreter   Byte-code Interpreter
Machine Instructions  26,080M               3,270M
SEXPREC Object        20M                   20
VECTOR Scalar         10M                   10M
VECTOR Non-scalar     1                     2

Table 1. Number of dynamic instructions executed and objects allocated for the example in Figure 5.
Consider the example shown in Figure 5, which accumulates a value over a loop of 10 million iterations. Table 1 shows the number of dynamic machine instructions executed and the number of objects allocated for the loop using R-2.14.1 running on an Intel Xeon processor. On average, each iteration of the accumulation loop takes over 2600 machine instructions if executed in the default interpreter, or 300 machine instructions if executed in the byte-code interpreter. The number of memory allocation requests is also high. For instance, the default interpreter allocates two SEXPREC objects and one VECTOR object for each iteration of the simple loop. The ephemeral objects also put heavy pressure on the garbage collector later.
We found two main causes of excessive memory allocations in this code. First, there are two variable bindings in each iteration: the loop variable i and the new result r. Each binding creates a new SEXPREC object to represent these variables in the local frame. Second, a scalar VECTOR is allocated for the result of the addition (the interpreter does not create a new VECTOR scalar object for the loop index variable). Because all R objects are heap allocated, even a scalar result requires a new heap object to hold it. The byte-code interpreter optimizes the local variable bindings, but a scalar vector is still required for the addition. Furthermore, a very large non-scalar vector, 1:10000000, is allocated to represent the loop space.
3. The ORBIT VM: Optimizing R via
Interpreter-level Specialization
The performance problems of the GNU R VM are caused mainly by the generic implementation of the R runtime and object model. To address these problems, we choose specialization as the main technique for optimization and focus exclusively on reducing the number of allocated objects. Our framework, the ORBIT (Optimized R Byte-code InterpreTer) VM, is an extension of the GNU R VM. ORBIT and the GNU R VM accept exactly the same language.
We start, in Section 3.1, with a small specialization example to explain the key concepts of our approach. Then, in Section 3.2, we explain each module we developed.
3.1 A Specialization Example
Figure 7 shows a small example of specialization. For each function, the R byte-code compiler produces two components: the symbol table and the byte-code. The symbol table records all the variable names, constant values, and expressions in the source code. The byte-code instructions use the symbol table to look up values.
The byte-code instructions are type generic. For example, GETVAR looks up the variable a in the local frame and pushes a pointer to it onto the stack. LDCONST first duplicates (creating a new VECTOR) the value in the constant table, then pushes the pointer to the newly created vector onto the stack. ADD checks the types of the two operands at the top of the stack, and dynamically chooses the addition routine according to the types of the two operands. A new VECTOR is created during the addition to hold the result, and a pointer to it is pushed onto the stack.
In order to perform the specialization, we need to know the type of a. Our R optimization engine starts with runtime type profiling. Then, it uses the profiled type to do a fast type inference. In this example, the type of the constant is known statically as real. If the type of a is profiled as real, too, the
compiler will generate specialized code assuming that ADD operates on real values. Furthermore, the compiler uses a specialized data representation, an unboxed real scalar in this case, to represent the values. The right-hand side of Figure 7 is the specialized result. The compiler makes use of a new class of byte-code instructions and a new data format for the specialization. The specialized byte-code does not require the dynamic type checking and dispatching. The specialized data representation saves the copy of the constant value and the new heap object to store the result.
However, the type of the variable a may not be real the next time this code segment is executed. To handle this, the compiler adds a GUARD check in the instruction GETREALUNBOX. A guard failure translates the specialized data format into the original generic data representation, and rolls back to the original type-generic byte-code.
This simple example illustrates the main characteristics of our approach, including:
• Runtime type profiling and fast type inference
• Specialized byte-codes and runtime function routines
• Specialized data representation
• Redundant memory allocation removal
• Guards to handle incorrect type speculation
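A minimal sketch of the guard mechanism, with hypothetical names; a real guard failure would also translate the specialized data format back to the generic representation, which is elided here.

```python
# Sketch of a guarded, type-specialized fast path: the specialized code
# assumes an unboxed real operand; a guard failure falls back to the
# original type-generic execution.
class GuardFailure(Exception):
    pass

def getrealunbox(frame, name):
    v = frame[name]
    if not isinstance(v, float):       # GUARD: type speculation was wrong
        raise GuardFailure(name)
    return v                           # unboxed fast path

def run_specialized(frame):
    # GETREALUNBOX a; LDCONSTREAL 1.0; REALADD
    return getrealunbox(frame, "a") + 1.0

def run_generic(frame):
    return frame["a"] + 1              # type-generic fallback

def run(frame):
    try:
        return run_specialized(frame)
    except GuardFailure:               # roll back to generic byte-code
        return run_generic(frame)
```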
3.2 ORBIT Components
[Figure 8 shows the ORBIT architecture: the original R byte-code compiler feeds byte-code to the R byte-code interpreter, which is extended with a runtime profiling component, a specialized byte-code execution extension, and code selection with guard-failure roll back. A new R Opt byte-code compiler consumes the runtime feedback and produces specialized byte-code; native code generation is left as future work.]

Figure 8. The ORBIT VM.
We implemented our approach as an extension to the current R byte-code interpreter. Figure 8 shows a diagram of the architecture. Our extension works within the byte-code interpreter. It does lightweight type profiling the first time the code is executed. The second time the function is executed, ORBIT compiles the original byte-codes into specialized byte-codes with guards. Specialized byte-codes use our data representation and are interpreted in a more efficient way. If a guard detects a type speculation failure, the interpreter rolls back to the original data format and byte-code sequence, and uses the meet of the types as a new profile type.
3.2.1 Runtime Type Profiling
Although type inference can be done without runtime type profiling, pure static type inference is complex and not sufficiently precise, especially in the presence of dynamic attributes like those of R. By instrumenting only a few instructions, we simplify the type inference and get a more precise result. The extension instruments 1) load instructions, such as GETVAR; 2) function call instructions, such as CALL and CALLBUILTIN; and 3) SUBSET and SUBASSIGN related instructions.
The profiler records the type of the result after an instrumented instruction is interpreted. The type information of a generic R object is stored in three places: the type in the header; attrib, also in the header, to specify the number of dimensions (there is no dim attribute if the number of dimensions is one, i.e., in the case of a vector); and length in the body section to specify the vector length of a VECTOR. Our profiler checks all these attributes and combines them into a type (see next section). If one instruction is profiled several times, the final type is the meet of all the types profiled. Because of the R object structure, type profiling is more complex than in other dynamic languages. By carefully designing the profiling component, we keep the overhead of profiling typically below 10%.
3.2.2 Type Inference
We defined a simple type lattice for the type inference, shown in Figure 9. All vector types have two components: the base type (logical, integer, real, etc.) and the length.

[Figure 9 shows the lattice: Undefined (top) above the Logical, Integer, and Real Vector types, which sit above Number Vector; Matrix and List are separate types; Generic Object is the bottom.]

Figure 9. The type system of ORBIT.
The initial type information comes from type profiling or from constant objects' types. The initial type of a stack operand generated by a profiled instruction is set to the profiled type. If there is no profiling result (the path was not executed during the profiling run), the type is set to the bottom type, generic R object. The initial type of a stack operand generated by a load constant instruction is set exactly to the constant's type. All other types of stack operands and local frame variables are set to the undefined (top) type.
The type inference we use is a standard data-flow based algorithm. The algorithm follows interpretation order and uses each instruction's semantics to compute types until all the types are stable.
We mark all the types that rely on profiling as speculated types. All the specialized instructions that use speculated types contain a guard to do the check.
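One plausible reading of this lattice's meet operation can be sketched as follows, using an illustrative encoding; the real implementation also tracks vector lengths and attributes.

```python
# Sketch of the meet over the Figure 9 lattice: Undefined (top) above the
# Logical/Integer/Real Vector types, which sit above Number Vector; Matrix
# and List are separate; Generic Object is the bottom.
TOP, LOGICAL, INTEGER, REAL, NUMBER, MATRIX, LIST, GENERIC = range(8)
NUMERIC = {LOGICAL, INTEGER, REAL, NUMBER}

def meet(a, b):
    if a == b or b == TOP:
        return a
    if a == TOP:
        return b
    if a in NUMERIC and b in NUMERIC:  # distinct numeric types meet at Number
        return NUMBER
    return GENERIC                     # anything else meets at the bottom
```

Profiling uses this operation directly: when an instruction is profiled several times, its recorded type is the meet of all observed types, degrading gracefully toward the generic object type.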
3.2.3 Object Representation Specialization
In order to efficiently represent a typed R object, we use two specialized data structures: 1) a specialized stack, and 2) an unboxed value cache to hold values in the current local frame.

[Figure 10 shows the VM stack holding a mix of SEXPREC pointers and unboxed int and real scalars, with the type stack recording, for each unboxed value, its address on the VM stack and its type.]

Figure 10. The VM stack and type stack in ORBIT.
Stack with boxed and unboxed values  The original R interpreter always treats a vector as a VECTOR object, even if it is a scalar. In our specialized R runtime, we represent scalar numbers (boolean, integer, and real) as unboxed values, and store them directly on the VM stack. As the VM stack can store both object pointers and unboxed values, we use another data structure called the type stack to track all the unboxed values on the VM stack, as illustrated in Figure 10. Each element in the type stack has two fields: the address of the unboxed value on the VM stack, and the type of that value. Type-specialized byte-code instructions operate on unboxed values on the VM stack as well as update the records in the type stack.
The type stack is used for two purposes. First, during a garbage collection process, the marker uses it to ignore unboxed values stored on the VM stack. Second, on a guard failure roll back, the guard failure handler uses it to restore the VM stack to the type-generic form.
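A minimal sketch of the paired stacks; the names are illustrative.

```python
# Sketch of the paired VM stack / type stack: unboxed values live directly
# on the VM stack while the type stack records their address (stack index
# here) and type, so the GC marker and the guard-failure handler can tell
# them apart from boxed SEXPREC pointers.
class SpecializedStack:
    def __init__(self):
        self.vm = []                   # boxed pointers and raw unboxed values
        self.types = []                # (vm-stack index, type) per unboxed slot

    def push_boxed(self, obj):
        self.vm.append(obj)

    def push_unboxed(self, value, ty):
        self.types.append((len(self.vm), ty))
        self.vm.append(value)

    def unboxed_slots(self):
        """Slots the GC marker skips and the roll-back handler re-boxes."""
        return {i for i, _ in self.types}

s = SpecializedStack()
s.push_boxed("SEXPREC*")               # stands in for a boxed object pointer
s.push_unboxed(3.14, "real")
```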
Unboxed value cache In the R byte-code interpreter, the
values of local variables are stored in the local frame (a
linked list). Load and store operations traverse the linked list
and read or modify the binding cells. When storing a new
variable, the interpreter must create and insert a new binding
cell SEXPREC object. If the object value can be expressed as
an unboxed value, ORBIT optimizes the load or the store by
making use of an unboxed value cache that avoids the need
to do a traversal of the frame linked list as well as create a
new binding cell object for store operation.
This cache is only used to store the value of local frame
variables. Each cache entry is used for one local frame vari-
able. We use the index in the symbol table to locate the cache
entry. Each cache entry has three elds, the unboxed value
to store unboxed value, the type of the value, and the state of
the cache entry. There are three cache entry states, INVALID,
VALID, MODIFIED. The initial state is INVALID. A load
scalar value instruction checks the cache entrys state. If it is
INVALID, the instruction loads the original object, unbox it,
store it into the value cache and set the entry as VALID state.
The instruction then pushes the unboxed value on top of the
VM stack. If the entry is VALID or MODIFIED, it directly
pushes the unboxed value on top of the VM stack. A store
scalar value instruction directly modies the unboxed value
in the cache entry and sets the entry state as MODIFIED.
Because there are two places to store one value, the value cache and the local frame, we use a write-back mechanism for synchronization. The write-back process creates a new R generic object and binds it back to the local frame. A write back can be either global or local. A global write back happens when the control flow leaves the current context, such as at a non-built-in function call; it writes back all the modified entries in the cache. R's semantics allow a callee to access the frames of its caller, and the global write back ensures that the callee sees the latest values in the caller. A local value write back is performed before a non-specialized load variable instruction. Instructions of this type still access the local frame's linked list, so the local write back redefines only the variable accessed by the instruction. Compared with the original local frame operations, the value cache always works on unboxed values, saving the numerous operations needed for traversing the linked list and creating new R objects.
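The state machine above can be sketched as follows. This is a minimal Python model of the mechanism, not ORBIT's actual implementation; all names (CacheEntry, cache_load, and so on) are our own illustrative choices.

```python
# States of an unboxed value cache entry, as described above.
INVALID, VALID, MODIFIED = "INVALID", "VALID", "MODIFIED"

class CacheEntry:
    """One entry per local frame variable: unboxed value, type tag, state."""
    def __init__(self):
        self.value = None
        self.type = None
        self.state = INVALID

def cache_load(entry, frame, name):
    """Load-scalar: on first use, unbox from the local frame into the cache."""
    if entry.state == INVALID:
        entry.value = frame[name]      # load the original object and unbox it
        entry.type = type(entry.value)
        entry.state = VALID
    return entry.value                 # value is pushed onto the VM stack

def cache_store(entry, value):
    """Store-scalar: update only the cache; the frame copy becomes stale."""
    entry.value = value
    entry.type = type(value)
    entry.state = MODIFIED

def write_back(entry, frame, name):
    """Re-create a boxed object and rebind it in the local frame."""
    if entry.state == MODIFIED:
        frame[name] = entry.value      # models creating a new R object
        entry.state = VALID

frame = {"x": 3.0}                     # models the local frame binding
e = CacheEntry()
assert cache_load(e, frame, "x") == 3.0  # first load unboxes into the cache
cache_store(e, 7.0)                      # store touches only the cache
assert frame["x"] == 3.0                 # frame not yet synchronized
write_back(e, frame, "x")                # e.g. global write back before a call
assert frame["x"] == 7.0
```

The key point the sketch illustrates is that repeated loads and stores between write backs never touch the frame's linked list or allocate boxed objects.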
3.2.4 Operation Specialization
We added many new type-specialized byte-code instructions to the interpreter. Using the result of type inference, ORBIT translates the original generic instructions into type-specialized instructions. These new instructions are interpreted more efficiently. Furthermore, thanks to the object representation specialization, many of the specialized operations only need to interact with the new data representation, which is faster and requires less memory allocation.
Load and Store of scalar values Based on the type inference result, a scalar object's load instruction simply looks up the variable in the local frame (and, if not found, in the parent frames), unboxes it, and puts it into the unboxed value cache and onto the top of the VM stack. A load constant instruction puts the unboxed value onto the top of the VM stack. A store instruction only needs to update the value in the unboxed value cache.
Mathematics Operations Mathematics operations involving scalars are applied to unboxed values, and the result is also an unboxed value on the VM stack. Mathematics operations involving vectors use the type information to perform a direct dispatch, saving runtime checks.
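The difference between generic dispatch and direct dispatch can be sketched as below. This is an illustrative Python model under our own naming, not ORBIT code: the specialized path assumes type inference has already proved both operands are double vectors, so the runtime check is elided.

```python
def add_double_vectors(a, b):
    """The type-specific kernel: element-wise addition of double vectors."""
    return [x + y for x, y in zip(a, b)]

def generic_add(a, b):
    """Generic path: a runtime type check precedes kernel selection."""
    if all(isinstance(x, float) for x in a + b):
        return add_double_vectors(a, b)
    raise TypeError("unsupported operand types")

def specialized_real_vec_add(a, b):
    """Specialized path: inferred types allow calling the kernel directly."""
    return add_double_vectors(a, b)

assert generic_add([1.0, 2.0], [3.0, 4.0]) == [4.0, 6.0]
assert specialized_real_vec_add([1.0, 2.0], [3.0, 4.0]) == [4.0, 6.0]
```

Both paths compute the same result; the specialized instruction simply skips the per-execution type test.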
Conditional Control Flow Operations Most conditional control operations use a scalar value to determine the direction of branching. In ORBIT, they can use the unboxed value on the top of the VM stack to control the branch.
Built-in Function Calls Some built-in functions only use scalar values as arguments, such as the docolon (:) function, which accepts the range limits as arguments. In the byte-code interpreter, docolon always creates a linked list to store the arguments. By leveraging the object representation specialization, we simplify the calling convention and pass unboxed scalar values as arguments. Furthermore, if the result of a docolon call is used only to specify an iteration space, ORBIT elides the docolon call and uses the unboxed start and end values directly.
For-loop The for-loop can benefit from the known type of the loop variable. If the loop variable is a numeric vector, we store it in the unboxed value cache and delay the store to the local frame until later. If the loop index variable is the result of docolon, e.g., for(i in 1:1000), a specialized for-loop byte-code sequence is generated that directly uses the lower and upper bounds of the loop.
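The effect of this for-loop specialization can be sketched as follows. This is an illustrative Python model under assumed names, not ORBIT code: the generic path materializes the index vector that docolon would allocate, while the specialized path drives the loop with the unboxed bounds alone.

```python
def generic_for_sum(lo, hi):
    """Generic path: docolon first allocates the full index vector."""
    index_vector = list(range(lo, hi + 1))  # models allocating 1:n
    total = 0
    for i in index_vector:
        total += i
    return total

def specialized_for_sum(lo, hi):
    """Specialized path: iterate with unboxed lower/upper bounds directly."""
    total = 0
    i = lo
    while i <= hi:
        total += i
        i += 1
    return total

assert generic_for_sum(1, 1000) == specialized_for_sum(1, 1000) == 500500
```

For a loop like for(i in 1:1000), the specialized sequence thus avoids allocating a 1000-element vector whose only purpose is to supply the iteration space.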
SubSet and SubAssign SubSet and SubAssign are similar to built-in function calls. The default calling convention needs to create an argument list to store the value and the index. Using the type information and unboxed values, the new calling convention passes the arguments unboxed.
3.2.5 Guard and Guard Failure Handling
All the specializations depend on the type information of the stack operands and local variables. However, many types are inferred from profiled types, and the speculated type may be wrong in subsequent executions of the code segment. Suppose the variable a in Figure 7 is an integer vector in another function invocation; the specialized instructions GETREALUNBOX, REALADD, and SETUNBOXVAR cannot handle the new a. A guard is used to handle this situation.
Because runtime type checks are only needed after a load, a function call, or a subset instruction, a guard check is appended to the specialized instruction. The types of the operands on the top of the stack are compared with the types specified in the specialized instruction that follows. If the types do not match, a guard failure is triggered.
During guard failure handling, the VM stack is restored to its generic form: all the unboxed values in the VM stack are boxed using the records in the type stack, and the unboxed value cache is globally written back. Finally, the interpreter switches back to the original generic byte-code sequence, and the new type of the object is recorded for future type inference.
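The guard-and-deoptimize cycle can be sketched as follows. This is a minimal Python model with our own illustrative names (GuardFailure, guarded_real_add), not ORBIT's implementation; the type stack plays the role of the per-operand type records described above.

```python
class GuardFailure(Exception):
    """Raised when operand types no longer match the speculated types."""
    pass

def guarded_real_add(stack, type_stack):
    """A REALADD-like step: check the guard, then add two unboxed doubles."""
    if type_stack[-1] is not float or type_stack[-2] is not float:
        raise GuardFailure("operand types changed since profiling")
    b, a = stack.pop(), stack.pop()
    type_stack.pop(); type_stack.pop()
    stack.append(a + b)
    type_stack.append(float)

def run_add(stack, type_stack):
    """Try the specialized path; on guard failure, fall back to generic code."""
    try:
        guarded_real_add(stack, type_stack)
        return "specialized"
    except GuardFailure:
        # Deoptimize: box stack values using the type stack, write back the
        # unboxed value cache, and resume the generic byte-code sequence.
        return "generic"

assert run_add([1.5, 2.5], [float, float]) == "specialized"
assert run_add([1, 2], [int, int]) == "generic"  # speculation failed
```

In the real VM the fallback path also records the newly observed type so that later type inference can avoid re-specializing on a wrong speculation.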
4. Evaluation
We evaluated the effectiveness of ORBIT by measuring the running time and the number of memory object allocations, and compared these values to their counterparts in the GNU R default interpreter and byte-code interpreter.
4.1 Evaluation Environment and Methodology
The evaluation was performed on a machine with an Intel Xeon E31245 processor and 8 GB of memory. We disabled turbo boost to fix the CPU frequency at 3.3 GHz. The operating system is Fedora Core 16. We used GCC 4.6.3 to compile both the GNU R interpreter and our ORBIT. All the R default packages are pre-compiled into byte-code, as in the default installation configuration.
Because the current implementation of ORBIT targets Type I codes, we focus on comparing the running time and memory allocations between ORBIT and the GNU R interpreters on Type I codes. In order to measure the maximum performance improvement at the steady state, we measured the running time of ORBIT only after it reaches a stable state, not counting the overhead of profiling and compiling. We report the overhead separately in Section 4.5. All the execution times reported are the average of five runs. The number of memory allocation requests was collected inside ORBIT, which is instrumented to profile memory allocations.
4.2 Micro Benchmark
The first benchmark suite we measured is a collection of micro benchmarks: CRT (Chinese Remainder problems), Fib (Fibonacci numbers), Primes (finding prime numbers), Sum (loop-based accumulation), and GCD (greatest common divisor). Although these benchmarks mostly operate on scalars and vectors, they cover a variety of control constructs such as conditional branches, loops, function invocations, and recursion, which are common across R applications. The R byte-code compiler [27] uses similar benchmarks to measure performance improvements.
Figure 11. Speedups on the scalar benchmarks.

                                      CRT    Fib  Primes    Sum    GCD  Geo Mean
Speedup over default interpreter     7.67  25.60   28.21  27.67  23.14     20.41
Speedup over byte-code interpreter   2.28   2.79    5.83   3.01   5.14      3.56
Figure 11 shows the speedups of ORBIT over the default interpreter and the byte-code interpreter. ORBIT is more than 20X faster than the default interpreter. The R byte-code interpreter is very good at this type of benchmark because it removes a significant amount of the interpreting overhead. With its additional optimizations focused on memory allocation reduction, ORBIT achieves a further 3.56X speedup over the R byte-code interpreter.
Table 2 shows the percentage of allocated memory in the bytecode interpreter that is removed by ORBIT. For instance, ORBIT is able to remove between 80% and 99%
Table 2. Percentage of memory allocation reduced for
scalar.
Benchmark SEXPREC VECTOR scalar VECTOR non-scalar
CRT 76.06% 82.83% 97.58%
Fib 99.16% 99.99% 100%
Primes 98.21% 94.70% 50.00%
Sum 15.00% 99.99% 100%
GCD 99.99% 99.99% 25.00%
of scalar objects (labeled as VECTOR scalar in Table 2)
allocated by the bytecode interpreter.
4.3 The shootout Benchmarks
The shootout benchmarks are frequently used in computer science research to compare the implementations of different languages and runtimes. The project reported in [19] uses them to study the behavior of the GNU R implementation. We ported six of the eleven benchmarks to R. The benchmarks use a variety of data structures, including scalars, vectors, matrices, and lists, which cover most of the data structures in R. The omitted benchmarks either rely on multi-threading, which R does not support, or operate heavily on characters, which is not a typical use of R.
Figure 12. Speedups on the shootout benchmarks.

                                     nbody  fannkuch-redux  spectral-norm  mandelbrot  pidigits  binary-trees  Geo Mean
Speedup over default interpreter     11.41           22.90           9.79       37.09     25.65          2.49     13.50
Speedup over byte-code interpreter    4.29            5.90           2.24        5.05      6.28          1.37      3.68
As shown in Figure 12, ORBIT achieves a significant speedup over the GNU R VM except for binary-trees. The binary-trees benchmark is dominated by recursive function call overheads, so its performance is heavily dependent on the efficiency of the calling convention. Since our current implementation does not optimize the user-level function calling convention, the improvements on binary-trees are relatively low.
Table 3. Percentage of memory allocation reduced for
shootout.
Benchmark SEXPREC VECTOR scalar VECTOR non-scalar
nbody 85.47% 86.82% 69.02%
fannkuch-redux 99.99% 99.30% 71.98%
spectral-norm 43.05% 91.46% 99.46%
mandelbrot 99.95% 99.99% 99.99%
pidigits 96.89% 98.37% 95.13%
binary-trees 36.32% 67.14% 0.00%
As shown in Table 3, ORBIT removes a significant number of the bytecode interpreter's memory allocations, especially for VECTOR scalar objects. Table 4 shows detailed metrics taken during the execution of fannkuch-redux. Besides the reduction in the number of memory allocations, GC time is also reduced by 95%.
Table 4. Runtime measurements of fannkuch-redux
Metrics byte-code interpreter ORBIT
Machine instructions 1,526M 263M
GC time(ms) 12.06 0.57
SEXPREC object 2,477,740 239,468
VECTOR scalar 2,878,561 20,182
VECTOR non-scalar 854,588 81
Table 5 shows the speedups of ORBIT and other R performance projects over the default interpreter on the shootout benchmarks. Here NA means that the benchmark did not execute successfully. Most NA entries come from projects that develop a brand new VM that is not fully compatible with the GNU R VM. Overall, ORBIT achieves the highest speedup while remaining fully compatible with the GNU R VM.
4.4 Other Types of Benchmarks
We also evaluated ORBIT on the ATT benchmarks [1], the Riposte benchmarks [26], and the benchmarks used in [16]. For Type I dominated codes in these suites, ORBIT achieves good speedups, similar to what we reported for the scalar and shootout benchmarks. But it only gets small improvements (as low as 15% faster) on Type II codes and nearly no improvement on Type III codes. This is expected, as the current implementation of ORBIT focuses exclusively on Type I codes. Further optimizations for Type II codes are left for future work. We do not intend to address Type III code performance, since compilation at the R level will not help Type III codes.
4.5 Profiling and Compilation Overhead
The overhead of ORBIT comes from two sources. The first is runtime type profiling. Because we only profile a subset of the instructions (Section 3.2.1), the profiling overhead depends on the fraction of such instructions. Based on our measurements, the overhead is less than 10% in most cases; for example, the overhead on the shootout benchmarks is less than 8%. Considering the large potential speedup, this overhead is acceptable.
The second source of overhead is the JIT time. It depends only on the size of the benchmark code (the benchmark itself and all the package code it invokes). The JIT time in ORBIT is very small, ranging from 2-5 ms for the scalar benchmarks and 10-30 ms for the shootout benchmarks. ORBIT's JIT is fast compared to other JITs (e.g., Java's) because it focuses on specialization rather than data-flow analysis and does not generate native code. The JIT time is negligible, since the running time of the benchmarks in the bytecode interpreter ranges from seconds to minutes.
Table 5. Speedups of ORBIT and other R projects on shootout over the default R interpreter.
nbody fannkuch-redux spectral-norm mandelbrot pidigits binary-trees Geomean
Byte-code Interpreter 2.66 3.88 4.37 7.34 4.08 1.82 3.67
Riposte NA 6.86 14.46 13.68 NA 6.85 NA
Renjin 0.90 1.00 0.85 0.95 0.93 0.21 NA
FastR NA 8.28 NA 9.88 7.03 2.49 NA
pqR+default interpreter 1.83 2.26 2.54 2.49 2.35 1.69 2.17
pqR+bytecode interpreter 3.06 4.84 5.29 8.00 4.51 1.84 4.16
TERR 1.00 1.36 1.10 1.48 1.41 0.65 1.12
ORBIT 11.41 22.90 9.79 37.09 25.65 2.49 13.50
5. Related Work
5.1 R Optimization
We have already summarized the different approaches to optimizing R VM performance in Figure 3 and in the introduction. We have also evaluated the performance of some related projects that are publicly available on our Type I benchmarks in Table 5. In this section, we briefly describe some of the related projects.
Many R projects take the approach of completely rewriting the VM, including its object model, the interpreter, and the runtime. Several were implemented in Java, such as fastR [16], Renjin [5], and TruffleR [30]. The performance of fastR and Renjin on Type I codes is reported in Table 5. TruffleR is not yet publicly available, but results reported at [4] showed impressive speedups for Type I and II codes. Riposte [26] is another new R VM, targeting Type II codes and focusing on vector operation fusion, parallelism, and lazy evaluation. Rapydo [14] implements a new R VM based on the PyPy framework. Rllvm [17] translates a dialect of a subset of R constructs with type annotations into LLVM IL.
Enterprise R [7] targets Type III codes via high-performance mathematics packages. TERR [15] takes a similar approach but can improve both Type II and III codes.
The R byte-code interpreter [27] is the first major attempt by the R core team to improve R performance. However, it does not change the memory representation of R, and it still suffers from excessive memory allocation pressure. pqR [20] is another improvement of GNU R that includes multi-threading for Type II codes.
5.2 Specialization in Dynamic Languages
SELF [9, 10] is an early pioneer in the specialization of dynamic languages. It uses statically inferred type information to generate several versions of the code and inserts runtime checks for type speculation failures. Many other optimization techniques used in current dynamic languages can be traced back to SELF. MaJIC [6] describes a compilation framework for MATLAB that performs type specialization. [29] introduces a new dynamic interpretation strategy based on type specialization, in which the interpreter chooses among execution paths according to the actual runtime types.
There is much recent research on JavaScript specialization. TraceMonkey [12] uses traces of hot loops to perform type specialization and generates efficient native code. Hackett et al. [13] use a hybrid typing algorithm that combines static type inference and dynamic type updates to drive type specialization and improve native code generation. Costa et al. [25] provide a JIT value specialization scheme that generates specialized code according to the types of a function's parameters.
5.3 General Optimization in Dynamic Languages
JIT Native Code Generation JIT native code generation is a common approach to improving the running time of dynamic languages. A partial list includes MATLAB [6], Lua [21], Python [23], JavaScript [12, 13, 25], and R [14, 26]. But directly generating native code from a type-generic byte-code or data representation yields only a moderate benefit [8]; specialization is very important for dynamic languages.
Source to Source Translation Another approach to improving the performance of a dynamic language is translating it into a low-level static language, for example PHP's HipHop [31] and MATLAB's FALCON [11]. As with native code generation, the final performance still depends heavily on specialization.
6. Conclusion and Future Work
This paper has described ORBIT, an extension of the GNU R VM that improves R performance via interpreter-level profile-driven specialization. Without native code generation, ORBIT achieves up to a 3.5x speedup over the byte-code interpreter on a set of Type I R codes.
For future work, we would like to explore new optimizations to reduce the overhead of R's calling convention; this will benefit both Type I and II codes. We would also like to investigate optimizations for Type II codes.
Acknowledgments
We thank Luke Tierney for his suggestions and help with our work, and Jan Vitek for his help in collecting the benchmarks. We also thank the anonymous reviewers and the attendees of the DALI workshop [4] for their valuable comments. This material is based upon work supported by the National Science Foundation under Award CNS 1111407.
References
[1] R benchmarks, 2008. http://r.research.att.com/benchmarks/.
[2] Bioconductor: Open source software for Bioinformatics,
2013. http://www.bioconductor.org/.
[3] The computer language benchmarks game (CLBG), 2013.
http://benchmarksgame.alioth.debian.org/.
[4] NSF DALI workshop on dynamic languages for scalable data
analytics, 2013. http://www.ws13.dynali.org.
[5] Renjin: The R programming language on the JVM, 2013.
http://www.renjin.org/.
[6] George Almási and David Padua. MaJIC: compiling MATLAB for speed and responsiveness. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, PLDI '02, pages 294–303, 2002.
[7] Revolution Analytics. Revolution R enterprise, 2013.
http://www.revolutionanalytics.com/products/revolution-
enterprise.php.
[8] José Castaños, David Edelsohn, Kazuaki Ishizaki, Priya Nagpurkar, Toshio Nakatani, Takeshi Ogasawara, and Peng Wu. On the benefits and pitfalls of extending a statically typed language JIT compiler for dynamic scripting languages. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, pages 195–212. ACM, 2012.
[9] C. Chambers and D. Ungar. Customization: optimizing compiler technology for SELF, a dynamically-typed object-oriented programming language. In Proceedings of the ACM SIGPLAN 1989 Conference on Programming Language Design and Implementation, PLDI '89, pages 146–160, 1989.
[10] C. Chambers, D. Ungar, and E. Lee. An efficient implementation of SELF, a dynamically-typed object-oriented language based on prototypes. In Conference Proceedings on Object-Oriented Programming Systems, Languages and Applications, OOPSLA '89, pages 49–70, 1989.
[11] Luiz De Rose and David Padua. Techniques for the translation of MATLAB programs into Fortran 90. ACM Trans. Program. Lang. Syst., 21(2):286–323, March 1999.
[12] Andreas Gal, Brendan Eich, Mike Shaver, David Anderson, David Mandelin, Mohammad R. Haghighat, Blake Kaplan, Graydon Hoare, Boris Zbarsky, Jason Orendorff, Jesse Ruderman, Edwin W. Smith, Rick Reitmaier, Michael Bebenita, Mason Chang, and Michael Franz. Trace-based just-in-time type specialization for dynamic languages. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '09, pages 465–478. ACM, 2009.
[13] Brian Hackett and Shu-yu Guo. Fast and precise hybrid type inference for JavaScript. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12, pages 239–250, 2012.
[14] Sven Hager. Implementing the R language using RPython. Master's thesis, Institut für Informatik, Softwaretechnik und Programmiersprachen, Universität Düsseldorf, 2012.
[15] TIBCO Software Inc. TIBCO enterprise runtime for
R, 2013. http://tap.tibco.com/storefront/trialware/tibco-
enterprise-runtime-for-r/prod15307.html.
[16] Tomas Kalibera, Floréal Morandat, Petr Maj, and Jan Vitek. FastR, 2013. https://github.com/allr/fastr.
[17] Duncan Temple Lang. The Rllvm package, 2013.
http://www.omegahat.org/Rllvm/.
[18] Andrew McAfee and Erik Brynjolfsson. Big data: The man-
agement revolution. Harvard Business Review, October 2012.
[19] Floréal Morandat, Brandon Hill, Leo Osvald, and Jan Vitek. Evaluating the design of the R language: objects and functions for data analysis. In Proceedings of the 26th European Conference on Object-Oriented Programming, ECOOP '12, pages 104–131, 2012.
[20] Radford M. Neal. pqR - a pretty quick version of R, 2013.
http://radfordneal.github.io/pqR/.
[21] Mike Pall. The LuaJIT project, 2013. http://luajit.org/.
[22] R project. CRAN: The comprehensive r archive network,
2013. http://cran.r-project.org/.
[23] Armin Rigo and Samuele Pedroni. PyPy's approach to virtual machine construction. In Companion to the 21st ACM SIGPLAN Symposium on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA '06, pages 944–953, 2006.
[24] Andrew Runnalls. CXXR: Refactorising R into C++, 2011.
http://www.cs.kent.ac.uk/projects/cxxr/.
[25] Henrique Nazare Santos, Pericles Alves, Igor Costa, and Fernando Magno Quintão Pereira. Just-in-time value specialization. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), CGO '13, pages 1–11. IEEE Computer Society, 2013.
[26] Justin Talbot, Zachary DeVito, and Pat Hanrahan. Riposte: a trace-driven compiler and parallel VM for vector code in R. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 43–52, 2012.
[27] Luke Tierney. Compiling R: A preliminary report. In Pro-
ceedings of the 2nd International Workshop on Distributed
Statistical Computing, DSC2001, March 2001.
[28] Ashlee Vance. Data analysts captivated by Rs power. New
York Times, January 2009.
[29] Kevin Williams, Jason McCandless, and David Gregg. Dynamic interpretation for dynamic scripting languages. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '10, pages 278–287, 2010.
[30] Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. One VM to rule them all. In Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, Onward! '13, pages 187–204, New York, NY, USA, 2013. ACM.
[31] Haiping Zhao, Iain Proctor, Minghui Yang, Xin Qi, Mark Williams, Qi Gao, Guilherme Ottoni, Andrew Paroski, Scott MacVicar, Jason Evans, and Stephen Tu. The HipHop compiler for PHP. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, pages 575–586, 2012.