LLVA: A Low-Level Virtual Instruction Set Architecture
Vikram Adve, Chris Lattner, Michael Brukman, Anand Shukla, Brian Gaeke
Computer Science Department
University of Illinois at Urbana-Champaign
{vadve,lattner,brukman,ashukla,gaeke}@cs.uiuc.edu
system. The type system is very simple, consisting of primitive types with predefined sizes (ubyte, uint, float, double, etc.) and four derived types (pointer, array, structure, and function). We chose this small set of derived types for two reasons. First, we believe that most high-level language data types are eventually represented using some combination of these low-level types; e.g., a C++ class with base classes and virtual functions is usually represented as a nested structure type with data fields and a pointer to a constant array of function pointers. Second, standard language-independent optimizations use only some subset of these types (if any), including optimizations that require array dependence analysis, pointer analysis (even field-sensitive algorithms [16]), and call graph construction.

All instructions in the V-ISA have strict type rules, and most are overloaded by type (e.g., 'add int %X, %Y' vs. 'add float %A, %B'). There are no mixed-type operations and hence no implicit type coercion. An explicit cast instruction is the sole mechanism to convert a register value from one type to another (e.g., integer to floating point or integer to pointer).

The most important purpose of the type system, however, is to enable typed memory access. LLVA achieves this via type-safe pointer arithmetic using the getelementptr instruction. This enables pointer arithmetic to be expressed directly in LLVA without exposing implementation details, such as pointer size or endianness. To do this, offsets are specified in terms of abstract type properties (field number for a structure and element index for an array).

In the example, the %tmp.1 getelementptr instruction calculates the address of T[0].Children[3], using the symbolic indexes 0, 1, and 3. The "1" index is a result of numbering the fields in the structure. On systems with 32-bit and 64-bit pointers, the offset from the %T pointer would be 20 bytes and 32 bytes respectively.

3.2. Representation Portability

As noted in Section 2, a key design goal for a V-ISA is to maintain object code portability across a family of processor implementations. LLVA is broadly aimed to support general-purpose uniprocessors (Section 3.6 discusses some possible extensions). Therefore, it is designed to abstract away implementation details in such processors, including the number and types of registers, pointer size, endianness, stack frame layout, and machine-level calling conventions.

The stack frame layout is abstracted by using an explicit alloca instruction to allocate stack space and return a (typed) pointer to it, making all stack operations explicit. As an example, the V variable in Figure 2(a) is allocated on the stack (instead of in a virtual register) because its address is taken for passing to Sum3rdChildren. In practice, the translator preallocates all fixed-size alloca objects in the function's stack frame at compile time.

The call instruction provides a simple abstract calling convention, through the use of virtual register or constant operands. The actual parameter passing and stack adjustment operations are hidden by this abstract, but low-level, instruction.

Pointer size and endianness of a hardware implementation are difficult to abstract away completely. Type-safe programs compiled to LLVA object code will be automatically portable, without exposing such I-ISA details. Non-type-safe code, however (e.g., machine-dependent code in C that is conditionally compiled for different platforms), requires exposing such details of the actual I-ISA configuration.
For this reason, LLVA includes flags for properties that the source-language compiler can expose to the source program (currently, these are pointer size and endianness). This information is also encoded in the object file so that the translator for a different hardware I-ISA can correctly execute the object code (although this emulation would incur a substantial performance penalty on I-ISAs without hardware support).

3.3. Exception Semantics

Previous experience with virtual processor architectures, particularly DAISY and Transmeta, shows that there are three especially difficult features to emulate in traditional hardware interfaces: load/store dependences, precise exceptions, and self-modifying code. The LLVA V-ISA already simplifies detecting load/store dependences in one key way: the type, control-flow, and SSA information enable sophisticated alias analysis algorithms in the translator, as discussed in Section 5.1. For the other two issues also, we have the opportunity to minimize their impact through good V-ISA design.

Precise exceptions are important for implementing many programming languages correctly (without overly complex or inefficient code), but maintaining precise exceptions greatly restricts the ability of compiler optimizations to reorder code. Static compilers often have knowledge about operations that cannot cause exceptions (e.g., a load of a valid global in C), or operations whose exceptions can be ignored for a particular language (e.g., integer overflow in many languages).

We use two simple V-ISA rules to retain precise exceptions but expose non-excepting operations to the translator:

• Each LLVA instruction defines a set of possible exceptions that can be caused by executing that instruction. Any exception delivered to the program is precise, in terms of the visible state of an LLVA program.

• Each LLVA instruction has a boolean attribute named ExceptionsEnabled. Exceptions generated by an instruction are ignored if ExceptionsEnabled is false for that instruction; otherwise all exception conditions are delivered to the program. ExceptionsEnabled is true by default for load, store and div instructions. It is false by default for all other operations, notably all arithmetic operations.

Note also that the ExceptionsEnabled attribute is a static attribute and is provided in addition to other mechanisms provided by the V-ABI to disable exceptions dynamically at runtime (e.g., for use in trap handlers).

A second attribute for instructions we are considering would allow exceptions caused by the instruction to be delivered without being precise. Static compilers for languages like C and C++ could flag many untrapped exception conditions (e.g., memory faults) in this manner, allowing the translator to reorder such operations more freely (even if the hardware only supported precise exceptions).

3.4. Self-modifying and Self-extending Code

We use the term Self-Modifying Code (SMC) for a program that explicitly modifies its own pre-existing instructions. We use the term Self-Extending Code (SEC) to refer to programs in which new code is added at runtime, but that do not modify any pre-existing code. SEC encompasses several behaviors such as class loading in Java [17], function synthesis in higher-order languages, and program-controlled dynamic code generation. SEC is generally much less problematic for virtual architectures than SMC. Furthermore, most commonly cited examples of "self-modifying code" (e.g., dynamic code generation for very high performance kernels or dynamic code loading in operating systems and virtual machines) are really examples of SEC rather than SMC. Nevertheless, SMC can be useful for implementing runtime code modifications in certain kinds of tools such as runtime instrumentation tools or dynamic optimization systems.

LLVA allows arbitrary SEC, and allows a constrained form of SMC that exploits the execution model for the V-ISA. In particular, a program may modify its own (virtual) instructions via a set of intrinsic functions, but such a change only affects future invocations of that function, not any currently active invocations. This ensures that SMC can be implemented efficiently and easily by the translator, simply by marking the function's generated code invalid, forcing it to be regenerated the next time the function is invoked.

3.5. Support for Operating Systems

LLVA uses two key mechanisms to support operating systems and user-space applications: intrinsic functions and a privileged bit. LLVA uses a small set of intrinsic functions to support operations like manipulating page tables and other kernel operations. These intrinsics are implemented by the translator for a particular target. Intrinsics can be defined to be valid only if the privileged bit is set to true, otherwise causing a kernel trap. A trap handler is an ordinary LLVA function with two arguments: the trap number and a pointer of type void* to pass in additional information to the handler. Trap handlers can refer to the register state of an LLVM program using a standard, program-independent register numbering scheme for virtual registers. Other intrinsic functions can be used to traverse the program stack and scan stack frames in an I-ISA-independent manner, and to register the entry points for trap handlers.

3.6. Possible Extensions to the V-ISA

There are two important kinds of functionality that could be added to the V-ISA. First, the architecture certainly requires definition of synchronization operations and a memory model to support parallel programs (these primitives are difficult to make universal, and thus may have to be defined with a family of implementations in mind).
Second, packed operations (also referred to as subword parallelism) are valuable to media and signal-processing codes. These operations must be encoded in the V-ISA because it is difficult for the translator to automatically synthesize them from ordinary sequential code. Finally, we are developing V-ISA extensions that provide machine-independent abstractions for chip parallelism. These extensions could be valuable as explicit on-chip parallelism becomes more prevalent (e.g., [33, 21, 31]), raising potentially serious challenges for preserving portability while achieving the highest possible performance across different generations of processors.

[Figure 3 (diagram): application and operating system software in V-ISA form sit above the Execution Manager (code generation, profiling, and static/dynamic optimization, with optional translator components), which emits I-ISA code for the hardware processor and can cache I-ISA code and profile info in kernel storage.]
Figure 3. The LLVA execution manager and interface to offline storage.

4. Translation Strategy

The goals of our translation strategy are (a) to minimize the need for online translation, and (b) to exploit the novel optimization capabilities enabled by a rich, persistent code representation. This paper does not aim to develop new optimization techniques. We are developing such techniques in ongoing research, as part of a complete framework for lifelong code optimization on ordinary processors [26]. Here, we focus on the VISC translation strategy and on the implications of the optimization capabilities for VISC designs. We begin by describing the "on-chip" runtime execution engine (LLEE) that manages the translation process. We focus in particular on strategies by which it interacts with the surrounding software system to get access to offline storage and enable offline translation. We then describe how the translation strategy exploits the optimization capabilities enabled by a rich persistent code representation.

4.1. LLEE: OS-Independent Translation System

We distinguish two scenarios with different primary constraints on the translation system. The first is when a processor is designed or optimized for a particular OS (e.g., PowerPCs customized for AS/400 systems running IBM's OS/400 [9]). For a VISC processor in such a scenario, the translator can live in offline storage as part of the OS, it can be invoked to perform offline translation, and it can use OS-specific interfaces directly to read and write translations and profile information to offline storage. It can exploit all the optimization mechanisms enabled by the V-ISA, described below. Such a processor should obtain all the benefits of a VISC design without any need for online translation.

More commonly, however, a processor is designed with no assumptions about the OS or available storage. The lack of such knowledge places constraints on the translator, as can be seen in DAISY's and Crusoe's translation schemes [11, 14]. Not only is the entire translator program located in ROM, but the translated code and any associated profile information live only in memory and are never cached in persistent storage between executions of a program. Consequently, programs are always translated online after being launched, if the translation does not exist in an in-memory cache.

We propose a translation strategy for such a situation that can enable offline translation and caching, if an OS ported to LLVA chooses to exploit it. We have developed a transparent execution environment called LLEE that embodies this strategy, though it is currently implemented at user-level on a standard POSIX system, as described below. It is depicted in Figure 3.

The LLEE translation strategy can be summarized as "offline translation when possible, online translation whenever necessary." A subset (perhaps all) of the translator sufficient for translation and some set of optimizations would live in ROM or flash memory on the processor chip. It is invoked only by LLEE. The V-ABI defines a standard, OS-independent interface with a set of routines that enables LLEE to read, write, and validate data in offline storage. This interface is the sole "gateway" that LLEE could use to call into the OS. An OS ported to LLVA can choose to implement these routines for higher performance, but they are strictly optional and the system will operate correctly in their absence.

Briefly, the basic gateway includes routines to create, delete, and query the size of an offline cache, read or write a vector of N bytes tagged by a unique string name from/to a cache, and check a timestamp on an LLVA program or on a cached vector. Because these routines are implemented by the OS, and so cannot be linked into the translator, we also define one special LLVA intrinsic routine (recall that an intrinsic is a function implemented by the translator) that the OS can use at startup to register the address of the gateway routine with the translator. This gateway routine can then be called directly by the translator to query the addresses of other gateway routines, also at startup. This provides a simple but indefinitely extensible linkage mechanism between translator and OS.
LLEE orchestrates the translation process as follows. When the OS loads and transfers control to an LLVA executable in memory, LLEE is invoked by the processor hardware. If the OS gateway has been implemented, LLEE uses it to look for a cached translation of the code, checks its timestamp if it exists, and reads it into memory if the translation is not out of date. If successful, LLEE performs relocation as necessary on the native code and then transfers control to it directly. If any condition fails, LLEE invokes the JIT compiler on the entry function. Any new translated code generated by the JIT compiler can be written back to the offline cache if the gateway is available. During idle times, the OS can notify LLEE to perform offline translation of an LLVA program by initiating "execution" as above, but flagging it for translation and not actual execution.

Our implementation of LLEE is faithful to this description except: (a) LLEE is a user-level shared library that is loaded when starting a shell. This library overrides execve() with a new version that recognizes LLVA executables and either invokes the JIT on them or executes the cached native translations from the disk, using a user-level version of our gateway. (b) Both the JIT and offline compilers are ordinary programs running on Solaris and Linux, and the offline compiler reads and writes disk files directly. (c) LLVA executables can invoke native libraries not yet compiled to LLVA, e.g., the X11 library.

4.2. Optimization Strategy

The techniques above make it possible to perform offline translation for LLVA executables, even with a completely OS-independent processor design. There are also important new optimization opportunities created by the rich V-ISA code representation that a VISC architecture can exploit, but most of which are difficult for programs compiled directly to native code. These include:

1. Compile-time and link-time machine-independent optimization (outside the translator).
2. Install-time, I-ISA-specific optimization (before translation).
3. Runtime, trace-driven machine-specific optimization.
4. "Idle-time" (between executions) profile-guided, machine-specific optimization using profile information reflecting actual end-user behavior.

As noted earlier, the LLVA representation allows substantial optimization to be performed before translation, minimizing optimization that must be performed online. Of this, optimization at link-time is particularly important because it is the first time that most or all modules of an application are simultaneously available, without requiring changes to application Makefiles and without sacrificing the key benefits of separate compilation. In fact, many commercial compilers today perform interprocedural optimization at link-time, by exporting their proprietary compiler internal representation during static compilation [3, 20]. Such compiler-specific solutions are unnecessary with LLVA because it retains rich enough information to support extensive optimizations, as demonstrated in Section 5.1.

Install-time optimization is just an application of the translator's optimization and code generation capabilities to generate carefully tuned code for a particular system configuration. This is a direct benefit of retaining a rich code representation until software is installed, while still retaining the ability to do offline code generation.

Unlike other trace-driven runtime optimizers for native binary code, such as Dynamo [4], we have both the rich V-ISA and a cooperating code generator. Our V-ISA provides us with the ability to perform static instrumentation to assist runtime path profiling, and to use the CFG at runtime to perform path profiling within frequently executed loop regions while avoiding interpretation. It also lets us develop an aggressive optimization strategy that operates on traces of LLVA code corresponding to the hot traces of native code. We have implemented the tracing strategy and software trace cache, including the ability to gather cross-procedure traces [26], and we are now developing runtime optimizations that exploit these traces.

The rich information in LLVA also enables "idle-time" profile-guided optimization (PGO) using the translator's optimization and code generation capabilities. The important advantage is that this step can use profile information gathered from executions on an end-user's system. This has three distinct advantages over static PGO: (a) the profile information is more likely to reflect end-user behavior than hypothetical profile information generated by developers using predicted input sets; (b) developers often do not use profile-guided optimization or do so only in limited ways, whereas "idle-time" optimization can be completely transparent to users, if combined with low-overhead profiling techniques; and (c) idle-time optimization can combine profile information with detailed information about the user's specific system configuration.

5. Initial Evaluation

We believe the performance implications of a Virtual ISA design cannot be evaluated meaningfully without (at least) a processor design with hardware mechanisms that support translation and optimization [11], and (preferably) basic cooperative hardware/software mechanisms that exploit the design. Since the key contribution of this paper is the design of LLVA, we focus on evaluating the features of this design. In particular, we consider the two questions listed in the Introduction: does the representation enable high-level analysis and optimizations, and is the representation low-level enough to closely match with hardware and to be translated efficiently?
5.1. Supporting High Level Optimizations

The LLVA code representation presented in this paper is also used as the internal representation of a sophisticated compiler framework we call Low Level Virtual Machine (LLVM) [26]. LLVM includes front-ends for C and C++ based on GCC, code generators for both Intel IA-32 and SPARC V9 (each can be run either offline or as a JIT compiling functions on demand), a sophisticated link-time optimization system, and a software trace cache. Compared with the instruction set in Section 3, the differences in the compiler IR are: (a) the compiler extracts type information for memory allocation operations and converts them into typed malloc and free instructions (the back-ends translate these back into the library calls), and (b) the ExceptionsEnabled bit is hardcoded based on instruction opcode. The compiler system uses equivalent internal and external representations, avoiding the need for complex translations at each stage of the compilation process.

The compiler uses the virtual instruction set for a variety of analyses and optimizations including many classical dataflow and control-flow optimizations, as well as more aggressive link-time interprocedural analyses and transformations. The classical optimizations directly exploit the control-flow graph, SSA representation, and several choices of pointer analysis. They are usually performed on a per-module basis, before linking the different LLVA object code modules, but can be performed at any stage of a program's lifetime where LLVA code is available.

We also perform several novel interprocedural techniques using the LLVA representation, all of which operate at link-time. Data Structure Analysis is an efficient, context-sensitive pointer analysis, which computes both an accurate call graph and points-to information. Most importantly, it is able to identify information about logical data structures (e.g., an entire list, hashtable, or graph), including disjoint instances of such structures, their lifetimes, their internal static structure, and external references to them. Automatic Pool Allocation is a powerful interprocedural transformation that uses Data Structure Analysis to partition the heap into separate pools for each data structure instance [25]. Finally, we have shown that the LLVA representation is rich enough to perform complete, static analysis of memory safety for a large class of type-safe C programs [24, 13]. This work uses both the techniques above, plus an interprocedural array bounds check removal algorithm [24] and some custom interprocedural dataflow and control flow analyses [13].

The interprocedural techniques listed above are traditionally considered very difficult even on source-level imperative languages, and are impractical for machine code. In fact, all of these techniques fundamentally require type information for pointers, arrays, structures and functions in LLVA plus the Control Flow Graph. The SSA representation significantly improves both the precision and speed of the analyses and transformations. Overall, these examples amply demonstrate that the virtual ISA is rich enough to support powerful (language-independent) compiler tasks traditionally performed only in source-level compilers.

5.2. Low-level Nature of the Instruction Set

Table 2 presents metrics to evaluate the low-level nature of the LLVA V-ISA. The benchmarks we use include the PtrDist benchmarks [2] and the SPEC CINT2000 benchmarks (we omit three SPEC codes because their LLVA object code versions fail to link currently). The first two columns in the table list the benchmark names and the number of lines of C source code for each.

Columns 3 and 4 in the table show the fully linked code sizes for a statically compiled native executable and for the LLVA object program. The native code is generated from the LLVA object program using our static back end for SPARC V9. These numbers are comparable because the same LLVA optimizations were applied in both cases. The numbers show that the virtual object code is significantly smaller than the native code, roughly 1.3x to 2x for the larger programs in the table (the smaller programs have even larger ratios).² Overall, despite containing extra type and control flow information and using SSA form, the virtual code is still quite compact for two reasons. First, most instructions usually fit in a single 32-bit word. Second, the virtual code does not include verbose machine-specific code for argument passing, register saves and restores, loading large immediate constants, etc.

The next five columns show the number of LLVA instructions, the total number of machine instructions generated by the X86 back-end, and the ratio of the latter to the former (also for SPARC). This back-end performs virtually no optimization and very simple register allocation, resulting in significant spill code. Nevertheless, each LLVA instruction translates into very few I-ISA instructions on average: about 2-3 for X86 and 3-4 for SPARC V9. Furthermore, all LLVA instructions are translated directly to native machine code – no emulation routines are used at all. These results indicate that the LLVA instruction set uses low-level operations that match closely with native hardware instructions.

Finally, the last three columns in the table show the total code generation time taken by the X86 JIT compiler to compile the entire program (regardless of which functions are actually executed), the total running time of each program when compiled natively for X86 using gcc -O3, and the ratio of the two. As the table shows, the JIT compilation times are negligible, except for large codes with short running time. Furthermore, this behavior should extend to much larger programs as well because the JIT translates functions on demand, so that unused code is not translated (we show the compilation time for the entire program, since that makes the data easier to understand).

² The GCC compiler generates more compact SPARC V8 code, which is roughly equal in size to the bytecode [26].
Program | #LOC | Native size (KB) | LLVM code size (KB) | #LLVM Inst. | #X86 Inst. | X86/LLVM Ratio | #SPARC Inst. | SPARC/LLVM Ratio | Translate Time (s) | Run time (s) | Translate/Run Ratio
ptrdist-anagram 647 21.7 10.7 776 1817 2.34 2550 3.29 0.0078 1.317 0.006
ptrdist-ks 782 24.9 12.1 1059 2732 2.58 4446 4.20 0.0039 1.694 0.002
ptrdist-ft 1803 20.9 10.1 799 1990 2.49 2818 3.53 0.0117 2.797 0.004
ptrdist-yacr2 3982 58.3 36.5 4279 10881 2.54 12252 2.86 0.0429 2.686 0.016
ptrdist-bc 7297 112.0 74.4 7276 19286 2.65 25697 3.53 0.1308 1.307 0.100
179.art 1283 37.8 17.9 2027 5385 2.66 7031 3.47 0.0253 114.723 0.000
183.equake 1513 44.4 23.9 2863 6409 3.14 8275 2.89 0.0273 18.005 0.002
181.mcf 2412 32.0 17.3 2039 4707 2.31 4601 2.26 0.0175 24.516 0.001
256.bzip2 4647 73.5 55.7 5103 11984 2.35 14157 2.77 0.0371 20.896 0.002
164.gzip 8616 94.0 68.6 7594 17500 2.30 20880 2.75 0.0527 19.332 0.003
197.parser 11391 223.0 175.3 17138 41671 2.43 57274 3.34 0.1601 4.718 0.034
188.ammp 13483 265.1 163.2 21961 53529 2.44 67679 3.08 0.1074 58.758 0.002
175.vpr 17729 331.0 184.4 18041 58982 3.27 74696 4.14 0.1425 7.924 0.018
300.twolf 20459 487.7 330.0 45017 104613 2.32 119691 2.66 0.0156 9.680 0.002
186.crafty 20650 555.5 336.4 34080 104093 3.05 110630 3.25 0.4531 15.408 0.029
255.vortex 67223 976.3 719.3 72039 195648 2.72 224488 3.12 0.7773 6.753 0.115
254.gap 71363 1088.1 854.4 111482 246102 2.21 272483 2.44 0.4824 3.729 0.129
Table 2. Metrics demonstrating code size and low-level nature of the V-ISA
Overall, this data shows that it is possible to do a very fast, non-optimizing translation of LLVA code to machine code at very low cost. Any support to translate code offline and/or to cache translated code offline should further reduce the impact of this translation cost.

Overall, both the instruction count ratio and the JIT compilation times show that the LLVA V-ISA is very closely matched to hardware instruction sets in terms of the complexity of the operations, while the previous subsection showed that it includes enough high-level information for sophisticated compiler optimizations. This combination of high-level information with low-level operations is the crucial feature that (we believe) makes the LLVA instruction set a good design for a Virtual Instruction Set Architecture.

6. Related Work

Virtual machines of different kinds have been widely used in many software systems, including operating systems (OS), language implementations, and OS and hardware emulators. These uses do not define a Virtual ISA at the hardware level, and therefore do not directly benefit processor design (though they may influence it). The challenges of using two important examples – the Java Virtual Machine and Microsoft CLI – as a processor-level virtual ISA were discussed in the Introduction.

We know of four previous examples of VISC architectures, as defined in Section 1: the IBM System/38 and AS/400 family [9], the DAISY project at IBM Research [14], Smith et al.'s proposal for Codesigned Virtual Machines in the Strata project [32], and Transmeta's Crusoe family of processors [23, 11]. All of these distinguish the virtual and physical ISAs as a fundamental processor design technique. To our knowledge, however, none except the IBM S/38 and AS/400 have designed a virtual instruction set for use in such architectures.

The IBM AS/400, building on early ideas in the S/38, defined a Machine Interface (MI) that was very high-level, abstract and hardware-independent (e.g., it had no registers or storage locations). It was the sole interface for all application software and for much of OS/400. Their design, however, differed from ours in fundamental ways, and hence does not meet the goals we laid out in Section 2. Their MI was targeted at a particular operating system (the OS/400), it was designed to be implemented using complex operating system and database services and not just a translator, and it was designed to best support a particular workload class, viz., commercial database-driven workloads. It also had a far more complex instruction set than ours (or any CISC processor), including string manipulation operations, and "object" manipulation operations for 15 classes of objects (e.g., programs and files). In contrast, our V-ISA is philosophically closer to modern processor instruction sets in being a minimal, orthogonal, load/store architecture; it is OS-independent and requires no software other than a translator; and it is designed to support modern static and dynamic optimization techniques for general-purpose software.

DAISY [14] developed a dynamic translation scheme for emulating multiple existing hardware instruction sets (PowerPC, Intel IA-32, and S/390) on a VLIW processor. They developed a novel translation scheme with global VLIW scheduling fast enough for online use, and hardware extensions to assist the translation. Their translator operated on a page granularity. Both the DAISY and Transmeta translators are stored entirely in ROM on-chip. Because they focus on existing V-ISAs with existing OS/hardware interface specifications, they cannot assume any OS support and thus cannot cache any translated code or profile information in off-processor storage, or perform any offline translation.

Transmeta's Crusoe uses a dynamic translation scheme to emulate Intel IA-32 instructions on a VLIW hardware processor [23]. The hardware includes important supporting mechanisms such as shadowed registers and a gated store buffer for speculation and rollback recovery on exceptions, and alias detection hardware in the load/store pipeline. Their translator, called Code Morphing Software (CMS), exploits these hardware mechanisms to reorder instructions aggressively in the presence of the challenging features identified in Section 3.3, namely, precise exceptions, memory dependences, and self-modifying code (as well as memory-mapped I/O) [11]. They use a trace-driven reoptimization scheme to optimize frequently executed dynamic sequences of code. Crusoe does not perform any offline translation or offline caching, as noted above.

Smith et al. in the Strata project have recently, but perhaps most clearly, articulated the potential benefits of VISC processor designs, particularly the benefits of co-designing the translator and a hardware processor with an implementation-dependent ISA [32]. They describe a number of examples illustrating the flexibility hardware designers […]

[…] optimization strategy but without the benefits of a rich V-ISA. Many JIT compilers for Java, Self, and other languages combine fast initial compilation with adaptive reoptimization of "hot" methods (e.g., see [1, 6, 18, 34]). Finally, many hardware techniques have been proposed for improving the effectiveness of dynamic optimization [27, 30, 35]. When combined with a rich V-ISA that supports more effective program analyses and transformations, these software and hardware techniques can further enhance the benefits of VISC architectures.

7. Conclusions and Future Work

Trends in modern processors indicate that CPU cycles and raw transistors are becoming increasingly cheap, while control complexity, wire delays, power, reliability, and testing cost are becoming increasingly difficult to manage. Both trends favor virtual processor architectures: the extra CPU cycles can be spent on software translation, the extra transistors can be spent on mechanisms to assist that translation,
ers could derive from this strategy. They have also devel- and a cooperative hardware/software design supported by a
oped several hardware mechanisms that could be valuable rich virtual program representation could be used in numer-
for implementing such architectures, including relational ous ways to reduce hardware complexity and potentially in-
profiling [19], a microarchitecture with a hierarchical reg- crease overall performance.
ister file for instruction-level distributed processing [22], This paper presented LLVA, a design for a language-
and hardware support for working set analysis [12]. They independent, target-independent virtual ISA. The instruc-
do not propose a specific choice of V-ISA, but suggest that tion set is low-level enough to map directly and closely to
one choice would be to use Java VM as the V-ISA (an op- hardware operations but includes high-level type, control-
tion we discussed in the Introduction). flow and dataflow information needed to support sophis-
Previous authors have developed Typed Assembly Lan- ticated analysis and optimization. It includes novel mech-
guages [28, 7] with goals that generally differ significantly anisms to overcome the difficulties faced by previous vir-
from ours. Their goals are to enable compilation from tual architectures such as DAISY and Transmeta’s Crusoe,
strongly typed high-level languages to typed assembly lan- including a flexible exception model, minor constraints on
guage, enabling sound (type preserving) program transfor- self-modifying code to dovetail with the compilation strat-
mations, and to support program safety checking. Their type egy, and an OS-independent interface to access offline stor-
systems are higher-level than ours, because they attempt age and enable offline translation.
to propagate significant type information from source pro- Evaluating the benefits of LLVA requires a long-term
grams. In comparison, our V-ISA uses a much simpler, low- research program. We have three main goals in the near
level type system aimed at capturing the common low-level future: (a) Develop and evaluate cooperative (i.e., code-
representations and operations used to implement compu- signed) software/hardware design choices that reduce hard-
tations from high-level languages. It is also designed to to ware complexity and assist the translator to achieve high
support arbitrary non-type-safe code efficiently, including overall performance. (b) Extend the V-ISA with machine-
operating system and kernel code. independent abstractions of fine- and medium-grain paral-
Binary translation has been widely used to provide bi- lelism, suitable for mapping to explicitly parallel processor
nary compatibility for legacy code. For example, the FX!32 designs, as mentioned in Section 3.6. (c) Port an existing
tool uses a combination of online interpretation and offline operating system (in incremental steps) to work on top of
profile-guided translation to execute Intel IA-32 code on Al- the LLVA architecture, and explore the OS design implica-
pha processors [8]. Unlike such systems, a VISC architec- tions of such an implementation.
ture makes binary translation an essential part of the design
strategy, using it for all codes, not just legacy codes. Acknowledgements
There is a wide range of work on software and hard- We thank Jim Smith, Sarita Adve, John Criswell and
ware techniques for transparent dynamic optimization of the anonymous referees for their detailed feedback on this
programs. Transmeta’s CMS [11] and Dynamo [4] iden- paper. This work has been supported by an NSF CA-
tify and optimize hot traces at runtime, similar to our re- REER award, EIA-0093426, the NSF Operating Systems
and Compilers program under grant number CCR-9988482, and the SIA's MARCO Focus Center program.

References

[1] A.-R. Adl-Tabatabai, et al. Fast and effective code generation in a just-in-time Java compiler. In PLDI, May 1998.
[2] T. Austin, et al. The pointer-intensive benchmark suite. Available at www.cs.wisc.edu/~austin/ptr-dist.html, Sept 1995.
[3] A. Ayers, S. de Jong, J. Peyton, and R. Schooler. Scalable cross-module optimization. ACM SIGPLAN Notices, 33(5):301–312, 1998.
[4] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In PLDI, pages 1–12, June 2000.
[5] D. Burger and J. R. Goodman. Billion-transistor architectures. Computer, 30(9):46–49, Sept 1997.
[6] M. G. Burke, J.-D. Choi, S. Fink, D. Grove, M. Hind, V. Sarkar, M. J. Serrano, V. C. Sreedhar, H. Srinivasan, and J. Whaley. The Jalapeño Dynamic Optimizing Compiler for Java. In Java Grande, pages 129–141, 1999.
[7] J. Chen, D. Wu, A. W. Appel, and H. Fang. A provably sound TAL for back-end optimization. In PLDI, San Diego, CA, Jun 2003.
[8] A. Chernoff, et al. FX!32: A profile-directed binary translator. IEEE Micro, 18(2):56–64, 1998.
[9] B. E. Clark and M. J. Corrigan. Application System/400 performance characteristics. IBM Systems Journal, 28(3):407–423, 1989.
[10] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. TOPLAS, 13(4):451–490, October 1991.
[11] J. C. Dehnert, et al. The Transmeta Code Morphing Software: Using speculation, recovery and adaptive retranslation to address real-life challenges. In Proc. 1st IEEE/ACM Symp. Code Generation and Optimization, San Francisco, CA, Mar 2003.
[12] A. S. Dhodapkar and J. E. Smith. Managing multi-configuration hardware via dynamic working set analysis. In ISCA, Alaska, May 2002.
[13] D. Dhurjati, S. Kowshik, V. Adve, and C. Lattner. Memory safety without runtime checks or garbage collection. In LCTES, San Diego, CA, Jun 2003.
[14] K. Ebcioglu and E. R. Altman. DAISY: Dynamic compilation for 100% architectural compatibility. In ISCA, pages 26–37, 1997.
[15] J. Fisher. Walk-time techniques: Catalyst for architectural change. Computer, 30(9):46–42, Sept 1997.
[16] R. Ghiya, D. Lavery, and D. Sehr. On the importance of points-to analysis and other memory disambiguation methods for C programs. In PLDI. ACM Press, 2001.
[17] J. Gosling, B. Joy, G. Steele, and G. Bracha. The Java Language Specification, 2nd Ed. Addison-Wesley, Reading, MA, 2000.
[18] D. Griswold. The Java HotSpot Virtual Machine Architecture, 1998.
[19] T. H. Heil and J. E. Smith. Relational profiling: enabling thread-level parallelism in virtual machines. In MICRO, pages 281–290, Monterey, CA, Dec 2000.
[20] IBM Corp. XL FORTRAN: Eight Ways to Boost Performance. White Paper, 2000.
[21] Intel Corp. Special Issue on Intel HyperThreading Technology in Pentium 4 Processors. Intel Technology Journal, Q1, 2002.
[22] H.-S. Kim and J. E. Smith. An instruction set and microarchitecture for instruction level distributed processing. In ISCA, Alaska, May 2002.
[23] A. Klaiber. The Technology Behind Crusoe Processors, 2000.
[24] S. Kowshik, D. Dhurjati, and V. Adve. Ensuring code safety without runtime checks for real-time control systems. In CASES, Grenoble, France, Oct 2002.
[25] C. Lattner and V. Adve. Automatic Pool Allocation for Disjoint Data Structures. In Proc. ACM SIGPLAN Workshop on Memory System Performance, Berlin, Germany, Jun 2002.
[26] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. Tech. Report UIUCDCS-R-2003-2380, Computer Science Dept., Univ. of Illinois at Urbana-Champaign, Sept 2003.
[27] M. C. Merten, A. R. Trick, E. M. Nystrom, R. D. Barnes, and W.-m. W. Hwu. A hardware mechanism for dynamic extraction and relayout of program hot spots. In ISCA, pages 59–70, Jun 2000.
[28] G. Morrisett, D. Walker, K. Crary, and N. Glew. From System F to typed assembly language. TOPLAS, 21(3):528–569, May 1999.
[29] P. Oberoi and G. S. Sohi. Parallelism in the front-end. In ISCA, June 2003.
[30] S. J. Patel and S. S. Lumetta. rePLay: A Hardware Framework for Dynamic Optimization. IEEE Transactions on Computers, Jun 2001.
[31] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, and J. Huh. Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture. In ISCA, June 2003.
[32] J. E. Smith, T. Heil, S. Sastry, and T. Bezenek. Achieving high performance via co-designed virtual machines. In International Workshop on Innovative Architecture (IWIA), 1999.
[33] J. M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le, and B. Sinharoy. The POWER4 system microarchitecture. IBM Journal of Research and Development, 46(1):5–26, 2002.
[34] D. Ungar and R. B. Smith. Self: The power of simplicity. In OOPSLA, 1987.
[35] C. Zilles and G. Sohi. A programmable coprocessor for profiling. In HPCA, Jan 2001.