Abstract
1 Introduction
plications. This paper focuses on the vx32 virtual machine itself, describing its sandboxing technique in detail
and analyzing its performance over a variety of applications, host operating systems, and hardware. On real applications, vx32 consistently executes guest code within
a factor of two of native performance; often the overhead
is just a few percent.
This paper first describes background and related work
in Section 2, then presents the design of vx32 in Section 3. Section 4 evaluates vx32 on its own, then Section 5 evaluates vx32 in the context of the above four
applications, and Section 6 concludes.
2 Related Work
Many experimental operating system architectures permit one user process to isolate and confine others to enforce a principle of least privilege: examples include
capability systems [25], L3's clan/chief model [26],
Fluke's nested process architecture [14], and generic
software wrappers [15]. The primary performance cost
of kernel-mediated sandboxes like these is that of traversing hardware protection domains, though with careful
design this cost can be minimized [27]. Other systems
permit the kernel itself to be extended with untrusted
code, via domain-specific languages [31], type-safe languages [5], proof-carrying code [32], or special kernel-space protection mechanisms [40]. The main challenge
in all of these approaches is deploying a new operating
system architecture and migrating applications to it.
Other work has retrofitted existing kernels with sandboxing mechanisms for user processes, even taking advantage of x86 segments much as vx32 does [8]. These
mechanisms still require kernel modifications, however,
which are not easily portable even between different x86-based OSes. In contrast, vx32 operates entirely in user
space and is easily portable to any operating system that
provides standard features described in Section 3.
System call interposition, a sandboxing method implemented by Janus [19] and similar systems [7, 17, 18, 22,
36], requires minor modifications to existing kernels to
provide a means for one user process to filter or handle
selected system calls made by another process. Since
the sandboxed process's system calls are still fielded by
the host OS before being redirected to the user-level
supervisor process, system call interposition assumes
that the sandboxed process uses the same basic system
call API as the host OS: the supervisor process cannot efficiently export a completely different (e.g., OS-independent) API to the sandboxed process as a vx32
host application can. Some system call interposition
methods also have concurrency-related security vulnerabilities [16, 43], whose only clear solution is delegation-based interposition [17]. Although vx32 has other uses,
After finding the appropriate descriptor table entry, the processor checks permission bits (read, write, and execute) and
compares the virtual address of the requested memory
access against the segment limit in the descriptor table,
throwing an exception if any of these checks fail. Finally, the processor adds the segment base to the virtual
address to form the linear address that it subsequently
uses for page translation. Thus, a normal segment with
base b and limit l permits memory accesses at virtual addresses between 0 and l, and maps these virtual addresses
to linear addresses from b to b+l. Today's x86 operating
systems typically make segmentation translation a no-op
by using a base of 0 and a limit of 2^32 − 1. Even in this so-called "flat model," the processor continues to perform
segmentation translation: it cannot be disabled.
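This translation rule is easy to state as code. The following C sketch models the limit and permission checks just described; it illustrates the hardware's behavior and is not vx32 code:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Schematic model of the x86 segmentation check applied to
     * every memory access.  Illustrative only. */
    struct segment {
        uint32_t base;    /* b: added to the virtual address */
        uint32_t limit;   /* l: highest valid virtual address */
        int      writable;
    };

    static void fault(const char *why)
    {
        fprintf(stderr, "general protection fault: %s\n", why);
        exit(1);
    }

    /* Map virtual address va to the linear address used for paging. */
    static uint32_t translate(const struct segment *s, uint32_t va, int is_write)
    {
        if (va > s->limit)                  /* limit check */
            fault("segment limit exceeded");
        if (is_write && !s->writable)       /* permission check */
            fault("write to read-only segment");
        return s->base + va;                /* linear = base + virtual */
    }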
Vx32 allocates two segments in the host application's
LDT for each guest instance: a guest data segment and a
guest control segment, as depicted in Figure 1.
The guest data segment corresponds exactly to the
guest instance's address space: the segment base points
to the beginning of the address space (address 0 in the
guest instance), and the segment size is the guest's address space size. Vx32 executes guest code with the
processor's ds, es, and ss registers holding the selec-
tor for the guest data segment, so that data reads and
writes performed by the guest access this segment by default. (Code sandboxing, described below, ensures that
guest code cannot override this default.) The segmentation hardware ensures that the address space appears at
address 0 in the guest and that the guest cannot access
addresses past the end of the segment. The translation
also makes it possible for the host to unmap a guest's address space when it is not in use and remap it later at a
different host address, to relieve congestion in the host's
address space, for example.
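On Linux, for instance, a host process can install such a descriptor itself through the modify_ldt system call. The following sketch shows one plausible way to do so; it is not vx32's actual code, and entry allocation and error handling are elided:

    #include <asm/ldt.h>        /* struct user_desc */
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Install LDT entry `entry` describing a guest data segment that
     * starts at host address `base` and is `size` bytes long (`size`
     * must be a positive multiple of 4KB).  Returns the selector to
     * load into ds/es/ss, or -1 on error.  Sketch only. */
    static int install_guest_data_segment(int entry, void *base, uint32_t size)
    {
        struct user_desc d = {
            .entry_number    = entry,
            .base_addr       = (uintptr_t)base,  /* guest address 0 */
            .limit           = (size >> 12) - 1, /* limit, in pages */
            .seg_32bit       = 1,
            .contents        = 0,                /* data, expand-up */
            .read_exec_only  = 0,                /* readable and writable */
            .limit_in_pages  = 1,
            .seg_not_present = 0,
            .useable         = 1,
        };
        if (syscall(SYS_modify_ldt, 1, &d, sizeof d) < 0)
            return -1;
        return (entry << 3) | 4 | 3;  /* index, table=LDT, RPL=3 */
    }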
The format of the guest data segment is up to vx32's
client: vx32 requires only that it be a contiguous, page-aligned range of virtual memory within the host address
space. Vx32 provides a loader for ELF executables [41],
but clients can load guests by other means. For example,
Plan 9 VX (see Section 5.3) uses mmap and mprotect to
implement demand loading of Plan 9 executables.
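Demand loading of this sort can be built from just those two calls: reserve the guest's address space with no access rights, then grant and fill pages on first touch, for example from a SIGSEGV handler. The sketch below assumes a hypothetical load_page helper and is not Plan 9 VX's actual implementation:

    #include <stddef.h>
    #include <sys/mman.h>

    extern void load_page(void *page);  /* hypothetical: fill page from image */

    /* Reserve a guest address space without committing any pages;
     * every touch faults until the page is demand-loaded. */
    void *reserve_guest_space(size_t size)
    {
        return mmap(NULL, size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }

    /* Called on the first touch of `page`: commit it, then fill it. */
    void demand_load(void *page, size_t pagesize)
    {
        mprotect(page, pagesize, PROT_READ | PROT_WRITE);
        load_page(page);
    }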
The guest control segment, shown in Figure 2, contains the data needed by vx32 during guest execution.
The segment begins with a fixed data structure containing saved host registers and other data. The entrypoint
hash table and code fragment cache make up most of the
segment. The hash table maps guest virtual addresses to
code sequences in the code fragment cache. The translated code itself needs to be included in the guest control segment so that vx32 can write to it when patching
previously-translated unconditional branches to jump directly to their targets [38].
Vx32 executes guest code with the processor's fs or
gs register holding the selector for the guest control segment. The vx32 runtime accesses the control segment by
specifying a segment override on its data access instructions. Whether vx32 uses fs or gs depends on the host
system, as described in the next section.
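Put together, the control segment can be pictured as a structure like the one below. The field names, sizes, and offsets are our own illustration, not vx32's actual layout (the fs:[...] offsets in Figure 3 hint at the real one):

    #include <stdint.h>

    /* Illustrative layout of the guest control segment (Figure 2).
     * All names and sizes here are hypothetical. */
    struct guest_control {
        uint32_t saved_host_regs[8]; /* host state saved around guest runs */
        uint32_t eax_spill;          /* guest eax slot (cf. fs:[0x20]) */
        uint32_t ebx_spill;          /* guest ebx slot (cf. fs:[0x2c]) */
        uint32_t trap_eip;           /* guest eip at a virtual trap */
        uint32_t entry_hash[1024];   /* guest eip -> fragment offset */
        uint8_t  fragment_cache[];   /* translated code; written when
                                        backpatching direct branches */
    };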
1. Scan. The scan phase records the length, original offset, instruction type, and worst-case translated size in the
hint table. Jumps are the only instructions whose
translated size is not known exactly at this point.
2. Simplify. The next phase scans the hint table for direct branches within the fragment being translated;
it marks the ones that can be translated into short intrafragment branches using 8-bit jump offsets. After
this phase, the hint table contains the exact size of
the translation for each original guest instruction.
3. Place. Using the now-exact hint table information,
the translator computes the exact offset of each instructions translation. These offsets are needed to
emit intrafragment branches in the last phase.
4. Emit. The final phase writes the translation into
the code fragment cache. For most instructions, the
translation is merely a copy of the original instruction; for unsafe guest instructions, the translation
is an appropriate sequence chosen by vx32.
Vx32 saves the hint table, at a cost of four bytes per
original instruction, in the code fragment cache alongside each translation, for use in exception handling as described in Section 3.4. The hint table could be discarded
and recomputed during exception handling, trading exception handling performance for code cache space.
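In code, the pipeline looks roughly as follows. The hint-entry fields and function names are illustrative (vx32 packs each hint into four bytes, as noted above), and the four phases are shown only as prototypes:

    #include <stdint.h>

    /* One hint-table entry per original guest instruction.  vx32 packs
     * this into four bytes; the exact fields here are illustrative. */
    struct hint {
        uint8_t srclen;  /* length of the original instruction */
        uint8_t type;    /* computation, direct branch, indirect, ... */
        uint8_t dstlen;  /* translated size: worst-case, then exact */
        uint8_t flags;   /* e.g. "an 8-bit branch offset suffices" */
    };

    struct fragment;                    /* one straight-line code run */
    void scan(struct fragment *f);      /* 1: decode, worst-case sizes */
    void simplify(struct fragment *f);  /* 2: shorten intrafragment branches */
    void place(struct fragment *f);     /* 3: exact translation offsets */
    void emit(struct fragment *f);      /* 4: write into fragment cache */

    void translate(struct fragment *f)
    {
        scan(f);
        simplify(f);
        place(f);
        emit(f);
    }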
The rest of this section discusses specific types of
guest instructions. Figure 3 shows concrete examples.
Computational code. Translation leaves most instructions intact. All ordinary computation and data access
instructions (add, mov, and so on) and even floating-point
and vector instructions are safe from vx32's perspective, requiring no translation, because the segmentation
hardware checks all data reads and writes performed by
these instructions against the guest data segment's limit.
The only computation instructions that vx32 does not
permit the guest to perform directly are those with x86
segment override prefixes, which change the segment
register used to interpret memory addresses and could
thus be used to escape the data sandbox.
Guest code may freely use all eight general-purpose
registers provided by the x86 architecture: vx32 avoids
both the dynamic register renaming and spilling of translation engines like Valgrind [34] and the static register
usage restrictions of SFI [42]. Allowing guest code to
use all the registers presents a practical challenge for
vx32, however: it leaves no general-purpose register
available where vx32 can store the address of the saved
host registers for use while entering or exiting guest
code. As mentioned above, vx32 solves this problem by
placing the information in the guest control segment and
using an otherwise-unused segment register (fs or gs)
to address it. (Although vx32 does not permit segment override prefixes in guest code, the translations vx32 itself generates use fs or gs overrides to reach the guest control segment, as described above.)
[Figure 3 panels, abridged; the original figure lists full instruction sequences. Each panel pairs guest code (at guest addresses such as 08048160) with its translation in the code fragment cache (at host addresses such as b7d8d0f9):
(a) An indirect jump, jmp [0x08049248]: the translation uses fs:[0x2c] as a scratch slot for ebx, loads the branch target into ebx, and jumps to vxrun_lookup_indirect.
(b) A direct jump, jmp 0x08048080: the translated jump initially lands on a stub that records an identifier in fs:[0x5c] and enters vxrun_lookup_backpatch; the guest target (0x08048080) and patch site (0xb7d8d105) are stored after the stub, which is later overwritten to jump directly to the target's translation.
(c) A return, ret: the translation spills ebx, pops the guest return address into ebx, and enters vxrun_lookup_indirect.
(d) An indirect call, call [0x08049248]: as for an indirect jump, after pushing the guest return address (0x08048166).
(e) A direct call, call 0x8048080: as for a direct jump, after pushing the guest return address (0x8048165).
(f) A system call, int 0x30: the translation saves the guest eax into the guest control segment, loads the virtual trap number into eax (the 0x200 bit indicates an int instruction), saves the next eip into the guest control segment, and then jumps to the virtual trap handler, which will stop the execution loop and return from vx32, letting the library's caller handle the trap.
(g) An unsafe or illegal instruction, mov ds, ax: the translation saves eax, loads trap number 0x006, records the faulting eip (0x8048160), and jumps to vxrun_gentrap.]
Figure 3: Guest code and vx32 translations. Most instructions (arithmetic, data moves, and so on) are unchanged by translation.
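The common slow path in these examples is vxrun_lookup_indirect, which maps a guest target address to its translation through the entrypoint hash table described earlier. In C it amounts to something like the following; the real routine is a short assembly sequence, and the hash function and table shape here are illustrative:

    #include <stdint.h>

    enum { NSLOTS = 1024 };  /* illustrative table size */

    struct entry { uint32_t guest_eip; void *fragment; };
    struct guest { struct entry entry_hash[NSLOTS]; };

    /* Slow path: translate the code at eip, insert it in the table. */
    void *translate_and_insert(struct guest *g, uint32_t eip);

    /* Map a guest eip to translated code, translating on a miss. */
    void *lookup_fragment(struct guest *g, uint32_t eip)
    {
        struct entry *e = &g->entry_hash[(eip * 2654435761u) % NSLOTS];
        if (e->guest_eip == eip)
            return e->fragment;               /* hit: enter the cache */
        return translate_and_insert(g, eip);  /* miss */
    }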
4 Vx32 Evaluation
This section evaluates vx32 in isolation, comparing
vx32's execution against native execution through microbenchmarks and whole-system benchmarks. Section 5 evaluates vx32 in the context of real applications.
Both sections present experiments run on a variety of test
machines, listed in Figure 4.
4.1 Implementation complexity
The vx32 sandbox library consists of 3,800 lines of C
(1,500 semicolons) and 500 lines of x86 assembly language. The code translator makes up about half of the
C code. Vx32 runs on Linux, FreeBSD, and Mac OS X
without kernel modifications or access to privileged operating system features.
In addition to the library itself, the vx32 system provides a GNU compiler toolchain and a BSD-derived C
library for optional use by guests hosted by applications
that provide a Unix-like system call interface. Host applications are, of course, free to use their own compilers
and libraries and to design new system call interfaces.
4.2 Microbenchmarks
To understand vx32's performance costs, we wrote a
small suite of microbenchmarks exercising illustrative
cases. Figure 5 shows vx32's performance on these tests.
Jump. This benchmark repeats a sequence of 100 no-op short jumps. Because a short jump is only two bytes,
the targets are only aligned on 2-byte boundaries. In contrast, vx32's generated fragments are aligned on 4-byte
boundaries. The processors we tested vary in how sensitive they are to jump alignment, but almost all run considerably faster on vx32's 4-byte aligned jumps than the
2-byte jumps in the native code. The Pentium 4 and the
Xeon are unaffected.
Jumpal. This benchmark repeats a sequence of 100
short jumps that are spaced so that each jump target is
aligned on a 16-byte boundary. Most processors execute
vx32's equivalent 4-byte aligned jumps a little slower.
The Pentium 4 and Xeon are, again, unaffected.
Jumpfar. This benchmark repeats a sequence of 100
jumps spaced so that each jump target is aligned on a
4096-byte (page) boundary. This is a particularly hard
Label            CPU(s)                        RAM  Operating System
Athlon64 x86-32  1.0GHz AMD Athlon64 2800+     2GB  Ubuntu 7.10, Linux 2.6.22 (32-bit)
Core 2 Duo       1x2 2.33GHz Intel Core 2 Duo  1GB  Mac OS X 10.4.10
Opteron x86-32   1.4GHz AMD Opteron 240        1GB  Ubuntu 7.10, Linux 2.6.22 (32-bit)
Opteron x86-64   1.4GHz AMD Opteron 240        1GB  Ubuntu 7.10, Linux 2.6.22 (64-bit)
Pentium 4        3.06GHz Intel Pentium 4       2GB  Ubuntu 7.10, Linux 2.6.22
Pentium M        1.0GHz Intel Pentium M        1GB  Ubuntu 7.04, Linux 2.6.10
Xeon             2x2 3.06GHz Intel Xeon        2GB  Debian 3.1, Linux 2.6.18
Figure 4: Systems used during vx32 evaluation. The two Opteron listings are a single machine running different operating systems.
The notation 1x2 indicates a single-processor machine with two cores. All benchmarks used gcc 4.1.2.
Figure 5: Normalized run times for microbenchmarks running under vx32. Each bar plots run time using vx32 divided by run
time for the same benchmark running natively (smaller bars mark faster vx32 runs). The benchmarks are described in Section 4.2.
Results for the Intel Xeon matched the Pentium 4 almost exactly and are omitted for space reasons.
Figure 6: Normalized run times for SPEC CPU2006 benchmarks running under vx32. Each bar plots run time using vx32 divided
by run time for the same benchmark running natively (smaller bars mark faster vx32 runs). The left three benchmarks use fewer
indirect branches than the right four, resulting in less vx32 overhead. The results are discussed further in Section 4.3.
Figure 7: Normalized run times for SPEC CPU2006 benchmarks running in four configurations on the same AMD Opteron system:
natively on 32-bit Linux, under vx32 hosted by 32-bit Linux, natively on 64-bit Linux, and under vx32 hosted by 64-bit Linux.
Each bar plots run time divided by run time for the same benchmark running natively on 32-bit Linux (smaller bars mark faster
runs). Vx32 performance is independent of the host operating system's choice of processor mode, because vx32 always runs guest
code in 32-bit mode. The results are discussed further in Section 4.3.
On three of the seven benchmarks, vx32 incurs a performance penalty of less than 10%, yet on the other four, the
penalty is 50% or more. The difference between these
two groups is the relative frequency of indirect branches,
which, as discussed in Section 3, are the most expensive
kind of instruction that vx32 must handle.
Figure 8 shows the percentage of indirect branches retired by our Pentium 4 system during each SPEC benchmark, obtained via the CPU's performance counters [21].
The benchmarks that exhibit a high percentage of indirect call, jump, and return instructions are precisely those
that suffer a high performance penalty under vx32.
We also examined vx32's performance running under
a 32-bit host operating system compared to a 64-bit host
operating system. Figure 7 graphs the results. Even
under a 64-bit operating system, the processor switches
to 32-bit mode when executing vx32's 32-bit code segments, so vx32's execution time is essentially identical
in each case. Native 64-bit performance often differs
from 32-bit performance, however: the x86-64 architecture's eight additional general-purpose registers can improve performance by requiring less register spilling in
compiled code, but its larger pointer size can hurt performance by decreasing cache locality, and the balance
between these factors depends on the workload.
5 Applications
In addition to evaluating vx32 in isolation, we evaluated
vx32 in the context of several applications built using
it. This section evaluates the performance of these applications, but equally important is the ability to create
them in the first place: vx32 makes it possible to create
interesting new applications that execute untrusted x86
code on legacy operating systems without kernel modifications, at only a modest performance cost.
5.1 Archival storage
VXA [13] is an archival storage system that uses vx32 to
future-proof compressed data archives against changes
in data compression formats. Data compression algorithms evolve much more rapidly than processor architectures, so VXA packages executable decoders into the
compressed archives along with the compressed data itself. Unpacking the archive in the future then depends
only on being able to run on (or simulate) an x86 processor, not on having the original codecs used to compress the data and being able to run them natively on the
latest operating systems. Crucially, archival storage systems need to be efficiently usable now as well as in the
future: if future-proofing an archive using sandboxed
decoders costs too much performance in the short term,
the archive system is unlikely to be used except by professional archivists.
VXA uses vx32 to implement a minimal system call
API (read, write, exit, sbrk). Vx32 provides exactly
what the archiver needs: it protects the host from buggy
or malicious archives, it isolates the decoders from the
host's system call API so that archives are portable across
operating systems and OS versions, and it executes decoders efficiently enough that VXA can be used as a
general-purpose archival storage system without noticeable slowdown. To ensure that VXA decoders behave
identically on all platforms, VXA instructs vx32 to disable inexact instructions like the 387 intrinsics whose
precise results vary from one processor to another; VXA
decoders simply use SSE and math library equivalents.
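A host built this way is essentially a loop around vx32 that fields each virtual trap and interprets it as a system call. The sketch below shows the shape of such a loop for VXA's four-call API; the vxproc structure, vx32_run entry point, trap code, and syscall numbers are placeholders, not the actual libvx32 interface:

    #include <stdint.h>

    struct vxproc { struct { uint32_t eax, ebx; } regs; };
    extern int vx32_run(struct vxproc *p);  /* placeholder for libvx32 */

    enum {
        VXTRAP_SYSCALL = 0x230,  /* int 0x30, as in Figure 3 */
        GUEST_READ = 0, GUEST_WRITE = 1,    /* hypothetical numbers */
        GUEST_SBRK = 2, GUEST_EXIT = 3,
    };

    int run_guest(struct vxproc *p)
    {
        for (;;) {
            if (vx32_run(p) != VXTRAP_SYSCALL)
                return -1;              /* fault: terminate the guest */
            switch (p->regs.eax) {      /* guest's syscall number */
            case GUEST_READ:
            case GUEST_WRITE:
                /* check that the guest buffer lies inside the guest
                 * data segment, then call read(2)/write(2) on it */
                break;
            case GUEST_SBRK:
                /* grow the guest data segment */
                break;
            case GUEST_EXIT:
                return p->regs.ebx;     /* guest exit status */
            default:
                return -1;              /* unknown call */
            }
        }
    }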
Figure 9 shows the performance of vx32-based decoders compared to native ones on the four test architectures. All run within 30% of native performance, often much closer. The jpeg decoder is consistently faster
under vx32 than natively, due to better cache locality.
5.2 Extensible public key infrastructure
Alpaca [24] is an extensible public-key infrastructure
(PKI) and authorization framework built on the idea of
proof-carrying authorization (PCA) [3], in which one
party authenticates itself to another by using an explicit
logical language to prove that it deserves a particular
kind of access or is authorized to request particular services.
Figure 9: Normalized run times for VXA decoders running under vx32. Each bar plots run time using vx32 divided by run time
for the same benchmark running natively (smaller bars mark faster vx32 runs). Section 5.1 gives more details. The jpeg test runs
faster because the vx32 translation has better cache locality than the original code.
Figure 10: Normalized run times for cryptographic hash functions running under vx32. Each bar plots run time using vx32 divided
by run time for the same benchmark running natively (smaller bars mark faster runs).
Figure 11: Normalized run times for simple Plan 9 benchmarks. The four bars correspond to Plan 9 running natively, Plan 9 VX,
Plan 9 under VMware Workstation 6.0.2 on Linux, and Plan 9 under QEMU on Linux using the kqemu kernel extension. Each
bar plots run time divided by the native Plan 9 run time (smaller bars mark faster runs). The tests are: swtch, a system call that
reschedules the current process, causing a context switch (sleep(0)); pipe-byte, two processes sending a single byte back and forth
over a pair of pipes; pipe-bulk, two processes (one sender, one receiver) transferring bulk data over a pipe; rdwr, a single process
copying from /dev/zero to /dev/null; sha1zero, a single process reading /dev/zero and computing its SHA1 hash; du, a single
process traversing the file system; and mk, building a Plan 9 kernel. See Section 5.3 for performance explanations.
6 Conclusion
Vx32 is a multipurpose user-level sandbox that enables
any application to load and safely execute one or more
guest plug-ins, confining each guest to a system call
API controlled by the host application and to a restricted
memory region within the host's address space. It executes sandboxed code efficiently on x86 machines by combining the x86's segmentation hardware, which isolates memory accesses, with dynamic code translation, which disallows unsafe instructions.
Vx32's ability to sandbox untrusted code efficiently
has enabled a variety of interesting applications: self-extracting archival storage, extensible public-key infrastructure, a user-level operating system, and portable or
restricted execution environments. Because vx32 works
on widely-used x86 operating systems without kernel
modifications, these applications are easy to deploy.
In the context of these applications (and also on the
SPEC CPU2006 benchmark suite), vx32 always delivers sandboxed execution performance within a factor of
two of native execution. Many programs execute within
10% of the performance of native execution, and some
programs execute faster under vx32 than natively.
Acknowledgments
Chris Lesniewski-Laas is the primary author of Alpaca.
We thank Austin Clements, Stephen McCamant, and the
anonymous reviewers for valuable feedback. This research is sponsored by the T-Party Project, a joint research program between MIT and Quanta Computer Inc.,
Taiwan, and by the National Science Foundation under
FIND project 0627065 (User Information Architecture).
References
[1] Keith Adams and Ole Agesen. A comparison of software
and hardware techniques for x86 virtualization. In 12th ASPLOS, October 2006.
[2] Advanced Micro Devices, Inc. AMD x86-64 architecture
programmer's manual, September 2002.
[3] Andrew W. Appel and Edward W. Felten. Proof-carrying
authentication. In 6th ACM CCS, November 1999.
[4] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. Dynamo: a transparent dynamic optimization system.
ACM SIGPLAN Notices, 35(5):1–12, 2000.
[5] Brian N. Bershad et al. Extensibility, safety and performance in the SPIN operating system. In 15th SOSP, 1995.
[6] Brian Case. Implementing the Java virtual machine. Microprocessor Report, 10(4):12–17, March 1996.
[7] Suresh N. Chari and Pau-Chen Cheng. BlueBox: A
policy-driven, host-based intrusion detection system. In
Network and Distributed System Security, February 2002.
[8] Tzi-cker Chiueh, Ganesh Venkitachalam, and Prashant
Pradhan. Integrating segmentation and paging protection
for safe, efficient and transparent software extensions. In
17th SOSP, pages 140–153, December 1999.
[9] Bob Cmelik and David Keppel. Shade: A fast instruction-set simulator for execution profiling. SIGMETRICS PER,
22(1):128–137, May 1994.
[10] R. J. Creasy. The origin of the VM/370 time-sharing
system. IBM Journal of Research and Development,
25(5):483–490, 1981.
[11] L. Peter Deutsch and Allan M. Schiffman. Efficient implementation of the Smalltalk-80 system. In Principles of
Programming Languages, pages 297–302, Salt Lake City,
UT, January 1984.
[12] D. Eastlake 3rd and T. Hansen. US secure hash algorithms
(SHA and HMAC-SHA), July 2006. RFC 4634.
[13] Bryan Ford. VXA: A virtual architecture for durable compressed archives. In 4th USENIX FAST, San Francisco,
CA, December 2005.
[14] Bryan Ford, Mike Hibler, Jay Lepreau, Patrick Tullmann,
Godmar Back, and Stephen Clawson. Microkernels meet
recursive virtual machines. In 2nd OSDI, pages 137–151,
1996.
[15] Timothy Fraser, Lee Badger, and Mark Feldman. Hardening COTS software with generic software wrappers. In
IEEE Symposium on Security and Privacy, pages 2–16,
1999.
[16] Tal Garfinkel. Traps and pitfalls: Practical problems in
system call interposition based security tools. In Network
and Distributed System Security, February 2003.
[17] Tal Garfinkel, Ben Pfaff, and Mendel Rosenblum. Ostia:
A delegating architecture for secure system call interposition. In Network and Distributed System Security, February 2004.
[18] Douglas P. Ghormley, David Petrou, Steven H. Rodrigues, and Thomas E. Anderson. SLIC: An extensibility system for commodity operating systems. In USENIX,
June 1998.
[19] Ian Goldberg, David Wagner, Randi Thomas, and Eric A.
Brewer. A secure environment for untrusted helper applications. In 6th USENIX Security Symposium, San Jose,
CA, 1996.
[20] Honeywell Inc. GCOS Environment Simulator. December 1983. Order Number AN05-02A.
[21] Intel Corporation. IA-32 Intel architecture software developer's manual, June 2005.
[22] K. Jain and R. Sekar. User-level infrastructure for system
call interposition: A platform for intrusion detection and
confinement. In Network and Distributed System Security, February 2000.
[23] Andreas Krall. Efficient JavaVM just-in-time compilation. In Parallel Architectures and Compilation Techniques, pages 54–61, Paris, France, October 1998.
[24] Christopher Lesniewski-Laas, Bryan Ford, Jacob Strauss, M. Frans Kaashoek, and Robert Morris. Alpaca: extensible authorization for distributed services. In ACM Computer and Communications Security, October 2007.
[25] Henry M. Levy. Capability-based Computer Systems. Digital Press, 1984.
[26] Jochen Liedtke. A persistent system in real use: experiences of the first 13 years. In IWOOOS, 1993.
[27] Jochen Liedtke. On micro-kernel construction. In 15th SOSP, 1995.
[28] Chi-Keung Luk et al. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI, June 2005.
[29] Stephen McCamant and Greg Morrisett. Evaluating SFI for a CISC architecture. In 15th USENIX Security Symposium, August 2006.
[30] Microsoft Corporation. C# language specification, version 3.0, 2007.
[31] Jeffrey C. Mogul, Richard F. Rashid, and Michael J. Accetta. The packet filter: An efficient mechanism for user-level network code. In Symposium on Operating System Principles, pages 39–51, Austin, TX, November 1987.
[32] George C. Necula and Peter Lee. Safe kernel extensions without run-time checking. In 2nd OSDI, pages 229–243, 1996.
[33] Nicholas Nethercote and Julian Seward. Valgrind: A program supervision framework. In Third Workshop on Runtime Verification (RV'03), Boulder, CO, July 2003.
[34] Nicholas Nethercote and Julian Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In PLDI, June 2007.
[35] Rob Pike et al. Plan 9 from Bell Labs. Computing Systems, 8(3):221–254, Summer 1995.
[36] Niels Provos. Improving host security with system call policies. In 12th USENIX Security Symposium, August 2003.
[37] K. Scott et al. Overhead reduction techniques for software dynamic translation. In NSF Workshop on Next Generation Software, April 2004.
[38] Richard L. Sites, Anton Chernoff, Matthew B. Kirk, Maurice P. Marks, and Scott G. Robinson. Binary translation. Communications of the ACM, 36(2):69–81, 1993.
[39] Christopher Small and Margo Seltzer. MiSFIT: Constructing safe extensible systems. IEEE Concurrency, 6(3):34–41, 1998.
[40] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the reliability of commodity operating systems. In 19th ACM SOSP, 2003.
[41] Tool Interface Standard (TIS) Committee. Executable and linking format (ELF) specification, May 1995.
[42] Robert Wahbe, Steven Lucco, Thomas E. Anderson, and Susan L. Graham. Efficient software-based fault isolation. ACM SIGOPS Operating Systems Review, 27(5):203–216, December 1993.
[43] Robert N. M. Watson. Exploiting concurrency vulnerabilities in system call wrappers. In 1st USENIX Workshop on Offensive Technologies, August 2007.
[44] Emmett Witchel and Mendel Rosenblum. Embra: Fast and flexible machine simulation. In Measurement and Modeling of Computer Systems, pages 68–79, 1996.