Exokernel: An Operating System Architecture For Application-Level Resource Management
Traditional operating systems limit the performance, flexibility, and
functionality of applications by fixing the interface and implementation of operating system abstractions such as interprocess communication and virtual memory. The exokernel operating system
architecture addresses this problem by providing application-level
management of physical resources. In the exokernel architecture, a
small kernel securely exports all hardware resources through a low-level interface to untrusted library operating systems. Library operating systems use this interface to implement system objects and
policies. This separation of resource protection from management
allows application-specific customization of traditional operating
system abstractions by extending, specializing, or even replacing
libraries.
We have implemented a prototype exokernel operating system.
Measurements show that most primitive kernel operations (such
as exception handling and protected control transfer) are ten to 100
times faster than in Ultrix, a mature monolithic UNIX operating system. In addition, we demonstrate that an exokernel allows applications to control machine resources in ways not possible in traditional
operating systems. For instance, virtual memory and interprocess
communication abstractions are implemented entirely within an
application-level library. Measurements show that application-level
virtual memory and interprocess communication primitives are five
to 40 times faster than Ultrix's kernel primitives. Compared to
state-of-the-art implementations from the literature, the prototype
exokernel system is at least five times faster on operations such as
exception dispatching and interprocess communication.
inappropriate for three main reasons: it denies applications the advantages of domain-specific optimizations, it discourages changes
to the implementations of existing abstractions, and it restricts the
flexibility of application builders, since new abstractions can only
be added by awkward emulation on top of existing ones (if they can
be added at all).
We believe these problems can be solved through application-level (i.e., untrusted) resource management. To this end, we have
designed a new operating system architecture, exokernel, in which
traditional operating system abstractions, such as virtual memory
(VM) and interprocess communication (IPC), are implemented entirely at application level by untrusted software. In this architecture,
a minimal kernel, which we call an exokernel, securely multiplexes available hardware resources. Library operating systems,
working above the exokernel interface, implement higher-level abstractions. Application writers select libraries or implement their
own. New implementations of library operating systems are incorporated by simply relinking application executables.
Substantial evidence exists that applications can benefit greatly
from having more control over how machine resources are used
to implement higher-level abstractions. Appel and Li [5] reported
that the high cost of general-purpose virtual memory primitives
reduces the performance of persistent stores, garbage collectors, and
distributed shared memory systems. Cao et al. [10] reported that
application-level control over file caching can reduce application
running time by 45%. Harty and Cheriton [26] and Krueger et
al. [30] showed how application-specific virtual memory policies
can increase application performance. Stonebraker [47] argued
that inappropriate file-system implementation decisions can have a
dramatic impact on the performance of databases. Thekkath and
Levy [50] demonstrated that exceptions can be made an order of
magnitude faster by deferring signal handling to applications.
To provide applications control over machine resources, an exokernel defines a low-level interface. The exokernel architecture is
founded on and motivated by a single, simple, and old observation:
the lower the level of a primitive, the more efficiently it can be
implemented, and the more latitude it grants to implementors of
higher-level abstractions.
To provide an interface that is as low-level as possible (ideally,
just the hardware interface), an exokernel designer has a single
overriding goal: to separate protection from management. For
instance, an exokernel should protect framebuffers without understanding windowing systems and disks without understanding file
systems. One approach is to give each application its own virtual
machine [17]. As we discuss in Section 8, virtual machines can
have severe performance penalties. Therefore, an exokernel uses a
different approach: it exports hardware resources rather than emulating them, which allows an efficient and simple implementation.
An exokernel employs three techniques to export resources securely.
First, by using secure bindings, applications can securely bind to
machine resources and handle events. Second, by using visible resource revocation, applications participate in the protocol by which resources are reclaimed. Third, by using an abort protocol, an exokernel can break the secure bindings of uncooperative applications by force.
Fixed high-level abstractions limit the functionality of applications, because they are the only available interface between applications and hardware resources. Because all applications must
share one set of abstractions, changes to these abstractions occur
rarely, if ever. This may explain why few good ideas from the
last decade of operating systems research have been adopted into
widespread use: how many production operating systems support
scheduler activations [4], multiple protection domains within a single address space [11], efficient IPC [33], or efficient and flexible
virtual memory primitives [5, 26, 30]?
Fixed high-level abstractions hide information from applications. For instance, most current systems do not make low-level
exceptions, timer interrupts, or raw device I/O directly available to
application-level software. Unfortunately, hiding this information
makes it difficult or impossible for applications to implement their
own resource management abstractions. For example, database implementors must struggle to emulate random-access record storage
on top of file systems [47]. As another example, implementing
lightweight threads on top of heavyweight processes usually requires compromises in correctness and performance, because the
operating system hides page faults and timer interrupts [4]. In such
cases, application complexity increases because of the difficulty of
getting good performance from high-level abstractions.
Fixed high-level abstractions hurt application performance because there is no single way to abstract physical resources or to
implement an abstraction that is best for all applications. In implementing an abstraction, an operating system is forced to make
trade-offs between support for sparse or dense address spaces,
read-intensive or write-intensive workloads, etc. Any such tradeoff penalizes some class of applications. For example, relational
databases and garbage collectors sometimes have very predictable
data access patterns, and their performance suffers when a general-purpose page replacement strategy such as LRU is imposed by
the operating system. The performance improvements of such
application-specific policies can be substantial; Cao et al. [10] measured that application-controlled file caching can reduce application
running time by as much as 45%.
In practice, our prototype exokernel system provides applications with greater flexibility and better performance than monolithic and microkernel systems. Aegis's low-level interface allows
application-level software such as ExOS to manipulate resources
very efficiently. Aegis's protected control transfer is almost seven
times faster than the best reported implementation [33]. Aegis's
exception dispatch is five times faster than the best reported implementation [50]. On identical hardware, Aegis's exception dispatch
and control transfer are roughly two orders of magnitude faster than
in Ultrix 4.2, a mature monolithic system.
Traditionally, operating systems have centralized resource management via a set of abstractions that cannot be specialized, extended,
or replaced. Whether provided by the kernel or by trusted user-level servers (as in microkernel-based systems), these abstractions
are implemented by privileged software that must be used by all
applications, and therefore cannot be changed by untrusted software. Typically, the abstractions include processes, files, address
spaces, and interprocess communication. In this section we discuss
the problems with general-purpose implementations of these abstractions.
[Figure 1: Applications (e.g., DSM, WWW, POSIX, Barnes-Hut) link against library operating systems, which implement abstractions such as VM, IPC, and TCP; the library operating systems run above the exokernel, which securely multiplexes the hardware (TLB, network, memory, disk) through traps and secure bindings.]
To provide the maximum opportunity for application-level resource management, the exokernel architecture consists of a thin
exokernel veneer that multiplexes and exports physical resources
securely through a set of low-level primitives. Library operating
systems, which use the low-level exokernel interface, implement
higher-level abstractions and can define special-purpose implementations that best meet the performance and functionality goals of
applications (see Figure 1). (For brevity, we sometimes refer to a
library operating system as an "application.") This structure allows
the extension, specialization and even replacement of abstractions.
For instance, page-table structures can vary among library operating systems: an application can select a library with a particular
implementation of a page table that is most suitable to its needs.
To the best of our knowledge, no other secure operating system
architecture allows applications so much useful freedom.
This paper demonstrates that the exokernel architecture is an effective way to address the problems listed in Section 2.1. Many of
these problems are solved by simply moving the implementation of
abstractions to application level, since conflicts between application
needs and available abstractions can then be resolved without the
intervention of kernel architects. Furthermore, secure multiplexing
does not require complex algorithms; it mostly requires tables to
track ownership. Therefore, the implementation of an exokernel can
be simple. A simple kernel improves reliability and ease of maintenance, consumes few resources, and enables quick adaptation to
new requirements (e.g., gigabit networking). Additionally, as is
true with RISC instructions, the simplicity of exokernel operations
allows them to be implemented efficiently.
An exokernel specifies the details of the interface that library operating systems use to claim, release, and use machine resources.
This section articulates some of the principles that have guided
our efforts to design an exokernel interface that provides library
operating systems the maximum degree of control.
Securely expose hardware. The central tenet of the exokernel architecture is that the kernel should provide secure low-level
primitives that allow all hardware resources to be accessed as directly as possible. An exokernel designer therefore strives to safely
export all privileged instructions, hardware DMA capabilities, and
machine resources. The resources exported are those provided by
the underlying hardware: physical memory, the CPU, disk memory,
translation look-aside buffer (TLB), and addressing context identifiers. This principle extends to less tangible machine resources
such as interrupts, exceptions, and cross-domain calls. An exokernel should not impose higher-level abstractions on these events
(e.g., Unix signal or RPC semantics). For improved flexibility,
most physical resources should be finely subdivided. The number, format, and current set of TLB mappings should be visible to
and replaceable by library operating systems, as should any privileged co-processor state. An exokernel must export privileged
instructions to library operating systems to enable them to implement traditional operating system abstractions such as processes
and address spaces. Each exported operation can be encapsulated
within a system call that checks the ownership of any resources
involved.
A secure binding is a protection mechanism that decouples authorization from the actual use of a resource. Secure bindings
improve performance in two ways. First, the protection checks
involved in enforcing a secure binding are expressed in terms of
simple operations that the kernel (or hardware) can implement
quickly. Second, a secure binding performs authorization only at
bind time, which allows management to be decoupled from protection. Application-level software is responsible for many resources
with complex semantics (e.g., network connections). By isolating
the need to understand these semantics to bind time, the kernel can
efficiently implement access checks at access time without understanding them. Simply put, a secure binding allows the kernel to
protect resources without understanding them.
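As a concrete illustration, consider how little kernel state and code a secure binding to a physical page needs. The sketch below is a minimal sketch in C, not Aegis's actual interface: the table layout and function names are assumptions. The owner and its capabilities are recorded at bind (allocation) time; the access-time check, performed for example when a library operating system asks to install a mapping, is just a table lookup and comparison that requires no understanding of what the page is used for.

#include <stdint.h>

/* Hypothetical kernel table of secure bindings for physical pages. */
#define NPAGES 4096

struct page_binding {
    int      allocated;      /* is this physical page bound? */
    int      owner_env;      /* environment that allocated it */
    uint64_t read_cap;       /* capability required for read mappings */
    uint64_t write_cap;      /* capability required for write mappings */
};

static struct page_binding bindings[NPAGES];

/* Bind time: record owner and capabilities (authorization happens here). */
int bind_page(int env, unsigned pfn, uint64_t rcap, uint64_t wcap)
{
    if (pfn >= NPAGES || bindings[pfn].allocated)
        return -1;
    bindings[pfn] = (struct page_binding){ 1, env, rcap, wcap };
    return 0;
}

/* Access time: a simple check the kernel can perform quickly. */
int check_access(unsigned pfn, uint64_t cap, int writable)
{
    if (pfn >= NPAGES || !bindings[pfn].allocated)
        return 0;
    return writable ? cap == bindings[pfn].write_cap
                    : cap == bindings[pfn].read_cap;
}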
Expose allocation. An exokernel should allow library operating systems to request specific physical resources. For instance, if a
library operating system can request specific physical pages, it can
reduce cache conflicts among the pages in its working set [29]. Furthermore, resources should not be implicitly allocated; the library
operating system should participate in every allocation decision.
The next principle aids the effectiveness of this participation.
An exokernel hands over resource policy decisions to library operating systems. Using this control over resources, an application
or collection of cooperating applications can make decisions about
how best to use these resources. However, as in all systems, an exokernel must include policy to arbitrate between competing library
operating systems: it must determine the absolute importance of
different applications, their share of resources, etc. This situation
is no different than in traditional kernels. Appropriate mechanisms
are determined more by the environment than by the operating
system architecture. For instance, while an exokernel cedes management of resources over to library operating systems, it controls
the allocation and revocation of these resources. By deciding which
translations that do not fit in the hardware TLB. The software TLB
can be viewed as a cache of frequently-used secure bindings.
Our prototype exokernel uses packet filters, because our current network does not provide hardware mechanisms for message
demultiplexing. One challenge with a language-based approach is
to make running filters fast. Traditionally, packet filters have been
interpreted, making them less efficient than in-kernel demultiplexing routines. One of the distinguishing features of the packet filter
engine used by our prototype exokernel is that it compiles packet
filters to machine code at runtime, increasing demultiplexing performance by more than an order of magnitude [22].
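To make the demultiplexing step concrete, the predicate below shows the kind of check a packet filter reduces to; an engine like the one in our prototype compiles a declarative description of such comparisons into machine code at runtime. The header offsets and the choice of a UDP port match are illustrative assumptions, not the prototype's filter language.

#include <stddef.h>
#include <stdint.h>

/* Illustrative demultiplexing predicate: does this packet carry a UDP
 * datagram for the given destination port?  Offsets assume an Ethernet
 * frame carrying IPv4 with no options (a simplifying assumption). */
static int udp_filter(const uint8_t *pkt, size_t len, uint16_t port)
{
    if (len < 14 + 20 + 8)
        return 0;                                 /* too short to match */
    if (pkt[12] != 0x08 || pkt[13] != 0x00)       /* EtherType == IPv4? */
        return 0;
    if ((pkt[14] >> 4) != 4 || (pkt[14] & 0x0f) != 5) /* IPv4, 20-byte header */
        return 0;
    if (pkt[23] != 17)                            /* IP protocol == UDP? */
        return 0;
    uint16_t dport = (uint16_t)((pkt[36] << 8) | pkt[37]); /* UDP dest port */
    return dport == port;
}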
The one problem with the use of a packet filter is ensuring
that a filter does not lie and accept packets destined to another
process. Simple security precautions such as only allowing a trusted
server to install filters can be used to address this problem. On a
system that assumes no malicious processes, our language is simple
enough that in many cases even the use of a trusted server can be
avoided by statically checking a new filter to ensure that it cannot
accept packets belonging to another process; by avoiding the use of any
central authority, extensibility is increased.
Secure bindings to physical memory are implemented in our prototype exokernel using self-authenticating capabilities [12] and address translation hardware. When a library operating system allocates a physical memory page, the exokernel creates a secure
binding for that page by recording the owner and the read and write
capabilities specified by the library operating system. The owner
of a page has the power to change the capabilities associated with
it and to deallocate it.
To ensure protection, the exokernel guards every access to a
physical memory page by requiring that the capability be presented
by the library operating system requesting access. If the capability is
insufficient, the request is denied. Typically, the processor contains
a TLB, and the exokernel must check memory capabilities when a
library operating system attempts to enter a new virtual-to-physical
mapping. To improve library operating system performance by
reducing the number of times secure bindings must be established,
an exokernel may cache virtual-to-physical mappings in a large
software TLB.
Application-specific Safe Handlers (ASHs) are a more interesting example of downloading code into our prototype exokernel.
These application handlers can be downloaded into the kernel to
participate in message processing. An ASH is associated with a
packet filter and runs on packet reception. One of the key features
of an ASH is that it can initiate a message. Using this feature,
roundtrip latency can be greatly reduced, since replies can be transmitted on the spot instead of being deferred until the application
is scheduled. ASHs have a number of other useful features (see
Section 6).
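The following sketch conveys why message initiation matters for latency: the reply is constructed and sent from the reception path itself, before the application is ever scheduled. ash_transmit() and the request format here are hypothetical stand-ins, not the actual ASH interface.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ping {                 /* trivial request/reply format (illustrative) */
    uint32_t seq;
    uint32_t is_reply;
};

static void ash_transmit(const void *pkt, size_t len)
{
    (void)pkt; (void)len;     /* stub standing in for the kernel transmit primitive */
}

/* Handler associated with a packet filter; runs on packet reception. */
void ping_ash(const void *pkt, size_t len)
{
    struct ping msg;
    if (len < sizeof msg)
        return;
    memcpy(&msg, pkt, sizeof msg);
    if (msg.is_reply)                    /* replies are left for the application */
        return;
    msg.is_reply = 1;                    /* turn the request around immediately */
    ash_transmit(&msg, sizeof msg);      /* reply sent without waiting to be scheduled */
}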
To break a secure binding, an exokernel must change the associated capabilities and mark the resource as free. In the case of
physical memory, an exokernel would flush all TLB mappings and
any queued DMA requests.
Multiplexing the network efficiently is challenging, since protocol-specific knowledge is required to interpret the contents of incoming
messages and identify the intended recipient.
One possible abort protocol is to simply kill any library operating system and its associated application that fails to respond
quickly to revocation requests. We rejected this method because
we believe that most programmers have great difficulty reasoning
about hard real-time bounds. Instead, if a library operating system
fails to comply with the revocation protocol, an exokernel simply
breaks all existing secure bindings to the resource and informs the
library operating system.
where merging a lower-level, imperative language would be infeasible. However, in cases where such optimizations are not done,
(e.g., in an exception handler) a low-level language is more in keeping with the exokernel philosophy: it allows the broadest range
of application-level languages to be targeted to it and the simplest
implementation. ASHs are another example of this tradeoff: most
ASHs are imported into the kernel in the form of the object code
of the underlying machine; however, in the few key places where
higher-level semantics are useful, we have extended the instruction
set of the machine.
The use of physical resource names requires that an exokernel reveal each revocation to the relevant library operating system so that
it can relocate its physical names. For instance, a library operating
system that relinquishes physical page 5 should update any of its
page-table entries that refer to this page. This is easy for a library
operating system to do when it deallocates a resource in reaction to
an exokernel revocation request. An abort protocol (discussed below) allows relocation to be performed when an exokernel forcibly
reclaims a resource.
We view the revocation process as a dialogue between an exokernel and a library operating system. Library operating systems
should organize resource lists so that resources can be deallocated
quickly. For example, a library operating system could have a simple vector of physical pages that it owns: when the kernel indicates
that some page should be deallocated, the library operating system
selects one of its pages, writes it to disk, and frees it.
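A minimal sketch of that bookkeeping follows; write_page_to_disk() and exokernel_dealloc_page() are placeholders standing in for the real ExOS and exokernel interfaces.

#include <stddef.h>

/* A library operating system keeps a simple vector of the physical pages
 * it owns so it can answer revocation requests quickly. */
#define MAX_PAGES 1024

static unsigned owned_pages[MAX_PAGES];
static size_t   n_owned;

static void write_page_to_disk(unsigned pfn)    { (void)pfn; /* stub */ }
static void exokernel_dealloc_page(unsigned pfn) { (void)pfn; /* stub */ }

/* Called when the exokernel asks the library OS to give back one page. */
void on_revocation_request(void)
{
    if (n_owned == 0)
        return;                                 /* nothing to give back */
    unsigned victim = owned_pages[--n_owned];   /* pick a page (here: the last) */
    write_page_to_disk(victim);                 /* preserve its contents */
    exokernel_dealloc_page(victim);             /* return it to the exokernel */
}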
We have implemented two software systems that follow the exokernel architecture: Aegis, an exokernel, and ExOS, a library operating
system. Another prototype exokernel, Glaze, is being built for an
experimental SPARC-based shared-memory multiprocessor [35],
along with PhOS, a parallel operating system library.
An exokernel must also be able to take resources from library operating systems that fail to respond satisfactorily to revocation requests.
An exokernel can define a second stage of the revocation protocol
in which the revocation request ("please return a memory page")
becomes an imperative ("return a page within 50 microseconds").
However, if a library operating system fails to respond quickly, the
secure bindings need to be broken by force. The actions taken
when a library operating system is recalcitrant are defined by the
abort protocol.
Machine                 Processor   SPEC rating      MIPS
DEC2100 (12.5 MHz)      R2000       8.7 SPECint89    11
DEC3100 (16.67 MHz)     R3000       11.8 SPECint89   15
DEC5000/125 (25 MHz)    R3000       16.1 SPECint92   25

Table 1: Machine configurations used in the experiments.

System call   Description
Yield         Yield processor to named process
Scall         Synchronous protected control transfer
Acall         Asynchronous protected control transfer
Alloc         Allocation of resources (e.g., physical page)
Dealloc       Deallocation of resources

Table 2: A sample of Aegis system calls.

Primitive operation   Description
TLBwr                 Insert mapping into TLB
FPUmod                Enable/disable FPU
CIDswitch             Install context identifier
TLBvadelete           Delete virtual address from TLB

Table 3: A sample of Aegis primitive operations.
In addition, we attempt to assess Aegis's and ExOS's performance in the light of recent advances in operating systems research.
These advances have typically been evaluated on different hardware
and frequently use experimental software, making head-to-head
comparisons impossible. In these cases we base our comparisons
on relative SPECint ratings and instruction counts.
Table 1 shows the specific machine configurations used in the experiments. For brevity, we refer to the DEC5000/125 as DEC5000.
The three machine configurations are used to get a tentative measure of the scalability of Aegis. All times are measured using the
wall-clock. We used clock on the Unix implementations and
a microsecond counter on Aegis. Aegis's time quantum was set
at 15.625 milliseconds. All benchmarks were compiled using an
identical compiler and flags: gcc version 2.6.0 with optimization
flags -O2. None of the benchmarks use floating-point instructions; therefore, we do not save floating-point state. Both systems
were run in single-user mode and were isolated from the network.
Timer interrupts denote the beginning and end of time slices, and
are delivered in a manner similar to exceptions (discussed below): a
register is saved in the interrupt save area, the exception program
counter is loaded, and Aegis jumps to user-specified interrupt handling code with interrupts re-enabled. The application's handlers
are responsible for general-purpose context switching: saving and
restoring live registers, releasing locks, etc. This framework gives
applications a large degree of control over context switching. For
example, it can be used to implement scheduler activations [4].
Fairness is achieved by bounding the time an application takes to
save its context: each subsequent timer interrupt (which demarcates
a time slice) is recorded in an excess time counter. Applications pay
Machine   OS       Procedure call   Syscall (getpid)
DEC2100   Ultrix   0.57             32.2
DEC2100   Aegis    0.56             3.2 / 4.7
DEC3100   Ultrix   0.42             33.7
DEC3100   Aegis    0.42             2.9 / 3.5
DEC5000   Ultrix   0.28             21.3
DEC5000   Aegis    0.28             1.6 / 2.3

Table 4: Time to perform null procedure and system calls; times are in microseconds.
An Aegis processor environment is a structure that stores the information needed to deliver events to applications. All resource
consumption is associated with an environment because Aegis must
deliver events associated with a resource (such as revocation exceptions) to its designated owner.
Machine   OS       unalign   overflow   coproc   prot
DEC2100   Ultrix   n/a       208.0      n/a      238.0
DEC2100   Aegis    2.8       2.8        2.8      3.0
DEC3100   Ultrix   n/a       151.0      n/a      177.0
DEC3100   Aegis    2.1       2.1        2.1      2.3
DEC5000   Ultrix   n/a       130.0      n/a      154.0
DEC5000   Aegis    1.5       1.5        1.5      1.5

Table 5: Time to dispatch exceptions in Aegis and Ultrix; times are in microseconds.

Four kinds of events are delivered by Aegis: exceptions, interrupts, protected control transfers, and address translations. Processor environments contain the four contexts required to support these
events:
Interrupt context: for each interrupt, an interrupt context includes a program counter and a register-save region. In the case
of timer interrupts, the interrupt context specifies separate program
counters for start-time-slice and end-time-slice cases, as well as
status register values that control co-processor and interrupt-enable
flags.
These are the event-handling contexts required to define a process. Each context depends on the others for validity: for example,
an addressing context does not make sense without an exception
context, since it does not define any action to take when an exception or interrupt occurs.
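The sketch below gives one plausible layout for such an environment. Only the four context names come from the text; the fields within each context (other than the interrupt context described above) are illustrative assumptions rather than the actual Aegis structure.

#include <stdint.h>

#define NREGS 32

struct interrupt_context {
    uintptr_t start_slice_pc;    /* entry point at the start of a time slice */
    uintptr_t end_slice_pc;      /* entry point when the time slice ends */
    uintptr_t status;            /* co-processor / interrupt-enable flags */
    uintptr_t save_area[NREGS];  /* register-save region */
};

struct exception_context {
    uintptr_t handler_pc;        /* application-level exception handler */
    uintptr_t save_area[NREGS];
};

struct entry_context {           /* protected control transfers land here */
    uintptr_t sync_entry_pc;     /* e.g., Scall */
    uintptr_t async_entry_pc;    /* e.g., Acall */
};

struct addressing_context {      /* address translations */
    unsigned asid;               /* address-space (context) identifier */
};

struct environment {
    struct exception_context  exc;
    struct interrupt_context  intr;
    struct entry_context      entry;
    struct addressing_context addr;
};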
The base cost for null procedure and system calls are shown in
Table 4. The null procedure call shows that Aegis's scheduling
flexibility does not add overhead to base operations. Aegis has two
system call paths: the first for system calls that do not require a
stack, the second for those that do. With the exception of protected
control transfers, which are treated as a special case for efficiency,
all Aegis system calls are vectored along one of these two paths.
Ultrix's getpid is approximately an order of magnitude slower
than Aegis's slowest system call path; this suggests that the base
cost of demultiplexing system calls is significantly higher in Ultrix.
Part of the reason Ultrix is so much less efficient on this basic operation is that it performs a more expensive demultiplexing operation.
For example, on a MIPS processor, kernel TLB faults are vectored
general class of exceptions in its exception demultiplexing routine. Fast exceptions enable a number of intriguing applications:
efficient page-protection traps can be used by applications such as
distributed shared memory systems, persistent object stores, and
garbage collectors [5, 50].
Table 6 shows the performance in microseconds of a bare-bones protected control transfer. This time is derived by dividing the time to perform a call and reply in half (i.e., we measure the time to perform a unidirectional control transfer). Since the experiment is intended to measure the cost of protected control transfer only, no registers are saved and restored. However, due to our measurement code, the time includes the overhead of incrementing a counter and performing a branch.

Machine   OS      MHz         Transfer cost
DEC2100   Aegis   12.5 MHz    2.9
DEC3100   Aegis   16.67 MHz   2.2
DEC5000   Aegis   25 MHz      1.4
486       L3      50 MHz      9.3 (normalized)

Table 6: Time to perform a (unidirectional) protected control transfer; times are in microseconds.

The STLB contains 4096 entries of 8 bytes each. It is direct-mapped and resides in unmapped physical memory. An STLB hit takes 18 instructions (approximately one to two microseconds). In contrast, performing an upcall to application level on a TLB miss, followed by a system call to install a new mapping, is at least three to six microseconds more expensive. As dictated by the exokernel principle of exposing kernel bookkeeping structures, the STLB can be mapped using a well-known capability, which allows applications to efficiently probe for entries.
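A direct-mapped probe of this kind is only a few loads and compares. The sketch below assumes a simple two-word entry format (virtual page number plus translation word); the real STLB entry encoding is not shown here.

#include <stdint.h>

/* Sketch of a direct-mapped software TLB: 4096 entries of 8 bytes each,
 * indexed by the low bits of the virtual page number. */
#define STLB_SIZE 4096

struct stlb_entry {
    uint32_t vpn;     /* virtual page number (tag) */
    uint32_t xlate;   /* physical frame number | protection bits; 0 = invalid */
};

static struct stlb_entry stlb[STLB_SIZE];

/* Returns nonzero and fills *xlate on a hit; a miss falls back to an
 * upcall to the library operating system's page table. */
int stlb_probe(uint32_t vpn, uint32_t *xlate)
{
    struct stlb_entry *e = &stlb[vpn & (STLB_SIZE - 1)];
    if (e->xlate != 0 && e->vpn == vpn) {
        *xlate = e->xlate;
        return 1;       /* hit: handled without an upcall */
    }
    return 0;           /* miss */
}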
Aegis's network subsystem uses aggressive dynamic code generation techniques to provide efficient message demultiplexing and handling. We briefly discuss some key features of this system. A complete discussion can be found in [22].

Filter        Classification Time
MPF           35.0
PATHFINDER    19.0
DPF           1.5

Table 7: Time on a DEC5000/200 to classify TCP/IP headers destined for one of ten TCP/IP filters; times are in microseconds.

Machine   OS       pipe    pipe'   shm     lrpc
DEC2100   Ultrix   326.0   n/a     187.0   n/a
DEC2100   ExOS     30.9    24.8    12.4    13.9
DEC3100   Ultrix   243.0   n/a     139.0   n/a
DEC3100   ExOS     22.6    18.6    9.3     10.4
DEC5000   Ultrix   199.0   n/a     118.0   n/a
DEC5000   ExOS     14.2    10.7    5.7     6.3

Table 8: Time for IPC using pipes, shared memory, and LRPC
on ExOS and Ultrix; times are in microseconds. Pipe and shared
memory are unidirectional, while LRPC is bidirectional.
Because Ultrix is built around a set of fixed high-level abstractions, new primitives can be added only by emulating them on top of
existing ones. Specifically, implementations of lrpc must use pipes
or signals to transfer control. The cost of such emulation is high:
on Ultrix, lrpc using pipes costs 46 to 60 times more than lrpc on ExOS, and
using signals costs 26 to 37 times more than lrpc on ExOS. These experiments
Machine   OS       matrix
DEC2100   Ultrix   7.1
DEC2100   ExOS     7.0
DEC3100   Ultrix   5.2
DEC3100   ExOS     5.2
DEC5000   Ultrix   3.8
DEC5000   ExOS     3.7

Table 9: Time to perform a 150 x 150 integer matrix multiplication; times are in seconds.
If we compare the time for dirty to the time for prot1, we see
that over half the time spent in prot1 is due to the overhead of
parsing the page table. As we show in Section 7.2, this overhead
can be reduced through the use of a data structure more tuned to
efficient lookup (e.g., a hash table). Even with this penalty, ExOS
performs prot1 almost twice as fast as Ultrix. The likely reason for
this difference is that, as shown in Table 4, Aegis dispatches system
calls an order of magnitude more efficiently than Ultrix.
ExOS provides a rudimentary virtual memory system (approximately 1000 lines of heavily commented code). Its two main limitations are that it does not handle swapping and that page-tables are
implemented as a linear vector (address translations are looked up
in this structure using binary search). Barring these two limitations,
its interface is richer than other virtual memory systems we know
of. It provides flexible support for aliasing, sharing, disabling and
enabling of caching on a per-page basis, specific page-allocation,
and DMA.
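A sketch of that lookup follows; the entry format is illustrative, but the structure (a vector kept sorted by virtual page number and searched with binary search) is the one described above.

#include <stddef.h>
#include <stdint.h>

/* ExOS-style page table as a sorted linear vector; lookups use binary search. */
struct pte {
    uint32_t vpn;    /* virtual page number (sort key) */
    uint32_t pfn;    /* physical frame number */
    uint32_t prot;   /* protection bits */
};

/* Returns the matching entry, or NULL if the address is unmapped. */
struct pte *pt_lookup(struct pte *table, size_t n, uint32_t vpn)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].vpn == vpn)
            return &table[mid];
        if (table[mid].vpn < vpn)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;    /* fault: no translation for this page */
}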
The overhead of application-level virtual memory is measured by performing a 150 by 150 integer matrix multiplication. Because this
naive version of matrix multiply does not use any of the special
abilities of ExOS or Aegis (e.g., page-coloring to reduce cache
conflicts), we expect it to perform equivalently on both operating
systems. The times in Table 9 indicate that application-level virtual
memory does not add noticeable overhead to operations that have
reasonable virtual memory footprints. Of course, this is hardly a
conclusive proof.
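For reference, the benchmark is essentially the textbook triple loop below (an obvious reconstruction, not the exact benchmark source):

#include <stdio.h>

#define N 150

static int a[N][N], b[N][N], c[N][N];

int main(void)
{
    /* Naive integer matrix multiply: exercises ordinary loads and stores
     * through whatever virtual memory system the OS provides, but uses no
     * OS-specific features, so ExOS and Ultrix should perform similarly. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    printf("c[0][0] = %d\n", c[0][0]);   /* keep the result live */
    return 0;
}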
Machine   OS       dirty   prot1   prot100   unprot100   trap    appel1   appel2
DEC2100   Ultrix   n/a     51.6    175.0     175.0       240.0   383.0    335.0
DEC2100   ExOS     17.5    32.5    213.0     275.0       13.9    74.4     45.9
DEC3100   Ultrix   n/a     39.0    133.0     133.0       185.0   302.0    267.0
DEC3100   ExOS     13.1    24.4    156.0     206.0       10.1    55.0     34.0
DEC5000   Ultrix   n/a     32.0    102.0     102.0       161.0   262.0    232.0
DEC5000   ExOS     9.8     16.9    109.0     143.0       4.8     34.0     22.0

Table 10: Time to perform virtual memory operations on ExOS and Ultrix; times are in microseconds. The times for appel1 and appel2 are
per page.
Machine        OS            Roundtrip latency
DEC5000/125    ExOS/ASH      259
DEC5000/125    ExOS          320
DEC5000/125    Ultrix        3400
DEC5000/200    Ultrix/FRPC   340

Table 11: Roundtrip latency of a 60-byte packet over Ethernet; times are in microseconds.

[Figure 2: Average roundtrip latency with increasing number of
active processes on receiver; latency in microseconds versus number of processes.]
3. Message initiation. ASHs can initiate message sends, allowing for low-latency message replies.
Despite being measured on a slower machine, ExOS/ASH is
81 microseconds faster than a high-performance implementation of
RPC for Ultrix (FRPC) running on DECstation5000/200s and using
a specialized transport protocol [49]. In fact, ExOS/ASH is only 6
microseconds slower than the lower bound for cross-machine communication on Ethernet, measured on DECstation5000/200s [49].
Table 11 shows the roundtrip latency over Ethernet of ASH-based network messaging and compares it to ExOS without ASHs,
Ultrix, and FRPC [49] (the fastest RPC in the literature on comparable hardware). Roundtrip latency for Aegis and Ultrix was measured
by ping-ponging a counter in a 60-byte UDP/IP packet 4096 times
between two processes in user-space on DECstation5000/125s. The
FRPC numbers are taken from the literature [49]. They were measured on a DECstation5000/200, which is approximately 1.2 times
faster than a DECstation5000/125 on SPECint92.
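The sketch below illustrates the ping-pong structure of the measurement using standard BSD sockets; the 60-byte payload and 4096 iterations follow the text, but the socket API and the timing code are assumptions of the sketch, not a description of the original benchmark (error handling is omitted for brevity).

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERS 4096           /* number of round trips */
#define PAYLOAD 60           /* 60-byte UDP payload */

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <peer-ip> <port> [initiate]\n", argv[0]);
        return 1;
    }
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in me = {0}, peer = {0};
    me.sin_family = AF_INET;
    me.sin_port = htons((uint16_t)atoi(argv[2]));
    me.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (struct sockaddr *)&me, sizeof me);
    peer.sin_family = AF_INET;
    peer.sin_port = htons((uint16_t)atoi(argv[2]));
    inet_pton(AF_INET, argv[1], &peer.sin_addr);

    char buf[PAYLOAD] = {0};
    uint32_t counter = 0;
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    if (argc > 3)                      /* one side starts the exchange */
        sendto(s, buf, sizeof buf, 0, (struct sockaddr *)&peer, sizeof peer);
    for (int i = 0; i < ITERS; i++) {  /* bounce the counter back and forth */
        recvfrom(s, buf, sizeof buf, 0, NULL, NULL);
        memcpy(&counter, buf, sizeof counter);
        counter++;
        memcpy(buf, &counter, sizeof counter);
        sendto(s, buf, sizeof buf, 0, (struct sockaddr *)&peer, sizeof peer);
    }
    gettimeofday(&t1, NULL);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average roundtrip: %.1f microseconds\n", us / ITERS);
    close(s);
    return 0;
}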
Library operating systems, which work above the exokernel interface, implement higher-level abstractions and can define special-purpose implementations that best meet the performance and functionality goals of applications. We demonstrate the flexibility of the
Machine   lrpc   tlrpc
DEC2100   13.9   8.6
DEC3100   10.4   6.4
DEC5000   6.3    2.9

Table 12: Time to perform lrpc and trusted lrpc (tlrpc) on ExOS; times are in microseconds.
Most RPC systems do not trust the server to save and restore
registers [27]. We implemented a version of lrpc (see Section 6.1)
that trusts the server to save and restore callee-saved registers. We
call this version tlrpc (trusted LRPC). Table 12 compares tlrpc to
ExOS's more general IPC mechanism, lrpc, which saves all general-purpose callee-saved registers. Both implementations assume that
only a single function is of interest (e.g., neither uses the RPC
number to index into a table) and do not check permissions. Both
implementations are also single-threaded. The measurements show
that this simple optimization can improve performance by up to a
factor of two.
[Figure 3: Application-level stride scheduler; ideal versus measured allocation over time, in quanta.]
Many early operating system papers discussed the need for extensible, flexible kernels [32, 42]. Lampson's description of CAL-TSS [31] and Brinch Hansen's microkernel paper [24] are two classic rationales. Hydra was the most ambitious early system to have
the separation of kernel policy and mechanism as one of its central
tenets [55]. An exokernel takes the elimination of policy one step
further by removing mechanism wherever possible. This process
is motivated by the insight that mechanism is policy, albeit with one
less layer of indirection. For instance, a page-table is a very detailed
policy that controls how to translate, store and delete mappings and
what actions to take on invalid addresses and accesses.
Aegis includes a yield primitive to donate the remainder of a process's current time slice to another (specific) process. Applications
can use this simple mechanism to implement their own scheduling
algorithms. To demonstrate this, we have built an application-level
scheduler that implements stride scheduling [54], a deterministic,
proportional-share scheduling mechanism that improves on recent
work [53]. The ExOS implementation maintains a list of processes
for which it is responsible, along with the proportional share they
are to receive of its time slice(s). On every time slice wakeup, the
scheduler calculates which process is to be scheduled and yields to
it directly.
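The core of such a scheduler is small; the sketch below follows the standard formulation of stride scheduling [54], with yield_to() standing in for the Aegis yield primitive and an illustrative process table.

#include <stddef.h>
#include <stdint.h>

#define STRIDE1 (1u << 20)              /* scaling constant */

struct sched_proc {
    int      pid;
    uint32_t tickets;                   /* proportional share */
    uint32_t stride;                    /* STRIDE1 / tickets */
    uint64_t pass;                      /* virtual time of next slice */
};

static void yield_to(int pid)
{
    (void)pid;                          /* stub for the Aegis yield primitive */
}

void sched_add(struct sched_proc *p, int pid, uint32_t tickets)
{
    if (tickets == 0)
        tickets = 1;
    p->pid = pid;
    p->tickets = tickets;
    p->stride = STRIDE1 / tickets;
    p->pass = p->stride;                /* start one stride into the future */
}

/* Called on every time-slice wakeup: pick the runnable process with the
 * smallest pass value, advance its pass by its stride, and yield to it. */
void schedule_next(struct sched_proc *procs, size_t n)
{
    if (n == 0)
        return;
    size_t min = 0;
    for (size_t i = 1; i < n; i++)
        if (procs[i].pass < procs[min].pass)
            min = i;
    procs[min].pass += procs[min].stride;
    yield_to(procs[min].pid);           /* donate the time slice directly */
}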
VM/370 [17] exports the ideal exokernel interface: the hardware interface. On top of this hardware interface, VM/370 supports
a number of virtual machines on top of which radically different
operating systems can be implemented. However, the important
difference is that VM/370 provides this flexibility by virtualizing
the entire base-machine. Since the base machine can be quite complicated, virtualization can be expensive and difficult. Often, this
approach requires additional hardware support [23, 40]. Additionally, since much of the actual machine is intentionally hidden from
application-level software, such software has little control over the
actual resources and may manage the virtual resources in a counterproductive way. For instance, the LRU policy of pagers on top of
the virtual machine can conflict with the paging strategy used by
the virtual machine monitor [23]. In short, while a virtual machine
can provide more control than many other operating systems, application performance can suffer and actual control is lacking in key
areas.
Machine   Method                dirty   prot1   prot100   unprot100   trap    appel1   appel2
DEC2100   Original page-table   17.5    32.5    213.0     275.0       13.9    74.4     45.9
DEC2100   Inverted page-table   8.0     23.1    253.0     325.0       13.9    54.4     38.8
DEC3100   Original page-table   13.1    24.4    156.0     206.0       10.1    55.0     34.0
DEC3100   Inverted page-table   5.9     17.7    189.0     243.0       10.1    40.4     28.9
Table 13: Time to perform virtual memory operations on ExOS using two different page-table structures; times are in microseconds.
Modern revisitations of microkernels have argued for kernel extensibility [2, 43, 48]. Like microkernels, exokernels are designed
to increase extensibility. Unlike traditional microkernels, an exokernel pushes the kernel interface much closer to the hardware, which
allows for greater flexibility. An exokernel allows application-level
libraries to define virtual memory and IPC abstractions. In addition,
the exokernel architecture attempts to avoid shared servers (especially trusted shared servers), since they often limit extensibility.
For example, it is difficult to change the buffer management policy
of a shared file server. In many ways, servers can be viewed as fixed
kernel subsystems that run in user-space. Some newer microkernels push the kernel interface closer to the hardware [34], obtaining
better performance than previous microkernels. However, since
these systems do not employ secure bindings, visible resource revocation, and abort protocols, they give less control of resources to
application-level software.
Scout [25] and Vino [46] are other current extensible operating
systems. These systems are just beginning to be constructed, so it
is difficult to determine their relationship to exokernels in general
and Aegis in particular.
Third, traditional operating system abstractions can be implemented efficiently at application level. For instance, ExOS's
application-level VM and IPC primitives are much faster than Ultrix's corresponding primitives and than state-of-the-art implementations reported in the literature.
SPACE is a submicro-kernel that provides only low-level kernel abstractions defined by the trap and architecture interface [41].
Its close coupling to the architecture makes it similar in many ways
to an exokernel, but we have not been able to make detailed comparisons because its design methodology and performance have not
yet been published.
Fourth, applications can create special-purpose implementations of abstractions by merely modifying a library. We implemented several variations of fundamental operating system abstractions such as interprocess communication, virtual memory, and
schedulers with substantial improvements in functionality and performance. Many of these variations would require substantial kernel
alterations on today's systems.
References
[1] M. B. Abbott and L. L. Peterson. Increasing network throughput by integrating protocol layers. IEEE/ACM Transactions on Networking, 1(5):600-610, October 1993.

[4] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 95-109, October 1991.

[5] A. W. Appel and K. Li. Virtual memory primitives for user programs. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 96-107, Santa Clara, CA, April 1991.

[7] K. Bala, M. F. Kaashoek, and W. E. Weihl. Software prefetching and caching for translation lookaside buffers. In Proceedings of the First Symposium on Operating Systems Design and Implementation, pages 243-253, November 1994.

[9] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. Fiuczynski, D. Becker, S. Eggers, and C. Chambers. Extensibility, safety and performance in the SPIN operating system. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, December 1995.

[10] P. Cao, E. W. Felten, and K. Li. Implementation and performance of application-controlled file caching. In Proceedings of the First Symposium on Operating Systems Design and Implementation, pages 165-178, November 1994.

[16] D. D. Clark and D. L. Tennenhouse. Architectural considerations for a new generation of protocols. In ACM Communication Architectures, Protocols, and Applications (SIGCOMM) 1990, September 1990.

[26] K. Harty and D. R. Cheriton. Application-controlled physical memory using external page-cache management. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 187-199, October 1992.

[27] W. C. Hsieh, M. F. Kaashoek, and W. E. Weihl. The persistent relevance of IPC performance: New techniques for reducing the IPC penalty. In Fourth Workshop on Workstation Operating Systems, pages 186-190, October 1993.

[36] H. Massalin and C. Pu. Threads and input/output in the Synthesis kernel. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, pages 191-201, 1989.

[37] J. C. Mogul, R. F. Rashid, and M. J. Accetta. The packet filter: An efficient mechanism for user-level network code. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, pages 39-51, November 1987.

[47] M. Stonebraker. Operating system support for database management. Communications of the ACM, 24(7):412-418, July 1981.

[49] C. A. Thekkath and H. M. Levy. Limits to low-latency communication on high-speed networks. ACM Transactions on Computer Systems, 11(2):179-203, May 1993.

[51] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active messages: A mechanism for integrated communication and computation. In Proceedings of the 19th International Symposium on Computer Architecture, pages 256-267, May 1992.

[52] R. Wahbe, S. Lucco, T. Anderson, and S. Graham. Efficient software-based fault isolation. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 203-216, December 1993.

[53] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: Flexible proportional-share resource management. In Proceedings of the First Symposium on Operating Systems Design and Implementation, pages 1-11, November 1994.

[54] C. A. Waldspurger and W. E. Weihl. Stride scheduling: Deterministic proportional-share resource management. Technical Memorandum MIT/LCS/TM-528, MIT, June 1995.