The Design and Implementation of A Fully-Modular Self-Healing Operating System
All content following this page was uploaded by Philip Homburg on 20 May 2014.
Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum
Dept. of Computer Science, Vrije Universiteit Amsterdam, The Netherlands
{jnherder, herbertb, bjgras, philip, ast}@cs.vu.nl
We believe that we are the first to realize a fully-modular, open-source, POSIX-conformant operating system with self-healing properties. Although we primarily use a multiserver architecture because of its reliability, we will show that the system has many other benefits as well, for example, for system administration and programming. The system has been released (with all the source code) and over 100,000 people have downloaded it so far, as discussed later.

We first introduce how operating system structures have evolved over time (Sec. 2). Then we proceed with a detailed discussion of the kernel (Sec. 3) and the organization of the user-mode servers on top of it (Sec. 4). We review how our multiserver operating system realizes a dependable computing platform and highlight some additional benefits (Sec. 5), and briefly discuss its performance (Sec. 6). In the end, we survey related work (Sec. 7), and draw conclusions (Sec. 8).

[…]tions of the underlying hardware. All operating system […]

Monolithic designs have some inherent problems that affect their reliability. All operating system code, for example, runs at the highest privilege level without proper fault isolation, so that any bug can potentially trash the entire system. With millions of lines of code (LoC) and 1-16 bugs per 1000 LoC [22, 23], monolithic systems are likely to contain many bugs. Since 70% to 85% of all operating system crashes are caused by device drivers [2, 20], running untrusted, third-party code in the kernel also diminishes the system’s reliability.

From a high-level reliability perspective, a monolithic kernel is unstructured. The kernel may be partitioned into domains, but there are no protection barriers enforced between the components. Two simplified examples, Linux and Mac OS X, are given in Fig. 1.

Figure 1: Two typical monolithic systems: (a) Vanilla Linux and (b) Mac OS X. Their properties are discussed in Sec. 2.1.

Figure 2: Two typical single-server systems: (a) Mach-UX and (b) Perseus. Their properties are discussed in Sec. 2.2.
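The cited defect densities can be made concrete with a back-of-the-envelope calculation. The code-base size used below is an assumed, illustrative figure, not a number from this paper, and the function name is hypothetical:

```c
#include <assert.h>

/* Expected number of latent bugs for a code base of 'loc' lines at a
 * defect density of 'per_kloc' bugs per 1000 lines of code. */
static long expected_bugs(long loc, long per_kloc)
{
    return loc / 1000 * per_kloc;
}
```

For an assumed 5,000,000-line monolithic kernel, the cited range of 1-16 bugs per 1000 LoC translates to between 5,000 and 80,000 latent bugs.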
2.3 Multiserver Systems

In a multiserver design, the operating system environment is formed by a set of cooperating servers. Untrusted, third-party code such as device drivers can be run in separate, user-mode modules to prevent faults from spreading. High reliability can be achieved by applying the principle of least authority [15] and tightly controlling the powers of each module.

A multiserver design also has other advantages. The modular structure, for example, makes system administration easier and provides a convenient programming environment, as discussed in Sec. 5.

Several multiserver operating systems exist. An early system is MINIX [21], which distributed operating system functionality over two user-mode servers, but still ran the device drivers in the kernel, as shown in Fig. 3(a). More recently, IBM Research designed SawMill Linux [4], a multiserver environment on top of the L4 microkernel, as illustrated in Fig. 3(b). While the goal was a full multiserver variant of Linux, the project never passed the stage of a rudimentary prototype, and was abandoned when the people working on it left IBM.

3 THE KERNEL ARCHITECTURE

The kernel is responsible for low-level functionality that cannot be handled in user space, such as IPC, process scheduling, and interrupt handling. Ours consists of fewer than 4000 lines of code (LoC), which makes it easy to understand. The kernel provides only the most elementary mechanisms, whereas the user-mode servers implement the policies that drive the operating system.

3.1 Interprocess Communication

Interprocess communication (IPC) is of crucial importance in a multiserver system. IPC allows user processes to request operating system services and enables cooperation between servers and drivers. We compared many alternatives to find a suitable set of IPC primitives that is simple, efficient, and reliable. Finding IPC primitives that do not hang the system when message senders and receivers crash during a request-reply sequence is far from trivial. Consequently, our primitives have evolved as the system matured. In this paper we describe the primitives of the version currently in test; these differ in minor ways from those in previous versions.

Our IPC is characterized by rendezvous message passing using small, fixed-length messages. Rendezvous is a two-way interaction without […]
Finally, the IPC SELECT primitive can be used to receive a specified message. The caller can pass the message source or set of events it is interested in, and will be blocked until such a message arrives.

Messages are prioritized. Event notifications, such as hardware interrupts and timeouts, have the highest priority. Callback notifications have a lower priority, as they cannot be delivered together with other notifications. Finally, request messages have the lowest priority.

Design Principles We designed the IPC primitives to be simple, efficient, and reliable to reduce the amount of code and increase understandability. Complicated optimizations to improve resource usage often lead to complex, buggy code, so we have tried to keep the code straightforward. For example, only small, fixed-length messages are used. Messages are a union of different message types, and their size is determined at compile time as the largest of all types in the union.

To ensure messages are reliably delivered to the right destination, IPC endpoints are under the control of the kernel. An IPC endpoint is formed by the combination of a process’ slot number and the slot’s generation number, which is increased with each new process. This ensures that IPC directed to an exited process cannot end up at a process that reuses a slot.

Our IPC design eliminates the need for dynamic resource allocation, both in the kernel and in user space. The standard request-reply sequence uses a rendezvous, so that no message buffering is needed. If the destination is not waiting, IPC REQUEST blocks the sender. Similarly, a receiver is blocked on IPC SELECT when no IPC is available. Messages are never buffered in the kernel, but always directly copied from sender to receiver. No additional copies are required, speeding up IPC.

The asynchronous IPC NOTIFY mechanism is also not susceptible to resource exhaustion. Event notifications are typed, and at most one bit per type is saved. All pending notifications can be stored in a compact bitmap that is statically declared as part of the process table. Multiple pending notifications of the same type are merged. Although the amount of information that can be passed this way is limited, this design was chosen for its simplicity, reliability, and low memory requirements.

Protection Mechanisms Since IPC is a powerful construct, we included several mechanisms to restrict who can do what. First, we restrict the set of IPC primitives available to each process. User processes, for example, are allowed to use only IPC REQUEST. Second, we restrict who can request services from whom. A user process doing I/O, for example, cannot communicate directly with device drivers, but needs to send its requests to the file server instead. These restrictions also help to prevent deadlocks. Third, we restrict the use of event notifications. Only trusted processes, such as the process manager and file server, can use them. Callback events, in contrast, are also available to untrusted processes, such as drivers.

All these protection mechanisms are implemented by means of bitmaps that are statically declared as part of the process table. This is space efficient, prevents resource exhaustion, and allows for fast permission checks since only simple bit operations are required.

3.2 Process Scheduling

Scheduling is done using a fixed number of prioritized queues. The processes on each queue are kept in a linked list, and are scheduled round robin. The scheduler simply finds the highest populated queue and selects the first process on it to run.

Whenever a process becomes ready, it is put on the head of its queue when it still has some quantum left. A process goes to the rear of its queue only when it has no quantum left. While somewhat counterintuitive, this works because it ensures that processes doing a system call are not moved to the rear of the queue. It also makes the system responsive, since processes that were blocked for I/O can run immediately once the I/O is done.

Each time a process consumes a full quantum, it degrades in priority. Periodically, the priority of all processes is upgraded to prevent all processes from ending up in the lowest-priority queue. Since I/O-bound processes consume fewer quanta than CPU-bound processes, they will have a higher average priority, and are likely to be scheduled when the I/O finishes.

3.3 Interrupt Handling

Another important responsibility of the kernel is interrupt handling. Because this cannot be done in user space, we disentangled interrupt handlers and device drivers. User-mode device drivers can only instruct the kernel to transform specific interrupts into notification messages, and must do all further processing themselves. After registration, drivers can tell the kernel to enable and disable hardware interrupts.

The kernel catches all hardware interrupts with a generic interrupt handler that looks up which drivers are associated with the IRQ line, and sends a nonblocking notification message to each of them. As a side-effect, the generic interrupt handler gathers randomness for the random number generation device.

The only exception to the above is that the clock driver (CLOCK) defines its own handler as part of the kernel, as discussed below. This handler is simple and usually
does only accounting. When more work is needed, such as scheduling another process, a notification is sent to CLOCK for further processing.

All the real work is done at the process level, usually in a user-mode device driver, but sometimes also in a kernel task. This helps to achieve a low interrupt latency—since processes can be preempted—and makes the system suitable for real-time applications.

Figure 4: The kernel closely follows the process-oriented multiserver design. Services can be requested with ordinary, synchronous request messages. Kernel events are transformed into asynchronous notification messages.

Apart from the actual service requested, a kernel call from a user-mode server to a kernel task is similar to a system call from a user process to an operating system server. Both calls use the IPC primitives discussed in Sec. 3.1 and result in a synchronous request message being sent from one process to another. The services provided by the kernel tasks are discussed below.

Only a tiny fraction of the kernel is responsible for handling hardware interrupts, IPC traps, and exceptions. Whenever such an event happens, the CPU saves the state of the currently-running process and invokes the associated service routine that has been registered by the kernel. When the event has been processed, the kernel picks a (possibly different) process to run, restores the process’ state, and tells the CPU to resume normal execution. To keep the kernel simple, kernel reentries are forbidden.

Real-Time Properties Because the kernel is process oriented, it forms a suitable base for real-time systems. The kernel is locked only when this is absolutely required to prevent race conditions. Whenever a hardware interrupt occurs, the currently running kernel task or user process is preempted, the interrupt is transformed into a message, and another process is scheduled. If the interrupt is for a high-priority device, the associated driver is likely to be scheduled soon. Building a complete real-time operating system would require extending the scheduler with real-time primitives.

3.5.1 System Task (SYS)

[…]brary are transformed into request messages that are sent to SYS, which processes the requests, and sends reply messages. SYS never takes initiative by itself, but it is […]

The kernel calls handled by SYS can be grouped into several categories, including process management, memory management, copying data between processes, device I/O and interrupt management, access to kernel data structures, and clock services. An overview of common kernel calls is given in Fig. 5.

Kernel Call     Purpose
SYS FORK        Fork a process; copy parent slot
SYS EXEC        Execute a process; initialize slot
SYS EXIT        Exit a process; clear process slot
SYS NEWMAP      Assign memory segment to process
SYS VIRCOPY     Copy data using virtual addressing
SYS DEVIO       Read or write a single I/O port
SYS IRQCTL      Set or reset an interrupt policy
SYS PRIVCTL     Assign system process’ privileges
SYS GETINFO     Get a copy of kernel information
SYS TIMES       Get process times or kernel uptime
SYS SETALARM    Set or reset a synchronous alarm

Figure 5: A selection of common kernel calls. All calls require privileged operations and are handled by SYS.

3.5.2 Clock Task (CLOCK)

CLOCK is responsible for accounting of CPU usage, scheduling another process when a process’ quantum expires, managing watchdog timers, and interacting
with the hardware clock. It does not have a publicly-accessible user interface like SYS.

When the system starts up, CLOCK programs the hardware clock’s frequency and registers an interrupt handler that is run on every clock tick. The handler does only basic integer operations, that is, it increments a process’ […] is due, a notification is sent to CLOCK to do the real work at the task level. This minimizes the hardware interrupt latency, because kernel tasks can be preempted.

Although CLOCK has no direct interface from user space, its services can be accessed through the kernel calls handled by SYS. The most important call is SYS SETALARM, which allows system processes to schedule a synchronous alarm that causes a ‘timeout’ notification upon expiration. The alarm is synchronous because CLOCK delivers the notification message only when the client indicates it is ready to receive it.

As an aside, the POSIX alarm that is available to ordinary user applications is handled by the user-mode process manager server, and causes an asynchronous SIGALRM signal.

4 THE USER-MODE SERVERS

On top of the kernel we have implemented a multiserver operating system. The core components of this system are shown in Fig. 6. Apart from the device drivers and user processes, this constitutes the trusted computing base. Most of the servers are relatively small and simple. The sizes range from approximately 1000 to 3000 LoC per server, which makes them easy to understand and maintain. The components are discussed below.

Figure 6: The core components of the full multiserver operating system, and some typical IPC paths. Top-down IPC uses synchronous requests, whereas bottom-up IPC is done with asynchronous notifications.

We first give some examples to illustrate how our multiserver operating system actually works. Fig. 6 also shows some typical IPC interactions initiated by user processes. Although the POSIX operating system interface is implemented by multiple servers, system calls are transparently targeted to the right server by the system libraries. Four examples are given below:

(1) A user process that wants to create a child process calls the fork() library function, which sends a request message to the process manager (PM). PM verifies that a process slot is available, asks the memory manager (MM) to allocate memory, and instructs the kernel to create a copy of the process.

(2) A read() or write() call, in contrast, is sent to FS. If the requested block is available in the buffer cache, FS asks the kernel to copy it to the user. Otherwise, it first sends a message to the disk driver asking it to retrieve the block from disk. The driver sets an alarm, commands the disk controller through a device I/O request to the kernel, and awaits the hardware interrupt or timeout notification. For character devices, the user process may be suspended until the driver notifies FS that the data is ready.

(3) Additional servers and drivers can be started on the fly by requesting the reincarnation server (RS). RS then forks a new process, assigns all needed privileges, and, finally, executes the given path in the child process (not shown in the figure). Information about the new system process is published in the data store (DS), which allows parts of the operating system to subscribe to updates in the operating system configuration.

(4) Although not a system call, it is interesting to see what happens if a user or operating system process causes an exception, for example, due to an invalid pointer. In this event, the kernel’s exception handler notifies PM, which transforms the exception into a signal or kills the process when no handler is registered. Recovery in case of operating system failures is discussed below.

Design Principle The general design principle that led to the above set of servers is that each process should be limited to its core business. Having small, well-defined services helps to keep the implementation simple and understandable. As in the original UNIX philosophy, each server has limited responsibility and power, as is reflected in its name.

For example, FS must interact with drivers, but should not be checking for weird driver failures like nonresponsiveness. Although FS could manage driver timeouts itself, this would complicate its design and implementation. Therefore, it relies on a separate component, RS, which is responsible for the system’s well-being. If FS hangs on a driver, RS will detect that the driver is not responding, and will kill the driver and revive FS. Although killing a nonresponsive driver seems harsh, a properly designed driver must adhere to the protocol and return an error code to FS if it cannot fulfill the request.
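The fixed-length message union of Sec. 3.1 and the request-reply path of example (2) can be sketched in a few lines of C. This is a deliberately simplified toy: the struct layouts, type codes, endpoint numbers, and function names are our own illustrations, not the system’s actual definitions, and the blocking rendezvous is stood in for by a direct function call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { M_READ = 1, M_REPLY = 2 };        /* hypothetical message types */

/* Messages are a union of typed structs; the union's size is fixed at
 * compile time as the size of its largest member. */
typedef union {
    struct { int type; int source; int fd; size_t nbytes; char *buf; } m_read;
    struct { int type; int source; int status; } m_reply;
} message;

/* Stub file server: satisfies a read from a fake cache block, copying
 * directly into the caller's buffer (no intermediate kernel buffering),
 * then overwrites the message in place with the rendezvous reply. */
static int fs_handle(message *m)
{
    static const char cache_block[] = "hello";   /* fake buffer cache */
    if (m->m_read.type != M_READ) return -1;
    size_t n = m->m_read.nbytes;
    if (n > sizeof(cache_block) - 1) n = sizeof(cache_block) - 1;
    memcpy(m->m_read.buf, cache_block, n);
    m->m_reply.type = M_REPLY;
    m->m_reply.source = 1;                       /* FS endpoint, made up */
    m->m_reply.status = (int)n;
    return 0;
}

/* Client side of read(): build a fixed-length request and "send" it to
 * FS; in the real system the send would block until FS receives it. */
static int do_read(int fd, char *buf, size_t nbytes)
{
    message m;
    m.m_read.type = M_READ;
    m.m_read.source = 42;                        /* caller endpoint, made up */
    m.m_read.fd = fd;
    m.m_read.nbytes = nbytes;
    m.m_read.buf = buf;
    if (fs_handle(&m) != 0) return -1;
    return m.m_reply.status;
}
```

Because every message has the same size, the kernel can copy it from sender to receiver without dynamic allocation, which is the property the text emphasizes.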
Multiserver Protocol Although our IPC facilities are fairly reliable by design, as discussed in Sec. 3.1, they cannot prevent deadlocks. Therefore, we devised a multiserver protocol that ensures that synchronous messages are sent in only one direction. There is a loose layering based on who can make requests to whom. User processes can call the servers, servers can call each other and drivers, and servers and drivers can call the kernel. IPC in the opposite direction is done using the nonblocking notification mechanism.

Another aspect of the multiserver protocol is that we try to minimize copying to prevent loss of performance. The number of copies required for I/O is precisely the same as in a monolithic system. Although there are more context switches because a user process, the file server, and a driver must interact, no intermediate copies are required. For example, I/O for character devices is not buffered. All data is directly copied between the user process and the device driver.

4.1 Core Components

This section discusses the core components shown in Fig. 6. We will focus on design decisions that are specific to the multiserver aspects of the system.

4.1.1 Process Manager (PM)

PM is responsible for process management, such as creating and removing processes, assigning process identifiers and priorities, and controlling the flow of execution. Furthermore, PM maintains relations between processes, such as process groups and parent-child blood lines. The latter, for example, has consequences for exiting processes and accounting of CPU time.

Although the kernel provides mechanisms, for example, to set up the CPU registers, PM implements the process management policies. As far as the kernel is concerned, all processes are similar; all it does is schedule the highest-priority ready process.

Signal Handling PM is also responsible for POSIX signal handling. When a signal is to be delivered, by default, PM either ignores it or kills the process. Ordinary user processes can register a signal handler to catch signals. In this case, PM interrupts pending system calls and puts a signal frame on the stack of the process to run the handler. This approach is not suitable for system processes, however, as it interferes with IPC. Therefore, we implemented an extension to the POSIX sigaction() call so that system processes can request PM to transform signals into notification messages. Since event notification messages have the highest priority of all message types, signals are delivered promptly.

4.1.2 File Server (FS)

FS manages the file system. It is an ordinary file server that handles standard POSIX calls such as open(), read(), and write(). More advanced functionality includes support for symbolic links and the select() system call. FS is also the interface to the network server.

For performance reasons, file system blocks are buffered in FS’ buffer cache. To maintain file system consistency, however, crucial file system data structures use write-through semantics, and the cache is periodically written to disk.

Since the file server runs as an isolated process that is fully IPC driven, it can be replaced with a different one to serve other file systems, such as FAT. Moreover, it should be relatively straightforward to transform FS into a network file server that runs on a remote host.

Device Driver Handling Because device drivers can be dynamically configured, FS maintains a table with the mapping of major numbers onto specific drivers. As discussed below, FS is automatically notified of changes in the operating system configuration through a publish-subscribe system. This decouples the file server and the drivers it depends on.

A goal of our research is to automatically recover from common driver failures without human intervention. When a disk driver failure is detected, the system can recover transparently by replacing the driver and rewriting the blocks from FS’ buffer cache. For character devices, transparent recovery is also sometimes possible. Such failures are pushed to user space, but may be dealt with by the application if the I/O request can be reissued. A print job, for example, can be reissued by the print spooler system.

4.1.3 Memory Manager (MM)

To facilitate ports to different architectures, we use a hardware-independent, segmented memory model. Memory segments are contiguous, physical memory areas. Each process has a text, stack, and data segment. System processes can be granted access to additional memory segments, such as the video memory or the RAM disk memory. Although the kernel is responsible for hiding the hardware-dependent details, MM does the actual memory management.

MM maintains a list of free memory regions, and can allocate or release memory segments for other system services. Currently MM is integrated into PM and provides support for Intel’s segmented memory model, but work is in progress to split it out and offer limited virtual memory capabilities, for example, shared libraries.
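The free-list management described for MM can be pictured as a first-fit allocator over contiguous physical regions. The following sketch is an illustrative toy under our own assumptions (made-up sizes, table-based hole list, no merging of adjacent holes), not MM’s actual code:

```c
#include <assert.h>

#define NR_HOLES 8

/* A free region of contiguous physical memory (units are arbitrary). */
struct hole { unsigned long base, len; };

static struct hole holes[NR_HOLES] = {
    { 0x0800, 0x0400 },          /* assumed initial free regions */
    { 0x2000, 0x4000 },
};

/* First fit: take the first hole large enough, shrink it, and return
 * the segment base; (unsigned long)-1 signals failure. */
static unsigned long alloc_segment(unsigned long len)
{
    for (int i = 0; i < NR_HOLES; i++) {
        if (holes[i].len >= len) {
            unsigned long base = holes[i].base;
            holes[i].base += len;
            holes[i].len  -= len;
            return base;
        }
    }
    return (unsigned long)-1;
}

/* Releasing a segment records a new hole in an empty table slot; a
 * real implementation would also coalesce adjacent holes. */
static void free_segment(unsigned long base, unsigned long len)
{
    for (int i = 0; i < NR_HOLES; i++) {
        if (holes[i].len == 0) {
            holes[i].base = base;
            holes[i].len = len;
            return;
        }
    }
}
```

Because segments are contiguous, allocation and release are a handful of comparisons over a small static table, in keeping with the paper’s avoidance of dynamic kernel resource allocation.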
We will not support demand paging, however, because we believe physical memory is no longer a limited resource in most domains. We strive to keep the code simple and eliminate complexity whenever possible. Swapping segments to disk would be easy to add, but in the interest of simplicity we have not done so.

4.1.4 Reincarnation Server (RS)

RS is the central component responsible for managing all operating system servers and drivers. While PM is responsible for process management in general, RS deals with only privileged processes. It acts as a guardian and ensures liveness of the operating system.

Administration of system processes also goes through RS. A utility program, service, provides the user with a convenient interface to RS. It allows the system administrator to start and stop system services, (re)set their policies, or gather statistics. For optimal flexibility in specifying policies, a shell script can be set to run on certain events, including device driver crashes.

Fault Set The fault set that RS deals with consists of protocol errors, transient failures, and aging bugs. Protocol errors mean that a system process does not adhere to the multiserver protocol, for example, by failing to respond to a request. Transient failures are problems caused by specific configuration or timing issues that are unlikely to recur. Aging bugs are implementation problems that cause a component to fail over time, for example, when it runs out of buffers due to memory leaks.

Logical errors, where a server or driver perfectly adheres to the specified system behavior but fails to perform the actual request, are excluded. An example of a logical error is a printer driver that accepts a print job and confirms that the printout was successfully done, but, in fact, prints garbage. Such bugs are virtually impossible to catch in any system.

Fault Detection and Recovery During system initialization, RS adopts all processes in the boot image as its children. System processes that are started later also become children of RS. This ensures immediate crash detection, because PM raises a SIGCHLD signal that is delivered to RS when a system process exits.

In addition, RS can check the liveness of the system. If the policy says so, RS does a periodic status check and expects a reply in the next period. Failure to respond will cause the process to be killed. The status requests and the consequent replies are sent using a nonblocking event notification.

Whenever a problem is detected, RS can replace the malfunctioning component with a fresh copy. The associated policy script, however, might not restart the component, which is useful, for example, for development purposes. Another policy might use a binary exponential backoff protocol when restarting components to prevent clogging the system due to repeated failures. In any event, the problems are logged so that the system administrator can always find out what happened. Optionally, an e-mail can be sent to a remote administrator.

Failed components can be restarted from a fresh copy on disk, except for the disk driver, which is restarted from a copy kept in RAM.

4.1.5 Data Store (DS)

DS is a small database server with publish-subscribe functionality. It serves two purposes. First, system processes can use it to store some data privately. This redundancy is useful in the light of fault tolerance. A restarting system service, for example, can request state that it lost when it crashed. Such data is not publicly accessible.

Second, the publish-subscribe mechanism is the glue between operating system components. It provides a flexible interaction mechanism and elegantly reduces dependencies by decoupling producers and consumers. A producer can publish data with an associated identifier. A consumer can subscribe to selected events by specifying the identifiers or regular expressions it is interested in. Whenever a piece of data is updated, DS automatically broadcasts notifications to all dependent components. Although we currently do not do this, in the future, drivers could announce every request message and I/O completion to DS. In this manner, if a driver crashes, its replacement could find out what work was pending, similar to the shadow drivers in Nooks [19].

Naming Service IPC endpoints are formed by the process and generation numbers, which are controlled and managed by the kernel. Because every process has a unique IPC endpoint, system processes cannot easily find each other. Therefore, we introduced stable identifiers that consist of a natural-language name plus an optional number. The identifiers are globally known. Whenever a system process is (re)started, RS publishes its identifier and the associated IPC endpoint at DS for future lookup by other system services.

In contrast to earlier systems, such as Mach, our naming service is a higher-level construction that is realized in user space. Mach provided stable IPC endpoints in the kernel, namely the ‘port’ mechanism, to which a client and server could attach. This required bookkeeping in the kernel and did not solve the problems introduced by exiting and reappearing system services. We have intentionally pushed all this complexity to user space.
Error Handling Since fault tolerance is an explicit design goal, the naming service is an integral part of the design. The publish-subscribe mechanism of DS makes it very suitable to inform other processes of changes in the operating system. Moreover, recovery of, say, a driver is made explicit to the services that depend on it.

For example, FS subscribes to the identifier for the disk drivers. When the system configuration changes, DS notifies FS about the event. FS then calls back to find out what happened. If FS discovers that a driver has been restarted, it tries to recover transparently to the user.

4.1.6 Device Drivers

All operating systems hide the raw hardware under a layer of device drivers. Consequently, we have implemented drivers for ATA, S-ATA, floppy, and RAM disks, keyboards, displays, audio, printers, serial lines, various Ethernet cards, etc.

Although device drivers can be very challenging technically, they are not very interesting in the operating system design space. What is important, though, is that each of ours runs as an independent user-mode process to prevent faults from spreading outside its address space and to make it easy to replace a crashed or looping driver without a reboot. This is the self-healing property referred to in the title. While other people have measured the performance of user-mode drivers [10], no currently-available system is self-healing like this.

We are obviously aware that not all bugs can be eliminated by restarting a failed driver, but since the bugs that make it past driver testing tend to be timing bugs or memory leaks rather than algorithmic bugs, a restart often does the job.

4.2 Optional Components

In addition to the core components discussed above, several operating system services are started on the fly with the help of RS. The most important ones are discussed here.

[…]structures at any time. Examples include the process table of the kernel or PM, the device driver mappings at FS, and the status of privileged processes at RS.

4.2.2 Network Server (INET)

The network server, INET, implements TCP/IP in user space. The interface offered to the application programmer is BSD sockets. As with other I/O handles, sockets are managed by the file server, but FS transparently forwards networking requests to INET, which manages both TCP streams and UDP datagrams.

Like FS, INET requests DS to be notified about the configuration of Ethernet drivers, and can handle driver crashes. Since the TCP protocol prescribes retransmission of lost packets (and lost datagrams are explicitly allowed by the UDP protocol), INET can fully recover from Ethernet driver failures, transparently to the user.

4.2.3 X Window System (X)

To demonstrate that our ideas are practical in real UNIX-like systems, we have ported a recent version of the X Window System. X provides a client-server interface between display hardware and the desktop environment. We have successfully run large GUI applications, including the Firefox browser, over a network. A downside of X is that it is a large, monolithic window system, but it clearly shows that our system can run real-world software. Future work might include porting a small, modular window system.

5 BENEFITS OF THIS DESIGN

Although the main reason for using a multiserver architecture is achieving extremely high reliability, there are other advantages to this approach. In this section, we will highlight some benefits of our system for programmers, system administrators, and end users.
No need for a reboot and a check of the file system. Crashes result in core dumps that can be inspected and debugged using normal tools.

The programming model is more convenient, because the user-mode programming model is much closer to the POSIX standard than the restricted kernel API. This lowers the barrier for experimentation by new users and might improve driver quality. Furthermore, user-mode drivers enforce a proper design respecting interfaces, which leads to cleaner code.

Finally, there is good accountability. RS logs all component crashes so that it is clear where the error occurred. This might have legal liability implications for the developer, and lead to more carefully crafted drivers.

5.2 System Administration

The multiserver approach also makes system administration much easier due to the presence of many small, well-understood, self-contained modules instead of a massive kernel. As mentioned above, the kernel consists of less than 4000 LoC and the core servers are even smaller. This improves the system's maintainability, because the components are easy to understand and can be maintained independently from each other, as long as the interfaces between them are respected.

Because the operating system can be dynamically configured, a system administrator can quickly respond to security breaches. Instead of applying a patch and rebooting the system, a malfunctioning device driver can be replaced with a new one on the fly. This allows updates without loss of service or downtime.

Configurability  The multiserver model makes it easy to configure a system. The core components are always present to provide the basic operating system services, but optional components can be installed and loaded at a later time, without having to reboot the system.

Small consumer appliances, such as mobile phones and PDAs, and embedded systems need a small, configurable operating system. Trying to squeeze a large monolithic system like Linux into a small device requires much more effort than mixing and matching parts from a modular system that started small.

5.3 End-User Reliability

In the previous sections we already discussed how multiple, independent servers and drivers help to improve the end-user reliability of an operating system. Our system's major reliability features are briefly summarized below.

Structural Measures  The system is designed to prevent problems from occurring in the first place. For example, in the design of our device drivers we use shared library code. This helps to improve reliability since the code is thoroughly tested by the many drivers that use it. We also postpone initialization of drivers until their first use so that they cannot hang the system at boot time.

The use of separate memory segments protects against many types of buffer overruns. Code injection is no longer possible because the text segment is read-only and the stack and data segments are not executable. Consequently, even if a buffer overrun injects a worm or virus onto the stack, this code cannot be executed. While other types of attacks exist, for example, the return-to-libc attack, they are harder to exploit.

Problem Detection  Since all operating system services run as separate user-mode processes, we can detect many problems, just like we can for ordinary applications. Invalid pointers or illegal access attempts are caught by the MMU and will cause a signal from PM. The scheduler's feedback mechanism tames infinite loops by lowering a process' priority. Furthermore, a process' security policy is checked whenever it makes a system call.

Security Policies  Each user, server, and driver process has an associated policy that specifies what it can do. Only the minimal privileges needed to perform the task are given, according to the principle of least authority. In contrast, in a monolithic operating system it usually is not possible to precisely restrict individual components.

Driver access to I/O ports can be limited by a range stored in the kernel's process table. In this way, if, say, the printer driver tries to write to the disk's I/O ports, the kernel can prevent the access. Stopping rogue DMA is not possible with current hardware, but as soon as an I/O MMU is added, we can prevent that, too.

Furthermore, we can tightly restrict the IPC capabilities of each process, as discussed in Sec. 3.1. User applications, for example, can use only IPC REQUEST, and only to a subset of the operating system servers.

Fault Detection and Recovery  Sec. 4.1 discussed how RS deals with crashes at the operating system level. RS provides immediate crash detection and does periodic status checks if the policy says so. Depending on the policy, failing components are automatically restarted. If a block device driver fails, FS can provide transparent recovery to the application level by flushing its buffer cache. For character device drivers the error is pushed to the user level, where transparent recovery is sometimes possible. Typically, daemons need to be rewritten to retry instead of giving up on the first failure.
6 PERFORMANCE

Multiserver systems based on microkernels have been criticized for decades because of alleged performance problems. We argue that a modular system need not be slow due to the additional copying and context-switching overhead introduced when multiple servers cooperate to perform a task. While this was the case for some early microkernel systems, current multiserver systems have competitive performance.

As a case in point, BSD UNIX on top of the early Mach microkernel was well over 50% slower than the normal version of BSD UNIX, which led to the impression that microkernels are slow. Modern microkernels, however, have proven that high performance actually can be realized. L4 Linux on top of L4, for example, has a performance loss of about 5% [6]. Another project recently demonstrated that a user-mode gigabit Ethernet driver can achieve the same performance as a kernel-mode driver up to 750 Mbps. Above that, throughput for the user-mode driver dropped by 7% [10].

We have done extensive measurements of our system and presented the results in a technical report [7]. We can summarize these results (obtained on a 2.2 GHz Athlon) as follows. The simplest system call, getpid, takes 1.011 microseconds, which includes two messages and four context switches. Rebuilding the full system, which is heavily disk bound, has an overhead of 7%. Jobs with mixed computing and I/O, such as sorting, sedding, grepping, prepping, and uuencoding a 64-MB file, have overheads of 4%, 6%, 1%, 9%, and 8%, respectively. The system can do a build of the kernel and all user-mode servers and drivers in the boot image within 6 sec. In that time it performs 112 compilations and 11 links (about 50 msec per compilation). Fast Ethernet easily runs at full speed, and initial tests show that we can also drive gigabit Ethernet at full speed. Finally, the time from exiting the multiboot monitor to the login prompt is under 5 sec.

It has to be noted that the prototype incorporates many new security checks that cause some overhead. Furthermore, we have not yet done any performance optimizations. Careful analysis and removal of bottlenecks may boost the performance. We believe a performance penalty of less than 5% is realistic.

7 RELATED WORK

In this section we review some related operating systems. Note that we survey complete operating systems and not individual kernels to make the comparison fair. In other words, we compare our complete POSIX-conformant operating system to other systems that provide a comparable full system call interface.

7.1 Virtual Machines and Exokernels

Virtual machines [16] and exokernels [3] do not provide a hardware abstraction layer like (other) operating systems. Instead, they respectively duplicate or partition the available hardware resources so that multiple operating systems can run next to each other, each with the illusion of having a private machine. A virtual machine monitor or exokernel runs in kernel mode and is responsible for the protection of resources and the multiplexing of hardware requests, whereas each operating system runs in user mode, fully isolated from the others. These technologies provide an interface to an operating system, but do not represent a complete system by themselves. Neither approach solves the problem we try to tackle: how to build a reliable operating system that can heal itself after a fatal bug has been triggered in a device driver or server.

7.2 Monolithic Systems

A monolithic system runs the entire operating system in kernel mode without proper fault isolation. Although these properties negatively affect the system's reliability, as discussed in Sec. 2.1, many operating systems have a monolithic design.

Windows XP and Vista  Microsoft Windows XP is an example of a monolithic system that runs the entire operating system in kernel mode. Although Microsoft once tried to move some components to user space, the performance penalty was deemed too high.

Faster hardware and consumer demands for reliability made Microsoft revisit this design decision. In Vista, Microsoft plans to run many device drivers and the graphics subsystem in user mode, thus demonstrating that Microsoft has come to the same insight that we have: user-mode drivers are the way to go.

Linux and Isolated Drivers  Linux has basically the same monolithic style as Windows. One major difference with Windows is that the graphics subsystem in Linux, as in other UNIX systems, has always been in user space. As the system ages, it acquires more and more functionality that ends up in the kernel, with all the consequences for maintainability and reliability.

An important project to improve the reliability of commodity systems such as Linux is Nooks [19, 20]. Nooks keeps device drivers in the kernel but transparently encloses them in a kind of lightweight protective wrapper so that driver bugs cannot propagate to other parts of the operating system. All traffic between the driver and the rest of the kernel is inspected by the reliability layer.

Another project uses virtual machines to isolate device drivers from the rest of the system [11, 12]. When
a driver is called, it is run on a different virtual machine than the main system so that a crash or other fault does not pollute the main system. In addition to isolation, this technique enables unmodified reuse of device drivers when experimenting with new operating systems.

A recent project ran Linux device drivers in user mode with small changes to the Linux kernel [10]. This work shows that drivers can be isolated in separate user-mode processes without significant performance degradation.

While isolating device drivers helps to improve the reliability of legacy operating systems, we believe a proper, modular design from scratch gives better results. This includes encapsulating all operating system components (e.g., the file system and memory manager) in independent, user-mode processes.

Mac OS X  Apple's Mac OS X is yet another example of a monolithic kernel, but it has a layered kernel structure. The lowest layer is a microkernel based on Mach. The BSD UNIX personality is part of the kernel, as are various other components. This makes it simply a differently-structured monolithic kernel, and does not add many reliability features.

VxWorks  VxWorks is a POSIX-compliant, real-time operating system, generally used in embedded systems. The core of VxWorks is referred to as the 'Wind' microkernel, but, in fact, the kernel contains the operating system, including device drivers, and thus has a monolithic structure. VxWorks has historically provided only kernel mode, requiring users to develop exclusively in this mode. Only recently has VxWorks AE provided a real-time process model that enables memory protection between processes. While this makes it possible to run isolated user-mode applications, it is up to the developer to enable the protection; ours is mandatory.

7.3 Single-Server Systems

While these systems are more modular than the previous examples, the operating system still runs as a huge, monolithic server. Several systems follow this design.

L4 Linux-based Systems  A typical example of a single-server system is L4 Linux, in which Linux is run on top of the L4 microkernel. User processes obtain operating system services by making remote procedure calls to the Linux server using L4's IPC mechanism. Measurements show the performance penalty over native Linux to be about 5% [6].

A real-time system built with the help of L4 Linux is DROPS [5]. It is targeted toward multimedia applications. However, most of the device drivers still run as part of a big L4 Linux server, with only the multimedia subsystem running separately.

Perseus [13] is another L4 Linux-based system. It was designed to provide secure digital signatures while still supporting legacy applications for Linux. L4 Linux is used for most operations, but whenever a document needs to be signed, control is given to a trusted subsystem that includes a signature server.

The problem with all these systems is that a single bug in, say, a device driver can still crash the entire operating system server. The only gain of this design from a reliability point of view is a faster reboot.

7.4 Multiserver Systems

Multiserver operating systems distribute functionality over multiple, isolated components. Although several multiserver systems exist, to the best of our knowledge, nobody has yet built and released a working, fully modular, open-source, multiserver operating system.
Singularity  A recent multiserver system developed by Microsoft Research is Singularity [9]. In contrast to other systems, Singularity is based on language safety and bypasses the hardware protection offered by the MMU. The trusted base consists of the parts of the kernel and run-time system that are not verifiably safe. The operating system is run on top of the kernel as a set of verifiably safe, software-isolated servers, each running under the control of its own run-time system. A contract specifies allowable interactions via state-machine-driven IPC declarations. These contracts can be statically verified, but are complex and hard to get correct without knowledge of formal specifications. Building applications for Singularity means a paradigm shift for the programmer, making it less suitable for large-scale adoption.

Symbian OS  This operating system is designed for small, handheld devices, especially mobile phones. Symbian shares many characteristics with monolithic systems. For example, process management, memory management, device drivers, and dynamically loadable modules are all implemented in the kernel. Only the file server and the networking and telephony stacks are hosted in user-mode servers.

8 CONCLUSION

Our system represents a new data point in the spectrum from monolithic to fully modular structure. The design consists of a small kernel running the entire operating system as a collection of independent, isolated, user-mode processes. While people have tried for years to produce a fully modular microkernel-based UNIX clone with decent performance (such as GNU Hurd), we have actually done it, tested it heavily, and released it.

The kernel implements only the minimal mechanisms required to build an operating system upon. It provides IPC, scheduling, and interrupt handling, and contains two kernel tasks (SYS and CLOCK) to support the user-mode operating system parts. The core servers are the process manager (PM), memory manager (MM), file server (FS), reincarnation server (RS), and data store (DS). Since the size of these components ranges from about 1000 to 4000 lines of code, they are easy to understand and maintain.

Additional operating system services, such as device drivers, the information server (IS), the window system (X), and the network server (INET), can be started on the fly and are guarded by RS. The system is robust and self-healing, so that it can withstand and automatically recover from common failures in these components, transparently to applications and without user intervention.