The Design and Implementation of A Fully-Modular Self-Healing Operating System
All content following this page was uploaded by Philip Homburg on 20 May 2014.
Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum
Dept. of Computer Science, Vrije Universiteit Amsterdam, The Netherlands
{jnherder, herbertb, bjgras, philip, ast}@cs.vu.nl
We believe that we are the first to realize a fully-modular, open-source, POSIX-conformant operating system with self-healing properties. Although we primarily use a multiserver architecture because of its reliability, we will show that the system has many other benefits as well, for example, for system administration and programming. The system has been released (with all the source code) and over 100,000 people have downloaded it so far, as discussed later.

We first introduce how operating system structures have evolved over time (Sec. 2). Then we proceed with a detailed discussion of the kernel (Sec. 3) and the organization of the user-mode servers on top of it (Sec. 4). We review how our multiserver operating system realizes a dependable computing platform and highlight some additional benefits (Sec. 5), and briefly discuss its performance (Sec. 6). In the end, we survey related work (Sec. 7), and draw conclusions (Sec. 8).

[…]tions of the underlying hardware. All operating system […]

Monolithic designs have some inherent problems that affect their reliability. All operating system code, for example, runs at the highest privilege level without proper fault isolation, so that any bug can potentially trash the entire system. With millions of lines of code (LoC) and 1-16 bugs per 1000 LoC [22, 23], monolithic systems are likely to contain many bugs. Since 70% to 85% of all operating system crashes are caused by device drivers [2, 20], running untrusted, third-party code in the kernel also diminishes the system’s reliability.

From a high-level reliability perspective, a monolithic kernel is unstructured. The kernel may be partitioned into domains, but there are no protection barriers enforced between the components. Two simplified examples, Linux and Mac OS X, are given in Fig. 1.

Figure 1: Two typical monolithic systems: (a) Vanilla Linux and (b) Mac OS X. Their properties are discussed in Sec. 2.1.

Figure 2: Two typical single-server systems: (a) Mach-UX and (b) Perseus. Their properties are discussed in Sec. 2.2.
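The cited defect densities can be made concrete with a back-of-the-envelope calculation. The code-base size used below is an assumed, illustrative figure, not a number from this paper, and the function name is hypothetical:

```c
#include <assert.h>

/* Expected number of latent bugs for a code base of 'loc' lines at a
 * defect density of 'per_kloc' bugs per 1000 lines of code. */
static long expected_bugs(long loc, long per_kloc)
{
    return loc / 1000 * per_kloc;
}
```

For an assumed 5,000,000-line monolithic kernel, the cited range of 1-16 bugs per 1000 LoC translates to between 5,000 and 80,000 latent bugs.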
2.3 Multiserver Systems

In a multiserver design, the operating system environment is formed by a set of cooperating servers. Untrusted, third-party code such as device drivers can be run in separate, user-mode modules to prevent faults from spreading. High reliability can be achieved by applying the principle of least authority [15] and tightly controlling the powers of each module.

A multiserver design also has other advantages. The modular structure, for example, makes system administration easier and provides a convenient programming environment, as discussed in Sec. 5.

Several multiserver operating systems exist. An early system is MINIX [21], which distributed operating system functionality over two user-mode servers, but still ran the device drivers in the kernel, as shown in Fig. 3(a). More recently, IBM Research designed SawMill Linux [4], a multiserver environment on top of the L4 microkernel, as illustrated in Fig. 3(b). While the goal was a full multiserver variant of Linux, the project never passed the stage of a rudimentary prototype, and was abandoned when the people working on it left IBM.

3 THE KERNEL ARCHITECTURE

The kernel is responsible for low-level functionality that cannot be handled in user space, such as IPC, process scheduling, and interrupt handling. Ours consists of fewer than 4000 lines of code (LoC), which makes it easy to understand. The kernel provides only the most elementary mechanisms, whereas the user-mode servers implement the policies that drive the operating system.

3.1 Interprocess Communication

Interprocess communication (IPC) is of crucial importance in a multiserver system. IPC allows user processes to request operating system services and enables cooperation between servers and drivers. We compared many alternatives to find a suitable set of IPC primitives that is simple, efficient, and reliable. Finding IPC primitives that do not hang the system when message senders and receivers crash during a request-reply sequence is far from trivial. Consequently, our primitives have evolved as the system matured. In this paper we describe the primitives of the version currently in test; these differ in minor ways from those in previous versions.

Our IPC is characterized by rendezvous message passing using small, fixed-length messages. Rendezvous is a two-way interaction without […]
Finally, the IPC SELECT primitive can be used to receive a specified message. The caller can pass the message source or set of events it is interested in, and will be blocked until such a message arrives.

Messages are prioritized. Event notifications, such as hardware interrupts and timeouts, have the highest priority. Callback notifications have a lower priority, as they cannot be delivered together with other notifications. Finally, request messages have the lowest priority.

Design Principles We designed the IPC primitives to be simple, efficient, and reliable to reduce the amount of code and increase understandability. Complicated optimizations to improve resource usage often lead to complex, buggy code, so we have tried to keep the code straightforward. For example, only small, fixed-length messages are used. Messages are a union of different message types, and their size is determined at compile time as the largest of all types in the union.

To ensure messages are reliably delivered to the right destination, IPC endpoints are under the control of the kernel. An IPC endpoint is formed by the combination of a process’ slot number and the slot’s generation number, which is increased with each new process. This ensures that IPC directed to an exited process cannot end up at a process that reuses a slot.

Our IPC design eliminates the need for dynamic resource allocation, both in the kernel and in user space. The standard request-reply sequence uses a rendezvous, so that no message buffering is needed. If the destination is not waiting, IPC REQUEST blocks the sender. Similarly, a receiver is blocked on IPC SELECT when no IPC is available. Messages are never buffered in the kernel, but always directly copied from sender to receiver. No additional copies are required, speeding up IPC.

The asynchronous IPC NOTIFY mechanism is also not susceptible to resource exhaustion. Event notifications are typed, and at most one bit per type is saved. All pending notifications can be stored in a compact bitmap that is statically declared as part of the process table. Multiple pending notifications of the same type are merged. Although the amount of information that can be passed this way is limited, this design was chosen for its simplicity, reliability, and low memory requirements.

Protection Mechanisms Since IPC is a powerful construct, we included several mechanisms to restrict who can do what. First, we restrict the set of IPC primitives available to each process. User processes, for example, are allowed to use only IPC REQUEST. Second, we restrict who can request services from whom. A user process doing I/O, for example, cannot communicate directly with device drivers, but needs to send its requests to the file server instead. These restrictions also help to prevent deadlocks. Third, we restrict the use of event notifications. Only trusted processes, such as the process manager and file server, can use them. Callback events, in contrast, are also available to untrusted processes, such as drivers.

All these protection mechanisms are implemented by means of bitmaps that are statically declared as part of the process table. This is space efficient, prevents resource exhaustion, and allows for fast permission checks since only simple bit operations are required.

3.2 Process Scheduling

Scheduling is done using a fixed number of prioritized queues. The processes on each queue are kept in a linked list, and are scheduled round robin. The scheduler simply finds the highest populated queue and selects the first process on it to run.

Whenever a process becomes ready, it is put on the head of its queue when it still has some quantum left. A process goes to the rear of its queue only when it has no quantum left. While somewhat counterintuitive, this works because it ensures that processes doing a system call are not moved to the rear of the queue. It also makes the system responsive, since processes that were blocked for I/O can run immediately once the I/O is done.

Each time a process consumes a full quantum, it degrades in priority. Periodically, the priority of all processes is upgraded to prevent all processes from ending up in the lowest-priority queue. Since I/O-bound processes consume fewer quanta than CPU-bound processes, they will have a higher average priority, and are likely to be scheduled when the I/O finishes.

3.3 Interrupt Handling

Another important responsibility of the kernel is interrupt handling. Because this cannot be done in user space, we disentangled interrupt handlers and device drivers. User-mode device drivers can only instruct the kernel to transform specific interrupts into notification messages, and must do all further processing themselves. After registration, drivers can tell the kernel to enable and disable hardware interrupts.

The kernel catches all hardware interrupts with a generic interrupt handler that looks up which drivers are associated with the IRQ line, and sends a nonblocking notification message to each of them. As a side-effect, the generic interrupt handler gathers randomness for the random number generation device.

The only exception to the above is that the clock driver (CLOCK) defines its own handler as part of the kernel, as discussed below. This handler is simple and usually
does only accounting. When more work is needed, such as scheduling another process, a notification is sent to CLOCK for further processing.

All the real work is done at the process level, usually in a user-mode device driver, but sometimes also in a kernel task. This helps to achieve a low interrupt latency—since processes can be preempted—and makes the system suitable for real-time applications.

Figure 4: The kernel closely follows the process-oriented multiserver design. Services can be requested with ordinary, synchronous request messages. Kernel events are transformed into asynchronous notification messages.

Apart from the actual service requested, a kernel call from a user-mode server to a kernel task is similar to a system call from a user process to an operating system server. Both calls use the IPC primitives discussed in Sec. 3.1 and result in a synchronous request message being sent from one process to another. The services provided by the kernel tasks are discussed below.

Only a tiny fraction of the kernel is responsible for handling hardware interrupts, IPC traps, and exceptions. Whenever such an event happens, the CPU saves the state of the currently-running process and invokes the associated service routine that has been registered by the kernel. When the event has been processed, the kernel picks a (possibly different) process to run, restores the process’ state, and tells the CPU to resume normal execution. To keep the kernel simple, kernel reentries are forbidden.

Real-Time Properties Because the kernel is process oriented, it forms a suitable base for real-time systems. The kernel is locked only when this is absolutely required to prevent race conditions. Whenever a hardware interrupt occurs, the currently running kernel task or user process is preempted, the interrupt is transformed into a message, and another process is scheduled. If the interrupt is for a high-priority device, the associated driver is likely to be scheduled soon. Building a complete real-time operating system would require extending the scheduler with real-time primitives.

3.5.1 System Task (SYS)

[…]brary are transformed into request messages that are sent to SYS, which processes the requests, and sends reply messages. SYS never takes initiative by itself, but it is […]

The kernel calls handled by SYS can be grouped into several categories, including process management, memory management, copying data between processes, device I/O and interrupt management, access to kernel data structures, and clock services. An overview of common kernel calls is given in Fig. 5.

Kernel Call     Purpose
SYS FORK        Fork a process; copy parent slot
SYS EXEC        Execute a process; initialize slot
SYS EXIT        Exit a process; clear process slot
SYS NEWMAP      Assign memory segment to process
SYS VIRCOPY     Copy data using virtual addressing
SYS DEVIO       Read or write a single I/O port
SYS IRQCTL      Set or reset an interrupt policy
SYS PRIVCTL     Assign system process’ privileges
SYS GETINFO     Get a copy of kernel information
SYS TIMES       Get process times or kernel uptime
SYS SETALARM    Set or reset a synchronous alarm

Figure 5: A selection of common kernel calls. All calls require privileged operations and are handled by SYS.

3.5.2 Clock Task (CLOCK)

CLOCK is responsible for accounting of CPU usage, scheduling another process when a process’ quantum expires, managing watchdog timers, and interacting
with the hardware clock. It does not have a publicly-accessible user interface like SYS.

When the system starts up, CLOCK programs the hardware clock’s frequency and registers an interrupt handler that is run on every clock tick. The handler does only basic integer operations, that is, it increments a process’ […] is due, a notification is sent to CLOCK to do the real work at the task level. This minimizes the hardware interrupt latency, because kernel tasks can be preempted.

Although CLOCK has no direct interface from user space, its services can be accessed through the kernel calls handled by SYS. The most important call is SYS SETALARM, which allows system processes to schedule a synchronous alarm that causes a ‘timeout’ notification upon expiration. The alarm is synchronous because CLOCK delivers the notification message only when the client indicates it is ready to receive it.

As an aside, the POSIX alarm that is available to ordinary user applications is handled by the user-mode process manager server, and causes an asynchronous SIGALRM signal.

4 THE USER-MODE SERVERS

On top of the kernel we have implemented a multiserver operating system. The core components of this system are shown in Fig. 6. Apart from the device drivers and user processes, this constitutes the trusted computing base. Most of the servers are relatively small and simple. The sizes range from approximately 1000 to 3000 LoC per server, which makes them easy to understand and maintain. The components are discussed below.

Figure 6: The core components of the full multiserver operating system, and some typical IPC paths. Top-down IPC uses synchronous requests, whereas bottom-up IPC is done with asynchronous notifications.

We first give some examples to illustrate how our multiserver operating system actually works. Fig. 6 also shows some typical IPC interactions initiated by user processes. Although the POSIX operating system interface is implemented by multiple servers, system calls are transparently targeted to the right server by the system libraries. Four examples are given below:

(1) A user process that wants to create a child process calls the fork() library function, which sends a request message to the process manager (PM). PM verifies that a process slot is available, asks the memory manager (MM) to allocate memory, and instructs the kernel to create a copy of the process.

(2) A read() or write() call, in contrast, is sent to FS. If the requested block is available in the buffer cache, FS asks the kernel to copy it to the user. Otherwise, it first sends a message to the disk driver asking it to retrieve the block from disk. The driver sets an alarm, commands the disk controller through a device I/O request to the kernel, and awaits the hardware interrupt or timeout notification. For character devices, the user process may be suspended until the driver notifies FS that the data is ready.

(3) Additional servers and drivers can be started on the fly by requesting the reincarnation server (RS). RS then forks a new process, assigns all needed privileges, and, finally, executes the given path in the child process (not shown in the figure). Information about the new system process is published in the data store (DS), which allows parts of the operating system to subscribe to updates in the operating system configuration.

(4) Although not a system call, it is interesting to see what happens if a user or operating system process causes an exception, for example, due to an invalid pointer. In this event, the kernel’s exception handler notifies PM, which transforms the exception into a signal or kills the process when no handler is registered. Recovery in case of operating system failures is discussed below.

Design Principle The general design principle that led to the above set of servers is that each process should be limited to its core business. Having small, well-defined services helps to keep the implementation simple and understandable. As in the original UNIX philosophy, each server has limited responsibility and power, as is reflected in its name.

For example, FS must interact with drivers, but should not be checking for weird driver failures like nonresponsiveness. Although FS could manage driver timeouts itself, this would complicate its design and implementation. Therefore, it relies on a separate component, RS, which is responsible for the system’s well-being. If FS hangs on a driver, RS will detect that the driver is not responding, and will kill the driver and revive FS. Although killing a nonresponsive driver seems harsh, a properly designed driver must adhere to the protocol and return an error code to FS if it cannot fulfill the request.
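The fixed-length message union of Sec. 3.1 and the request-reply path of example (2) can be sketched in a few lines of C. This is a deliberately simplified toy: the struct layouts, type codes, endpoint numbers, and function names are our own illustrations, not the system’s actual definitions, and the blocking rendezvous is stood in for by a direct function call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { M_READ = 1, M_REPLY = 2 };        /* hypothetical message types */

/* Messages are a union of typed structs; the union's size is fixed at
 * compile time as the size of its largest member. */
typedef union {
    struct { int type; int source; int fd; size_t nbytes; char *buf; } m_read;
    struct { int type; int source; int status; } m_reply;
} message;

/* Stub file server: satisfies a read from a fake cache block, copying
 * directly into the caller's buffer (no intermediate kernel buffering),
 * then overwrites the message in place with the rendezvous reply. */
static int fs_handle(message *m)
{
    static const char cache_block[] = "hello";   /* fake buffer cache */
    if (m->m_read.type != M_READ) return -1;
    size_t n = m->m_read.nbytes;
    if (n > sizeof(cache_block) - 1) n = sizeof(cache_block) - 1;
    memcpy(m->m_read.buf, cache_block, n);
    m->m_reply.type = M_REPLY;
    m->m_reply.source = 1;                       /* FS endpoint, made up */
    m->m_reply.status = (int)n;
    return 0;
}

/* Client side of read(): build a fixed-length request and "send" it to
 * FS; in the real system the send would block until FS receives it. */
static int do_read(int fd, char *buf, size_t nbytes)
{
    message m;
    m.m_read.type = M_READ;
    m.m_read.source = 42;                        /* caller endpoint, made up */
    m.m_read.fd = fd;
    m.m_read.nbytes = nbytes;
    m.m_read.buf = buf;
    if (fs_handle(&m) != 0) return -1;
    return m.m_reply.status;
}
```

Because every message has the same size, the kernel can copy it from sender to receiver without dynamic allocation, which is the property the text emphasizes.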
Multiserver Protocol Although our IPC facilities are fairly reliable by design, as discussed in Sec. 3.1, they cannot prevent deadlocks. Therefore, we devised a multiserver protocol that ensures that synchronous messages are sent in only one direction. There is a loose layering based on who can make requests to whom. User processes can call the servers, servers can call each other and drivers, and servers and drivers can call the kernel. IPC in the opposite direction is done using the nonblocking notification mechanism.

Another aspect of the multiserver protocol is that we try to minimize copying to prevent loss of performance. The number of copies required for I/O is precisely the same as in a monolithic system. Although there are more context switches because a user process, the file server, and a driver must interact, no intermediate copies are required. For example, I/O for character devices is not buffered. All data is directly copied between the user process and the device driver.

4.1 Core Components

This section discusses the core components shown in Fig. 6. We will focus on design decisions that are specific to the multiserver aspects of the system.

4.1.1 Process Manager (PM)

PM is responsible for process management, such as creating and removing processes, assigning process identifiers and priorities, and controlling the flow of execution. Furthermore, PM maintains relations between processes, such as process groups and parent-child blood lines. The latter, for example, has consequences for exiting processes and accounting of CPU time.

Although the kernel provides mechanisms, for example, to set up the CPU registers, PM implements the process management policies. As far as the kernel is concerned, all processes are similar; all it does is schedule the highest-priority ready process.

Signal Handling PM is also responsible for POSIX signal handling. When a signal is to be delivered, by default, PM either ignores it or kills the process. Ordinary user processes can register a signal handler to catch signals. In this case, PM interrupts pending system calls and puts a signal frame on the stack of the process to run the handler. This approach is not suitable for system processes, however, as it interferes with IPC. Therefore, we implemented an extension to the POSIX sigaction() call so that system processes can request PM to transform signals into notification messages. Since event notification messages have the highest priority of all message types, signals are delivered promptly.

4.1.2 File Server (FS)

FS manages the file system. It is an ordinary file server that handles standard POSIX calls such as open(), read(), and write(). More advanced functionality includes support for symbolic links and the select() system call. FS is also the interface to the network server.

For performance reasons, file system blocks are buffered in FS’ buffer cache. To maintain file system consistency, however, crucial file system data structures use write-through semantics, and the cache is periodically written to disk.

Since the file server runs as an isolated process that is fully IPC driven, it can be replaced with a different one to serve other file systems, such as FAT. Moreover, it should be relatively straightforward to transform FS into a network file server that runs on a remote host.

Device Driver Handling Because device drivers can be dynamically configured, FS maintains a table with the mapping of major numbers onto specific drivers. As discussed below, FS is automatically notified of changes in the operating system configuration through a publish-subscribe system. This decouples the file server and the drivers it depends on.

A goal of our research is to automatically recover from common driver failures without human intervention. When a disk driver failure is detected, the system can recover transparently by replacing the driver and rewriting the blocks from FS’ buffer cache. For character devices, transparent recovery is also sometimes possible. Such failures are pushed to user space, but may be dealt with by the application if the I/O request can be reissued. A print job, for example, can be reissued by the print spooler system.

4.1.3 Memory Manager (MM)

To facilitate ports to different architectures, we use a hardware-independent, segmented memory model. Memory segments are contiguous, physical memory areas. Each process has a text, stack, and data segment. System processes can be granted access to additional memory segments, such as the video memory or the RAM disk memory. Although the kernel is responsible for hiding the hardware-dependent details, MM does the actual memory management.

MM maintains a list of free memory regions, and can allocate or release memory segments for other system services. Currently MM is integrated into PM and provides support for Intel’s segmented memory model, but work is in progress to split it out and offer limited virtual memory capabilities, for example, shared libraries.
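The free-list management described for MM can be pictured as a first-fit allocator over contiguous physical regions. The following sketch is an illustrative toy under our own assumptions (made-up sizes, table-based hole list, no merging of adjacent holes), not MM’s actual code:

```c
#include <assert.h>

#define NR_HOLES 8

/* A free region of contiguous physical memory (units are arbitrary). */
struct hole { unsigned long base, len; };

static struct hole holes[NR_HOLES] = {
    { 0x0800, 0x0400 },          /* assumed initial free regions */
    { 0x2000, 0x4000 },
};

/* First fit: take the first hole large enough, shrink it, and return
 * the segment base; (unsigned long)-1 signals failure. */
static unsigned long alloc_segment(unsigned long len)
{
    for (int i = 0; i < NR_HOLES; i++) {
        if (holes[i].len >= len) {
            unsigned long base = holes[i].base;
            holes[i].base += len;
            holes[i].len  -= len;
            return base;
        }
    }
    return (unsigned long)-1;
}

/* Releasing a segment records a new hole in an empty table slot; a
 * real implementation would also coalesce adjacent holes. */
static void free_segment(unsigned long base, unsigned long len)
{
    for (int i = 0; i < NR_HOLES; i++) {
        if (holes[i].len == 0) {
            holes[i].base = base;
            holes[i].len = len;
            return;
        }
    }
}
```

Because segments are contiguous, allocation and release are a handful of comparisons over a small static table, in keeping with the paper’s avoidance of dynamic kernel resource allocation.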
We will not support demand paging, however, because we believe physical memory is no longer a limited resource in most domains. We strive to keep the code simple and eliminate complexity whenever possible. Swapping segments to disk would be easy to add, but in the interest of simplicity we have not done so.

4.1.4 Reincarnation Server (RS)

RS is the central component responsible for managing all operating system servers and drivers. While PM is responsible for process management in general, RS deals with only privileged processes. It acts as a guardian and ensures liveness of the operating system.

Administration of system processes also goes through RS. A utility program, service, provides the user with a convenient interface to RS. It allows the system administrator to start and stop system services, (re)set their policies, or gather statistics. For optimal flexibility in specifying policies, a shell script can be set to run on certain events, including device driver crashes.

Fault Set The fault set that RS deals with consists of protocol errors, transient failures, and aging bugs. Protocol errors mean that a system process does not adhere to the multiserver protocol, for example, by failing to respond to a request. Transient failures are problems caused by specific configuration or timing issues that are unlikely to recur. Aging bugs are implementation problems that cause a component to fail over time, for example, when it runs out of buffers due to memory leaks.

Logical errors, where a server or driver perfectly adheres to the specified system behavior but fails to perform the actual request, are excluded. An example of a logical error is a printer driver that accepts a print job and confirms that the printout was successfully done, but, in fact, prints garbage. Such bugs are virtually impossible to catch in any system.

Fault Detection and Recovery During system initialization, RS adopts all processes in the boot image as its children. System processes that are started later also become children of RS. This ensures immediate crash detection, because PM raises a SIGCHLD signal that is delivered to RS when a system process exits.

In addition, RS can check the liveness of the system. If the policy says so, RS does a periodic status check and expects a reply in the next period. Failure to respond will cause the process to be killed. The status requests and the consequent replies are sent using a nonblocking event notification.

Whenever a problem is detected, RS can replace the malfunctioning component with a fresh copy. The associated policy script, however, might not restart the component, which is useful, for example, for development purposes. Another policy might use a binary exponential backoff protocol when restarting components to prevent clogging the system due to repeated failures. In any event, the problems are logged so that the system administrator can always find out what happened. Optionally, an e-mail can be sent to a remote administrator.

Failed components can be restarted from a fresh copy on disk, except for the disk driver, which is restarted from a copy kept in RAM.

4.1.5 Data Store (DS)

DS is a small database server with publish-subscribe functionality. It serves two purposes. First, system processes can use it to store some data privately. This redundancy is useful in the light of fault tolerance. A restarting system service, for example, can request state that it lost when it crashed. Such data is not publicly accessible.

Second, the publish-subscribe mechanism is the glue between operating system components. It provides a flexible interaction mechanism and elegantly reduces dependencies by decoupling producers and consumers. A producer can publish data with an associated identifier. A consumer can subscribe to selected events by specifying the identifiers or regular expressions it is interested in. Whenever a piece of data is updated, DS automatically broadcasts notifications to all dependent components. Although we currently do not do this, in the future, drivers could announce every request message and I/O completion to DS. In this manner, if a driver crashes, its replacement could find out what work was pending, similar to the shadow drivers in Nooks [19].

Naming Service IPC endpoints are formed by the process and generation numbers, which are controlled and managed by the kernel. Because every process has a unique IPC endpoint, system processes cannot easily find each other. Therefore, we introduced stable identifiers that consist of a natural-language name plus an optional number. The identifiers are globally known. Whenever a system process is (re)started, RS publishes its identifier and the associated IPC endpoint at DS for future lookup by other system services.

In contrast to earlier systems, such as Mach, our naming service is a higher-level construction that is realized in user space. Mach provided stable IPC endpoints in the kernel, namely the ‘port’ mechanism, to which a client and server could attach. This required bookkeeping in the kernel and did not solve the problems introduced by exiting and reappearing system services. We have intentionally pushed all this complexity to user space.
Error Handling Since fault tolerance is an explicit design goal, the naming service is an integral part of the design. The publish-subscribe mechanism of DS makes it very suitable to inform other processes of changes in the operating system. Moreover, recovery of, say, a driver is made explicit to the services that depend on it.

For example, FS subscribes to the identifier for the disk drivers. When the system configuration changes, DS notifies FS about the event. FS then calls back to find out what happened. If FS discovers that a driver has been restarted, it tries to recover transparently to the user.

4.1.6 Device Drivers

All operating systems hide the raw hardware under a layer of device drivers. Consequently, we have implemented drivers for ATA, S-ATA, floppy, and RAM disks, keyboards, displays, audio, printers, serial lines, various Ethernet cards, etc.

Although device drivers can be very challenging technically, they are not very interesting in the operating system design space. What is important, though, is that each of ours runs as an independent user-mode process to prevent faults from spreading outside its address space and to make it easy to replace a crashed or looping driver without a reboot. This is the self-healing property referred to in the title. While other people have measured the performance of user-mode drivers [10], no currently-available system is self-healing like this.

We are obviously aware that not all bugs can be eliminated by restarting a failed driver, but since the bugs that make it past driver testing tend to be timing bugs or memory leaks rather than algorithmic bugs, a restart often does the job.

4.2 Optional Components

In addition to the core components discussed above, several operating system services are started on the fly with the help of RS. The most important ones are discussed here.

[…]structures at any time. Examples include the process table of the kernel or PM, the device driver mappings at FS, and the status of privileged processes at RS.

4.2.2 Network Server (INET)

The network server, INET, implements TCP/IP in user space. The interface offered to the application programmer is BSD sockets. As with other I/O handles, sockets are managed by the file server, but FS transparently forwards networking requests to INET, which manages both TCP streams and UDP datagrams.

Like FS, INET requests DS to be notified about the configuration of Ethernet drivers, and can handle driver crashes. Since the TCP protocol prescribes retransmission of lost packets (and lost datagrams are explicitly allowed by the UDP protocol), INET can fully recover from Ethernet driver failures, transparently to the user.

4.2.3 X Window System (X)

To demonstrate that our ideas are practical in real UNIX-like systems, we have ported a recent version of the X Window System. X provides a client-server interface between display hardware and the desktop environment. We have successfully run large GUI applications, including the Firefox browser, over a network. A downside of X is that it is a large, monolithic window system, but it clearly shows that our system can run real-world software. Future work might include porting a small, modular window system.

5 BENEFITS OF THIS DESIGN

Although the main reason for using a multiserver architecture is achieving extremely high reliability, there are other advantages to this approach. In this section, we will highlight some benefits of our system for programmers, system administrators, and end users.
No need for a reboot and a check of the file system. Crashes result in core dumps that can be inspected and debugged using normal tools.

The programming model is more convenient, because the user-mode programming model is much closer to the POSIX standard than the restricted kernel API. This lowers the barrier for experimentation by new users and might improve driver quality. Furthermore, user-mode drivers enforce a proper design respecting interfaces, which leads to cleaner code.

Finally, there is good accountability. RS logs all component crashes so that it is clear where the error occurred. This might have legal liability implications for the developer, and lead to more carefully crafted drivers.

5.2 System Administration

The multiserver approach also makes system administration much easier due to the presence of many small, well-understood, self-contained modules instead of a massive kernel. As mentioned above, the kernel consists of less than 4000 LoC and the core servers are even smaller. This improves the system's maintainability, because the components are easy to understand and can be maintained independently from each other, as long as the interfaces between them are respected.

Because the operating system can be dynamically configured, a system administrator can quickly respond to security breaches. Instead of applying a patch and rebooting the system, a malfunctioning device driver can be replaced with a new one on the fly. This allows updates without loss of service or downtime.

Configurability  The multiserver model makes it easy to configure a system. The core components are always present to provide the basic operating system services, but optional components can be installed and loaded at a later time, without having to reboot the system.

Small consumer appliances, such as mobile phones and PDAs, and embedded systems need a small, configurable operating system. Trying to squeeze a large monolithic system like Linux into a small device requires much more effort than mixing and matching parts from a modular system that started small.

5.3 End-User Reliability

In the previous sections we already discussed how multiple, independent servers and drivers help to improve the end-user reliability of an operating system. Our system's major reliability features are briefly summarized below.

Structural Measures  The system is designed to prevent problems from occurring in the first place. For example, in the design of our device drivers we use shared library code. This helps to improve reliability since the code is thoroughly tested by the many drivers that use it. We also postpone initialization of drivers until their first use so that they cannot hang the system at boot time.

The use of separate memory segments protects against many types of buffer overruns. Code injection is no longer possible because the text segment is read-only and the stack and data segments are not executable. Consequently, even if a buffer overrun injects a worm or virus onto the stack, this code cannot be executed. While other types of attacks exist, for example, the return-to-libc attack, they are harder to exploit.

Problem Detection  Since all operating system services run as separate user-mode processes, we can detect many problems, just like we can for ordinary applications. Invalid pointers or illegal access attempts are caught by the MMU and will cause a signal from PM. The scheduler's feedback mechanism tames infinite loops by lowering a process' priority. Furthermore, a process' security policy is checked whenever it makes a system call.

Security Policies  Each user, server, and driver process has an associated policy that specifies what it can do. Only the minimal privileges needed to perform the task are given, according to the principle of least authority. In contrast, in a monolithic operating system it usually is not possible to precisely restrict individual components.

Driver access to I/O ports can be limited by a range stored in the kernel's process table. In this way, if, say, the printer driver tries to write to the disk's I/O ports, the kernel can prevent the access. Stopping rogue DMA is not possible with current hardware, but as soon as an I/O MMU is added, we can prevent that, too.

Furthermore, we can tightly restrict the IPC capabilities of each process, as discussed in Sec. 3.1. User applications, for example, can use only IPC REQUEST, and only to a subset of the operating system servers.

Fault Detection and Recovery  Sec. 4.1 discussed how RS deals with crashes at the operating system level. RS provides immediate crash detection and does periodic status checks if the policy says so. Depending on the policy, failing components are automatically restarted. If a block device driver fails, FS can provide transparent recovery to the application level by flushing its buffer cache. For character device drivers the error is pushed to the user level, where transparent recovery is sometimes possible. Typically, daemons need to be rewritten to retry instead of giving up on the first failure.
6 PERFORMANCE

Multiserver systems based on microkernels have been criticized for decades because of alleged performance problems. We argue that a modular system need not be slow due to the additional copying and context-switching overhead introduced when multiple servers cooperate to perform a task. While this was the case for some early microkernel systems, current multiserver systems have competitive performance.

As a case in point, BSD UNIX on top of the early Mach microkernel was well over 50% slower than the normal version of BSD UNIX, which led to the impression that microkernels are slow. Modern microkernels, however, have proven that high performance actually can be realized. L4 Linux on top of L4, for example, has a performance loss of about 5% [6]. Another project recently demonstrated that a user-mode gigabit Ethernet driver can achieve the same performance as a kernel-mode driver up to 750 Mbps. Above that, throughput for the user-mode driver dropped by 7% [10].

We have done extensive measurements of our system and presented the results in a technical report [7]. We can summarize these results (obtained on a 2.2 GHz Athlon) as follows. The simplest system call, getpid, takes 1.011 microseconds, which includes two messages and four context switches. Rebuilding the full system, which is heavily disk bound, has an overhead of 7%. Jobs with mixed computing and I/O, such as sorting, sedding, grepping, prepping, and uuencoding a 64-MB file, have overheads of 4%, 6%, 1%, 9%, and 8%, respectively. The system can do a build of the kernel and all user-mode servers and drivers in the boot image within 6 sec. In that time it performs 112 compilations and 11 links (about 50 msec per compilation). Fast Ethernet easily runs at full speed, and initial tests show that we can also drive gigabit Ethernet at full speed. Finally, the time from exiting the multiboot monitor to the login prompt is under 5 sec.

It has to be noted that the prototype incorporates many new security checks that cause some overhead. Furthermore, we have not yet done any performance optimizations. Careful analysis and removal of bottlenecks may boost the performance. We believe a performance penalty of less than 5% is realistic.

7 RELATED WORK

In this section we review some related operating systems. Note that we survey complete operating systems and not individual kernels to make the comparison fair. In other words, we compare our complete POSIX-conformant operating system to other systems that provide a comparable full system call interface.

7.1 Virtual Machines and Exokernels

Virtual machines [16] and exokernels [3] do not provide a hardware abstraction layer like (other) operating systems. Instead, they respectively duplicate or partition the available hardware resources so that multiple operating systems can run next to each other, each with the illusion of having a private machine. A virtual machine monitor or exokernel runs in kernel mode and is responsible for the protection of resources and the multiplexing of hardware requests, whereas each operating system runs in user mode, fully isolated from the others. These technologies provide an interface to an operating system, but do not represent a complete system by themselves. Neither approach solves the problem we try to tackle: how to build a reliable operating system that can heal itself after a fatal bug has been triggered in a device driver or server.

7.2 Monolithic Systems

A monolithic system runs the entire operating system in kernel mode without proper fault isolation. Although these properties negatively affect the system's reliability, as discussed in Sec. 2.1, many operating systems have a monolithic design.

Windows XP and Vista  Microsoft Windows XP is an example of a monolithic system that runs the entire operating system in kernel mode. Although Microsoft once tried to move some components to user space, the performance penalty was deemed too high.

Faster hardware and consumer demands for reliability made Microsoft revisit this design decision. In Vista, Microsoft plans to run many device drivers and the graphics subsystem in user mode, thus demonstrating that Microsoft has come to the same insight that we have: user-mode drivers are the way to go.

Linux and Isolated Drivers  Linux has basically the same monolithic style as Windows. One major difference with Windows is that the graphics subsystem in Linux, as in other UNIX systems, has always been in user space. As the system ages, it acquires more and more functionality that ends up in the kernel, with all the consequences for maintainability and reliability.

An important project to improve the reliability of commodity systems such as Linux is Nooks [19, 20]. Nooks keeps device drivers in the kernel but transparently encloses them in a kind of lightweight protective wrapper so that driver bugs cannot propagate to other parts of the operating system. All traffic between the driver and the rest of the kernel is inspected by the reliability layer.

Another project uses virtual machines to isolate device drivers from the rest of the system [11, 12]. When
a driver is called, it is run on a different virtual machine than the main system so that a crash or other fault does not pollute the main system. In addition to isolation, this technique enables unmodified reuse of device drivers when experimenting with new operating systems.

A recent project ran Linux device drivers in user mode with small changes to the Linux kernel [10]. This work shows that drivers can be isolated in separate user-mode processes without significant performance degradation.

While isolating device drivers helps to improve the reliability of legacy operating systems, we believe a proper, modular design from scratch gives better results. This includes encapsulating all operating system components (e.g., the file system and memory manager) in independent, user-mode processes.

Mac OS X  Apple's Mac OS X is yet another example of a monolithic kernel, but it has a layered kernel structure. The lowest layer is a microkernel based on Mach. The BSD UNIX personality is part of the kernel, as are various other components. This makes it simply a differently-structured monolithic kernel, and does not add many reliability features.

VxWorks  VxWorks is a POSIX-compliant, real-time operating system, generally used in embedded systems. The core of VxWorks is referred to as the 'Wind' microkernel, but, in fact, the kernel contains the operating system, including device drivers, and thus has a monolithic structure. VxWorks has historically provided only kernel mode, requiring users to develop exclusively in this mode. Only recently has VxWorks AE provided a real-time process model that enables memory protection between processes. While this makes it possible to run isolated user-mode applications, it is up to the developer to enable the protection; ours is mandatory.

7.3 Single-Server Systems

While these systems are more modular than the previous examples, the operating system still runs as a huge, monolithic server. Several systems follow this design.

L4 Linux-based Systems  A typical example of a single-server system is L4 Linux, in which Linux is run on top of the L4 microkernel. User processes obtain operating system services by making remote procedure calls to the Linux server using L4's IPC mechanism. Measurements show the performance penalty over native Linux to be about 5% [6].

A real-time system built with the help of L4 Linux is DROPS [5]. It is targeted toward multimedia applications. However, most of the device drivers still run as part of a big L4 Linux server, with only the multimedia subsystem running separately.

Perseus [13] is another L4 Linux-based system. It was designed to provide secure digital signatures while still supporting legacy applications for Linux. L4 Linux is used for most operations, but whenever a document needs to be signed, control is given to a trusted subsystem that includes a signature server.

The problem with all these systems is that a single bug in, say, a device driver can still crash the entire operating system server. The only gain of this design from a reliability point of view is a faster reboot.

7.4 Multiserver Systems

Multiserver operating systems distribute functionality over multiple, isolated components. Although several multiserver systems exist, to the best of our knowledge, nobody has yet built and released a working, fully modular, open-source, multiserver operating system.
Singularity  A recent multiserver system developed by Microsoft Research is Singularity [9]. In contrast to other systems, Singularity is based on language safety and bypasses the hardware protection offered by the MMU. The trusted base consists of the parts of the kernel and run-time system that are not verifiably safe. The operating system is run on top of the kernel as a set of verifiably safe, software-isolated servers, each running under the control of its own run-time system. A contract specifies allowable interactions via state-machine-driven IPC declarations. These contracts can be statically verified, but are complex and hard to get correct without knowledge of formal specifications. Building applications for Singularity means a paradigm shift for the programmer, making it less suitable for large-scale adoption.

Symbian OS  This operating system is designed for small, handheld devices, especially mobile phones. Symbian shares many characteristics with monolithic systems. For example, process management, memory management, device drivers, and dynamically loadable modules are all implemented in the kernel. Only the file server and the networking and telephony stacks are hosted in user-mode servers.

8 CONCLUSION

Our system represents a new data point in the spectrum from monolithic to fully modular structure. The design consists of a small kernel running the entire operating system as a collection of independent, isolated, user-mode processes. While people have tried for years to produce a fully modular microkernel-based UNIX clone with decent performance (such as GNU Hurd), we have actually done it, tested it heavily, and released it.

The kernel implements only the minimal mechanisms required to build an operating system upon. It provides IPC, scheduling, and interrupt handling, and contains two kernel tasks (SYS and CLOCK) to support the user-mode operating system parts. The core servers are the process manager (PM), memory manager (MM), file server (FS), reincarnation server (RS), and data store (DS). Since the size of these components ranges from about 1000 to 4000 lines of code, they are easy to understand and maintain.

Additional operating system services, such as device drivers, the information server (IS), the window system (X), and the network server (INET), can be started on the fly and are guarded by RS. The system is robust and self-healing, so that it can withstand and automatically recover from common failures in these components, transparently to applications and without user intervention.