Nightingale 06
USENIX Association OSDI ’06: 7th USENIX Symposium on Operating Systems Design and Implementation 1
In contrast to asynchronous I/O, which improves performance by substantially weakening these guarantees, externally synchronous I/O provides the same guarantees, but it changes the clients to which the guarantees are provided. Synchronous I/O reflects the application-centric view of modern operating systems. The return of a synchronous file system call guarantees durability to the application since the calling process is blocked until modifications commit. In contrast, externally synchronous I/O takes a user-centric view in which it guarantees durability not to the application, but to any external entity that observes application output. An externally synchronous system returns control to the application before committing data. However, it subsequently buffers all output that causally depends on the uncommitted modification. Buffered output is only externalized (sent to the screen, network, or other external device) after the modification commits.

From the viewpoint of an external observer such as a user or an application running on another computer, the guarantees provided by externally synchronous I/O are identical to the guarantees provided by a traditional file system mounted synchronously. An external observer never sees output that depends on uncommitted modifications. Since external synchrony commits modifications to disk in the order they are generated by applications, an external observer will not see a modification unless all other modifications that causally precede that modification are also visible. However, because externally synchronous I/O rarely blocks applications, its performance approaches that of asynchronous I/O.

Our externally synchronous Linux file system, xsyncfs, uses mechanisms developed as part of the Speculator project [17]. When a process performs a synchronous I/O operation, xsyncfs validates the operation, adds the modifications to a file system transaction, and returns control to the calling process without waiting for the transaction to commit. However, xsyncfs also taints the calling process with a commit dependency that specifies that the process is not allowed to externalize any output until the transaction commits. If the process writes to the network, screen, or other external device, its output is buffered by the operating system. The buffered output is released only after all disk transactions on which the output depends commit. If a process with commit dependencies interacts with another process on the same computer through IPC such as pipes, the file cache, or shared memory, the other process inherits those dependencies so that it also cannot externalize output until the transaction commits. The performance of xsyncfs is generally quite good since applications can perform computation and initiate further I/O operations while waiting for a transaction to commit. In most cases, output is delayed by no more than the time to commit a single transaction — this is typically less than the perception threshold of a human user.

Xsyncfs uses output-triggered commits to balance throughput and latency. Output-triggered commits track the causal relationship between external output and file system modifications to decide when to commit data. Until some external output is produced that depends upon modified data, xsyncfs may delay committing data to optimize for throughput. However, once some output is buffered that depends upon an uncommitted modification, an immediate commit of that modification is triggered to minimize latency for any external observer.

Our results to date are very positive. For I/O-intensive benchmarks such as Postmark and an Andrew-style build, the performance of xsyncfs is within 7% of the default asynchronous implementation of ext3. Compared to current implementations of synchronous I/O in the Linux kernel, external synchrony offers better performance and better reliability. Xsyncfs is up to an order of magnitude faster than the default version of ext3 mounted synchronously, which allows data to be lost on power failure because committed data may reside in the volatile hard drive cache. Xsyncfs is up to two orders of magnitude faster than a version of ext3 that guards against losing data on power failure. Xsyncfs sometimes even improves the performance of applications that do their own custom synchronization. Running on top of xsyncfs, the MySQL database executes a modified version of the TPC-C benchmark up to three times faster than when it runs on top of ext3 mounted asynchronously.

2 Design overview

2.1 Principles

The design of external synchrony is based on two principles. First, we define externally synchronous I/O by its externally observable behavior rather than by its implementation. Second, we note that application state is an internal property of the computer system. Since application state is not directly observable by external entities, the operating system need not treat changes to application state as an external output.

Synchronous I/O is usually defined by its implementation: an I/O is considered synchronous if the calling application is blocked until after the I/O completes [26]. In contrast, we define externally synchronous I/O by its observable behavior: we say that an I/O is externally synchronous if the external output produced by the computer system cannot be distinguished from output that could have been produced if the I/O had been synchronous.

The next step is to precisely define what is considered external output. Traditionally, the operating system takes
an application-centric view of the computer system, in which it considers applications to be external entities observing its behavior. This view divides the computer system into two partitions: the kernel, which is considered internal state, and the user level, which is considered external state. Using this view, the return from a system call is considered an externally visible event.

However, users, not applications, are the true observers of the computer system. Application state is only visible through output sent to external devices such as the screen and network. By regarding application state as internal to the computer system, the operating system can take a user-centric view in which only output sent to an external device is considered externally visible. This view divides the computer system into three partitions: the kernel and applications, both of which are considered internal state, and the external interfaces, which are considered externally visible. Using this view, changes to application state, such as the return from a system call, are not considered externally visible events.

The operating system can implement user-centric guarantees because it controls access to external devices. Applications can only generate external events with the cooperation of the operating system. Applications must invoke this cooperation either directly by making a system call or indirectly by mapping an externally visible device.

[Figure 1. This figure shows the behavior of a sample application that makes two file system modifications, then displays output to an external device. The diagram on the left shows how the application executes when its file I/O is synchronous; the diagram on the right shows how it executes when its file I/O is externally synchronous.]

2.2 Correctness

Figure 1 illustrates these principles by showing an example single-threaded application that makes two file system modifications and writes some output to the screen. In the diagram on the left, the file modifications made by the application are synchronous. Thus, the application blocks until each modification commits.

We say that external output of an externally synchronous system is equivalent to the output of a synchronous one if (a) the values of the external outputs are the same, and (b) the outputs occur in the same causal order, as defined by Lamport's happens-before relation [9]. We consider disk commits external output because they change the stable image of the file system. If the system crashes and reboots, the change to the stable image is visible. Since the operating system cannot control when crashes occur, it must treat disk commits as external output. Thus, in Figure 1(a), there are three external outputs: the two commits and the message displayed on the screen.

An externally synchronous file I/O returns the same result to applications that would have been returned by a synchronous I/O. The file system does all processing that would be done for a synchronous I/O, including validation and changing the volatile (in-memory) state of the file system, except that it does not actually commit the modification to disk before returning. Because the results that an application sees from an externally synchronous I/O are equivalent to the results it would have seen if the I/O had been synchronous, the external output it produces is the same in both cases.

An operating system that supports external synchrony must ensure that external output occurs in the same causal order that would have occurred had I/O been performed synchronously. Specifically, if an external output causally follows an externally synchronous file I/O, then that output cannot be observed before the file I/O has been committed to disk. In the example, this means that the second file modification made by the application cannot commit before the first, and that the screen output cannot be seen before both modifications commit.
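To make the correctness argument concrete, the following toy model (our own illustration, not the xsyncfs implementation; all names are hypothetical) buffers external output and triggers a group commit before anything is externalized, so an observer sees a commit strictly before any output that causally depends on it:

```python
# Toy model of externally synchronous I/O. Illustrative only; names are
# hypothetical and do not reflect the real xsyncfs/Speculator code.

class ExtSyncKernel:
    def __init__(self):
        self.active = []          # modifications in the active transaction
        self.externalized = []    # what an external observer sees, in order

    def write(self, modification):
        # Return immediately: the modification only joins the active
        # transaction; durability is deferred until commit.
        self.active.append(modification)

    def output(self, message):
        # The output depends on all uncommitted modifications, so a commit
        # is triggered before the output is externalized.
        self.commit()
        self.externalized.append(("screen", message))

    def commit(self):
        if self.active:
            # Group commit: all pending modifications become durable
            # atomically, as one externally observable event.
            self.externalized.append(("commit", tuple(self.active)))
            self.active = []

k = ExtSyncKernel()
k.write("mod1")      # returns without blocking
k.write("mod2")      # grouped with mod1 in the same transaction
k.output("done")     # triggers the commit, then shows the message

# The observer sees the group commit strictly before the screen output,
# exactly as if the two writes had been synchronous.
assert k.externalized == [("commit", ("mod1", "mod2")), ("screen", "done")]
```

In this sketch the commit itself is modeled as an externally observable event, mirroring the treatment of disk commits as external output in Section 2.2.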
2.3 Improving performance

The externally synchronous system in Figure 1(b) makes two optimizations to improve performance. First, the two modifications are group committed as a single file system transaction. Because the commit is atomic, the effects of the second modification are never seen unless the effects of the first are also visible. Grouping multiple modifications into one transaction has many benefits: the commit of all modifications is done with a single sequential disk write, writes to the same disk block are coalesced in the log, and no blocks are written to disk at all if data writes are closely followed by deletion. For example, ext3 employs value logging — when a transaction commits, only the latest version of each block is written to the journal. If a temporary file is created and deleted within a single transaction, none of its blocks are written to disk. In contrast, a synchronous file system cannot group multiple modifications for a single-threaded application because the application does not begin the second modification until after the first commits.

The second optimization is buffering screen output. The operating system must delay the externalization of the output until after the commit of the file modifications to obey the causal ordering constraint of externally synchronous I/O. One way to enforce this ordering would be to block the application when it initiates external output. However, the asynchronous nature of the output enables a better solution. The operating system instead buffers the output and allows the process that generated the output to continue execution. After the modifications are committed to disk, the operating system releases the output to the device for which it was destined.

This design requires that the operating system track the causal relationship between file system modifications and external output. When a process writes to the file system, it inherits a commit dependency on the uncommitted data that it wrote. When a process with commit dependencies modifies another kernel object (process, pipe, file, UNIX socket, etc.) by executing a system call, the operating system marks the modified objects with the same commit dependencies. Similarly, if a process observes the state of another kernel object with commit dependencies, the process inherits those dependencies. If a process with commit dependencies executes a system call for which the operating system cannot track the flow of causality (e.g., an ioctl), the process is blocked until its file system modifications have been committed. Any external output inherits the commit dependencies of the process that generated it — the operating system buffers the output until the last dependency is resolved by committing modifications to disk.

2.4 Deciding when to commit

An externally synchronous file system uses the causal relationship between external output and file modifications to trigger commits. There is a well-known tradeoff between throughput and latency for group commit strategies. Delaying a group commit in the hope that more modifications will occur in the near future can improve throughput by amortizing more modifications across a single commit. However, delaying a commit also increases latency — in our system, commit latency is especially important because output cannot be externalized until the commit occurs.

Latency is unimportant if no external entity is observing the result. Specifically, until some output is generated that causally depends on a file system transaction, committing the transaction does not change the observable behavior of the system. Thus, the operating system can improve throughput by delaying a commit until some output that depends on the transaction is buffered (or until some application that depends on the transaction blocks due to an ioctl or similar system call). We call this strategy output-triggered commits since the attempt to generate output that is causally dependent upon modifications to be written to disk triggers the commit of those modifications.

Output-triggered commits enable an externally synchronous file system to maximize throughput when output is not being displayed (for example, when it is piped to a file). However, when a user could be actively observing the results of a transaction, commit latency is small.

2.5 Limitations

One potential limitation of external synchrony is that it complicates application-specific recovery from catastrophic media failure because the application continues execution before such errors are detected. Although the kernel validates each modification before writing it to the file cache, the physical write of the data to disk may subsequently fail. While smaller errors such as a bad disk block are currently handled by the disk or device driver, a catastrophic media failure is rarely masked at these levels. Theoretically, a file system mounted synchronously could propagate such failures to the application. However, a recent survey of common file systems [20] found that write errors are either not detected by the file system (ext3, JFS, and NTFS) or induce a kernel panic (ReiserFS). An externally synchronous file system could propagate failures to applications by using Speculator to checkpoint a process before it modifies the file system. If a catastrophic failure occurs, the process would be rolled back and notified of the failure. We rejected this solution because it would both greatly increase the complexity
of external synchrony and severely penalize its performance. Further, it is unclear that catastrophic failures are best handled by applications — it seems best to handle them in the operating system, either by inducing a kernel panic or (preferably) by writing data elsewhere.

Another limitation of external synchrony is that the user may have some temporal expectations about when modifications are committed to disk. As defined so far, an externally synchronous file system could indefinitely delay committing data written by an application with no external output. If the system crashes, a substantial amount of work could be lost. Xsyncfs therefore commits data every 5 seconds, even if no output is produced. The 5-second commit interval is the same value used by ext3 mounted asynchronously.

A final limitation of external synchrony is that modifications to data in two different file systems cannot be easily committed with a single disk transaction. Potentially, we could share a common journal among all local file systems, or we could implement a two-phase commit strategy. However, a simpler solution is to block a process with commit dependencies for one file system before it modifies data in a second. Speculator would map each dependency to a specific file system. When a process writes to a file system, Speculator would verify that the process depends only on the file system it is modifying; if it depends on another file system, Speculator would block it until its previous modifications commit.

3 Implementation

3.1 External synchrony

We next provide a brief overview of Speculator [17] and how it supports externally synchronous file systems.

3.1.1 Speculator background

Speculator improves the performance of distributed file systems by hiding the performance cost of remote operations. Rather than block during a remote operation, a file system predicts the operation's result, then uses Speculator to checkpoint the state of the calling process and speculatively continue its execution based on the predicted result. If the prediction is correct, the checkpoint is discarded; if it is incorrect, the calling process is restored to the checkpoint, and the operation is retried.

Speculator adds two new data structures to the kernel. A speculation object tracks all process and kernel state that depends on the success or failure of a speculative operation. Each speculative object in the kernel has an undo log that contains the state needed to undo speculative modifications to that object. As processes interact with kernel objects by executing system calls, Speculator uses these data structures to track causal dependencies. For example, if a speculative process writes to a pipe, Speculator creates an entry in the pipe's undo log that refers to the speculations on which the writing process depends. If another process reads from the pipe, Speculator creates an undo log entry for the reading process that refers to all speculations on which the pipe depends.

Speculator ensures that speculative state is never visible to an external observer. If a speculative process executes a system call that would normally externalize output, Speculator buffers its output until the outcome of the speculation is decided. If a speculative process performs a system call that Speculator is unable to handle by either transferring causal dependencies or buffering output, Speculator blocks it until it becomes non-speculative.

3.1.2 From speculation to synchronization

Speculator ties dependency tracking and output buffering to other features, such as checkpoint and rollback, that are not needed to support external synchrony. Worse yet, these unneeded features come at a substantial performance cost. This led us to factor out the functionality in Speculator common to both speculative execution and external synchrony. We modified the Speculator interface to allow each file system to specify the additional Speculator features that it requires. This allows a single computer to run both a speculative distributed file system and an externally synchronous local file system.

Both speculative execution and external synchrony enforce restrictions on when external output may be observed. Speculative execution allows output to be observed based on correctness; output is externalized after all speculations on which that output depends have proven to be correct. In contrast, external synchrony allows output to be observed based on durability; output is externalized after all file system operations on which that output depends have been committed to disk.

In external synchrony, a commit dependency represents the causal relationship between kernel state and an uncommitted file system modification. Any kernel object that has one or more associated commit dependencies is referred to as uncommitted. Any external output from a process that is uncommitted is buffered within the kernel until the modifications on which the output depends have been committed. In other words, uncommitted output is never visible to an external observer.

When a process writes to an externally synchronous file system, Speculator marks the process as uncommitted. It also creates a commit dependency between the process and the uncommitted file system transaction that contains the modification. When the file system commits the transaction to disk, the commit dependency is removed.
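The commit-dependency bookkeeping described above can be sketched as follows (an illustrative model with hypothetical names; the real Speculator operates on kernel objects and undo logs rather than Python dictionaries). A process that writes to the file system becomes uncommitted, its dependencies flow through IPC objects to other processes, and buffered output is released only when its last dependency is resolved:

```python
# Sketch of commit-dependency propagation. Names are hypothetical; this is
# not the Speculator implementation, only a model of its dependency rules.

class Tracker:
    def __init__(self):
        self.deps = {}       # kernel object -> set of uncommitted txn ids
        self.buffered = []   # (output, pending dependency set)
        self.released = []   # output now visible to an external observer

    def dep_set(self, obj):
        return self.deps.setdefault(obj, set())

    def fs_write(self, process, txn_id):
        # Writing to the file system makes the process uncommitted.
        self.dep_set(process).add(txn_id)

    def modify(self, process, obj):
        # A kernel object modified via a system call (pipe, file, socket)
        # is marked with the writer's commit dependencies...
        self.dep_set(obj).update(self.dep_set(process))

    def observe(self, process, obj):
        # ...and a process that observes the object inherits them too.
        self.dep_set(process).update(self.dep_set(obj))

    def external_output(self, process, data):
        # Output is buffered until every inherited dependency commits.
        self.buffered.append((data, set(self.dep_set(process))))

    def commit(self, txn_id):
        # Resolve the dependency everywhere; release output whose last
        # dependency has now been committed to disk.
        for s in self.deps.values():
            s.discard(txn_id)
        remaining = []
        for data, pending in self.buffered:
            pending.discard(txn_id)
            if pending:
                remaining.append((data, pending))
            else:
                self.released.append(data)
        self.buffered = remaining

t = Tracker()
t.fs_write("P1", txn_id=1)       # P1 now depends on transaction 1
t.modify("P1", "pipe")           # the pipe inherits P1's dependencies
t.observe("P2", "pipe")          # P2 inherits them by reading the pipe
t.external_output("P2", "hello")
assert t.released == []          # buffered: transaction 1 not yet durable
t.commit(1)
assert t.released == ["hello"]   # released once the last dependency resolves
```

Because dependencies are never rolled back, the model needs no checkpoints — mirroring why commit dependencies are considerably cheaper than speculations.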
Once all commit dependencies for buffered output have been removed, Speculator releases that output to the external device to which it was written. When the last commit dependency for a process is discarded, Speculator marks the process as committed.

Speculator propagates commit dependencies among kernel objects and processes using the same mechanisms it uses to propagate speculative dependencies. However, since external synchrony does not require checkpoint and rollback, the propagation of dependencies is considerably easier to implement. For instance, before a process inherits a new speculative dependency, Speculator must checkpoint its state with a copy-on-write fork. In contrast, when a process inherits a commit dependency, no checkpoint is needed since the process will never be rolled back. To support external synchrony, Speculator maintains the same many-to-many relationship between commit dependencies and undo logs as it does for speculations and undo logs. Since commit dependencies are never rolled back, undo logs need not contain data to undo the effects of an operation. Therefore, undo logs in an externally synchronous system only track the relationship between commit dependencies and kernel objects and reveal which buffered output can be safely released. This simplicity enables Speculator to support more forms of interaction among uncommitted processes than it supports for speculative processes. For example, checkpointing multi-threaded processes for speculative execution is a thorny problem [17, 21]. However, as discussed in Section 3.5, tracking their commit dependencies is substantially simpler.

3.2 File system support for external synchrony

We modified ext3, a journaling Linux file system, to create xsyncfs. In its default ordered mode, ext3 writes only metadata modifications to its journal. In its journaled mode, ext3 writes both data and metadata modifications. Modifications from many different file system operations may be grouped into a single compound journal transaction that is committed atomically. Ext3 writes modifications to the active transaction — at most one transaction may be active at any given time. A commit of the active transaction is triggered when journal space is exhausted, an application performs an explicit synchronization operation such as fsync, or the oldest modification in the transaction is more than 5 seconds old. After the transaction starts to commit, the next modification triggers the creation of a new active transaction. Only one transaction may be committing at any given time, so the next transaction must wait for the commit of the prior transaction to finish before it commits.

[Figure 2. The external synchrony data structures. (a) Data structures with a committing and active transaction; (b) data structures after the first transaction commits.]

Figure 2 shows how the external synchrony data structures change when a process interacts with xsyncfs. In Figure 2(a), process 1234 has completed three file system operations, sending output to the screen after each one. Since the output after the first operation triggered a transaction commit, the two following operations were placed in a new active transaction. The output is buffered in the undo log; the commit dependencies maintain the relationship between buffered output and uncommitted data. In Figure 2(b), the first transaction has been committed to disk. Therefore, the output that depended upon the committed transaction has been released to the screen and the commit dependency has been discarded.

Xsyncfs uses journaled mode rather than the default ordered mode. This change guarantees ordering; specifically, the property that if an operation A causally precedes another operation B, the effects of B should never be visible unless the effects of A are also visible. This guarantee requires that B never be committed to disk before A. Otherwise, a system crash or power failure may occur between the two commits — in this case, after the system is restarted, B will be visible when A is not. Since journaled mode adds all modifications for A to the journal before the operation completes, those modifications must already be in the journal when B begins (since B causally follows A). Thus, either B is part of the same transaction as A (in which case the ordering property holds since A and B are committed atomically), or the transaction containing A is already committed before the transaction containing B starts to commit.

In contrast, the default mode in ext3 does not provide ordering since data modifications are not journaled. The kernel may write the dirty blocks of A and B to disk in any order as long as the data reaches disk before the metadata in the associated journal transaction commits.
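The compound-transaction lifecycle described in Section 3.2 — one active transaction receiving modifications, at most one transaction committing, and commits triggered by journal exhaustion, explicit synchronization, or a 5-second age limit — can be sketched as a toy model (hypothetical names; real ext3/jbd tracks far more state, and "journal space exhausted" is approximated here by an operation count):

```python
# Toy model of the ext3-style compound transaction lifecycle. Illustrative
# only; not the ext3/jbd implementation.
import itertools

class Journal:
    def __init__(self, max_ops=100, max_age=5.0):
        self.ids = itertools.count(1)
        self.active = {"id": next(self.ids), "ops": [], "born": 0.0}
        self.committing = None   # at most one transaction commits at a time
        self.durable = []        # transactions whose commit has finished
        self.max_ops = max_ops   # stands in for "journal space exhausted"
        self.max_age = max_age   # the 5-second commit interval

    def add(self, op, now):
        self.active["ops"].append(op)
        if (len(self.active["ops"]) >= self.max_ops
                or now - self.active["born"] >= self.max_age):
            self.start_commit(now)

    def fsync(self, now):
        # Explicit synchronization also triggers a commit.
        self.start_commit(now)

    def start_commit(self, now):
        if self.committing is not None:
            return   # must wait for the prior commit to finish
        if not self.active["ops"]:
            return
        self.committing = self.active
        # The next modification goes into a fresh active transaction.
        self.active = {"id": next(self.ids), "ops": [], "born": now}

    def finish_commit(self):
        if self.committing is not None:
            self.durable.append(self.committing)
            self.committing = None

j = Journal()
j.add("write A", now=0.1)
j.add("write B", now=0.2)
j.fsync(now=0.3)           # commits {A, B} as one compound transaction
j.add("write C", now=0.4)  # lands in the new active transaction
j.finish_commit()
assert [txn["ops"] for txn in j.durable] == [["write A", "write B"]]
assert j.active["ops"] == ["write C"]
```

In xsyncfs, `start_commit` would additionally be invoked by a Speculator callback when buffered output depends on the active transaction; the lifecycle itself is unchanged.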
Thus, the data modifications for B may be visible after a crash without the modifications for A being visible.

Xsyncfs informs Speculator when a new journal transaction is created — this allows Speculator to track state that depends on the uncommitted transaction. Xsyncfs also informs Speculator when a new modification is added to the transaction and when the transaction commits.

As described in Section 1, the default behavior of ext3 does not guarantee that modifications are durable after a power failure. In the Linux 2.4 kernel, durability can be ensured only by disabling the drive cache. The Linux 2.6.11 kernel provides the option of using write barriers to flush the drive cache before and after writing each transaction commit record. Since Speculator runs on a 2.4 kernel, we ported write barriers to our kernel and modified xsyncfs to use write barriers to guarantee that all committed modifications are preserved, even on power failure.

3.3 Output-triggered commits

Xsyncfs uses the causal relationship between disk I/O and external output to balance the competing concerns of throughput and latency. Currently, ext3 commits its journal every 5 seconds, which typically groups the commit of many file system operations. This strategy optimizes for throughput, a logical behavior when writes are asynchronous. However, latency is an important consideration in xsyncfs since users must wait to view output until the transactions on which that output depends commit. If xsyncfs were to use the default ext3 commit strategy, disk throughput would be high, but the user might be forced to wait up to 5 seconds to see output. This behavior is clearly unacceptable for interactive applications.

We therefore modified Speculator to support output-triggered commits. Speculator provides callbacks to xsyncfs when it buffers output or blocks a process that performed a system call for which it cannot track the propagation of causal dependencies (e.g., an ioctl). Xsyncfs uses the ext3 strategy of committing every 5 seconds unless it receives a callback that indicates that Speculator blocked or buffered output from a process that depends on the active transaction. The receipt of a callback triggers a commit of the active transaction.

Output-triggered commits adapt the behavior of the file system according to the observable behavior of the system. For instance, if a user directs output from a running application to the screen, latency is reduced by committing transactions frequently. If the user instead redirects the output to a file, xsyncfs optimizes for throughput by committing every 5 seconds. Optimizing for throughput is correct in this instance since the only event the user can observe is the completion of the application (and the completion would trigger a commit if it is a visible event). Finally, if the user were to observe the contents of the file using a different application, e.g., tail, xsyncfs would correctly optimize for latency because Speculator would track the causal relationship through the kernel data structures from tail to the transaction and provide callbacks to xsyncfs. When tail attempts to output data to the screen, Speculator callbacks will cause xsyncfs to commit the active transaction.

3.4 Rethinking sync

Asynchronous file systems provide explicit synchronization operations such as sync and fdatasync for applications with durability or ordering constraints. In a synchronous file system, such synchronization operations are redundant since ordering and durability are already guaranteed for all file system operations. However, in an externally synchronous file system, some extra support is needed to minimize latency. For instance, a user who types "sync" in a terminal would prefer that the command complete as soon as possible.

When xsyncfs receives a synchronization call such as sync from the VFS layer, it creates a commit dependency between the calling process and the active transaction. Since this does not require a disk write, the return from the synchronization call is almost instantaneous. If a visible event occurs, such as the completion of the sync process, Speculator will issue a callback that causes xsyncfs to commit the active transaction.

External synchrony simplifies the file system abstraction. Since xsyncfs requires no application modification, programmers can write the same code that they would write if they were using an unmodified file system mounted synchronously. They do not need explicit synchronization calls to provide ordering and durability since xsyncfs provides these guarantees by default for all file system operations. Further, since xsyncfs does not incur the large performance penalty usually associated with synchronous I/O, programmers do not need complicated group commit strategies to achieve acceptable performance. Group commit is provided transparently by xsyncfs.

Of course, a hand-tuned strategy might offer better performance than the default policies provided by xsyncfs. However, as described in Section 3.3, there are some instances in which xsyncfs can optimize performance when an application solution cannot. Since xsyncfs uses output-triggered commits, it knows when no external output has been generated that depends on the current transaction; in these instances, xsyncfs uses group commit to optimize throughput. In contrast, an application-specific commit strategy cannot determine the visibility
of its actions beyond the scope of the currently executing process; it must therefore conservatively commit modifications before producing external messages.

For example, consider a client that issues two sequential transactions to a database server on the same computer and then produces output. Xsyncfs can safely group the commit of both transactions. However, the database server (which does not use output-triggered commits) must commit each transaction separately since it cannot know whether or not the client will produce output after it is informed of the commit of the first transaction.

3.5 Shared memory

Speculator does not propagate speculative dependencies when processes interact through shared memory due to the complexity of checkpointing at arbitrary states in a process' execution. Since commit dependencies do not require checkpoints, we enhanced Speculator to propagate them among processes that share memory.

Speculator can track causal dependencies because processes can only interact through the operating system. Usually, this interaction involves an explicit system call (e.g., write) that Speculator can intercept. However, when processes interact through shared memory regions, only the sharing and unsharing of regions is visible to the operating system. Thus, Speculator cannot readily intercept individual reads and writes to shared memory.

We considered marking a shared memory page inaccessible when a process with write permission inherits a commit dependency that a process with read permission does not have. This would trigger a page fault whenever a process reads or writes the shared page. If a process reads the page after another writes it, any commit dependencies would be transferred from the writer to the reader. Once these processes have the same commit dependencies, the page can be restored to its normal protections. We felt this mechanism would perform poorly because of the time needed to protect and unprotect pages, as well as the extra page faults that would be incurred.

Instead, we decided to use an approach that imposes less overhead but might transfer dependencies when not strictly necessary. We make a conservative assumption that processes with write permission for a shared memory region are continually writing to that region, while processes with read permission are continually reading it. When a process with write permission for a shared region inherits a new commit dependency, any process with read permission for that region atomically inherits the same dependency.

Speculator uses the same mechanism to track commit dependencies transferred through memory-mapped files. Similarly, Speculator is conservative when propagating dependencies for multi-threaded applications — any dependency inherited by one thread is inherited by all.

4 Evaluation

Our evaluation answers the following questions:

• How does the durability of xsyncfs compare to current file systems?

• How does the performance of xsyncfs compare to current file systems?

• How does xsyncfs affect the performance of applications that synchronize explicitly?

• How much do output-triggered commits improve the performance of xsyncfs?

4.1 Methodology

All computers used in our evaluation have a 3.02 GHz Pentium 4 processor with 1 GB of RAM. Each computer has a single Western Digital WD-XL40 hard drive, which is a 7200 RPM 120 GB ATA 100 drive with a 2 MB on-disk cache. The computers run Red Hat Enterprise Linux version 3 (kernel version 2.4.21). We use a 400 MB journal size for both ext3 and xsyncfs. For each benchmark, we measured ext3 executing in both journaled and ordered mode. Since journaled mode executed faster in every benchmark, we report only journaled mode results in this evaluation. Finally, we measured the performance of ext3 both using write barriers and with the drive cache disabled. In all cases write barriers were faster than disabling the drive cache since the drive cache improves read times and reduces the frequency of writes to the disk platter. Thus, we report only results using write barriers.

4.2 Durability

Our first benchmark empirically confirms that without write barriers, ext3 does not guarantee durability. This result holds in both journaled and ordered mode, whether ext3 is mounted synchronously or asynchronously, and even if fsync commands are issued by the application after every write. Even worse, our results show that, despite the use of journaling in ext3, a loss of power can corrupt data and metadata stored in the file system.

We confirmed these results by running an experiment in which a test computer continuously writes data to its local file system. After each write completes, the test computer sends a UDP message that is logged by a remote computer. During the experiment, we cut power to the test computer. After it reboots, we compare the state of its file system to the log on the remote computer.
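The experiment above can be sketched as a small two-part protocol: a writer that notifies a remote logger after each "durable" write, and a post-reboot checker that compares the surviving file against the log. This is our own illustrative sketch, not the paper's actual harness; the logger address, file path, and block layout are assumptions.

```python
import os
import socket

LOG_ADDR = ("remote-logger.example.com", 9000)  # hypothetical remote logger


def run_writer(path: str, blocks: int, block_size: int = 4096) -> None:
    """Write blocks; after each write completes, notify the remote logger."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for seq in range(blocks):
            block = seq.to_bytes(8, "big") * (block_size // 8)
            os.pwrite(fd, block, seq * block_size)
            os.fsync(fd)  # "durable" according to the file system under test
            # The remote computer logs this datagram, so after a power cut we
            # know exactly which writes an external observer saw complete.
            sock.sendto(seq.to_bytes(8, "big"), LOG_ADDR)
    finally:
        os.close(fd)
        sock.close()


def missing_writes(path: str, logged_seqs: list, block_size: int = 4096) -> list:
    """After reboot: sequence numbers the logger saw but the file lacks."""
    with open(path, "rb") as f:
        data = f.read()
    missing = []
    for seq in logged_seqs:
        block = data[seq * block_size:(seq + 1) * block_size]
        if block != seq.to_bytes(8, "big") * (block_size // 8):
            missing.append(seq)  # durability violated for this write
    return missing
```

Any sequence number returned by missing_writes corresponds to a write that an external observer saw acknowledged but that did not survive the power failure, which is precisely the durability failure defined below.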
File system configuration          Data durable on write    Data durable on fsync
Asynchronous                       No                       Not on power failure
Synchronous                        Not on power failure     Not on power failure
Synchronous with write barriers    Yes                      Yes
External synchrony                 Yes                      Yes

Figure 3. This figure describes whether each file system provides durability to the user when an application executes a write or fsync system call. A “Yes” indicates that the file system provides durability if an OS crash or power failure occurs.
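The two columns of Figure 3 correspond to the two ways a POSIX application can ask for durability: relying on the return of write itself (meaningful only when the file system is mounted synchronously), or issuing an explicit fsync. A minimal sketch of the fsync pattern follows; the file name is illustrative, and as the figure shows, what fsync actually guarantees depends on the file system configuration underneath.

```python
import os


def durable_append(path: str, data: bytes) -> None:
    """Append data and ask the file system to make it durable before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        # On ext3 without write barriers, fsync only forces the data into the
        # volatile drive cache (the "Not on power failure" entries in Figure 3);
        # with write barriers or external synchrony it reaches the platter.
        os.fsync(fd)
    finally:
        os.close(fd)
```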
Our goal was to determine when each file system guarantees durability and ordering. We say a file system fails to provide durability if the remote computer logs a message for a write operation, but the test computer is missing the data written by that operation. In this case, durability is not provided because an external observer (the remote computer) saw output that depended on data that was subsequently lost. We say a file system fails to provide ordering if the state of the file after reboot violates the temporal ordering of writes. Specifically, for each block in the file, ordering is violated if the file does not also contain all previously-written blocks.

For each configuration shown in Figure 3, we ran four trials of this experiment: two in journaled mode and two in ordered mode. As expected, our results confirm that ext3 provides durability only when write barriers are used. Without write barriers, synchronous operations ensure only that modifications are written to the hard drive cache. If power fails before the modifications are written to the disk platter, those modifications are lost.

Some of our experiments exposed a dangerous behavior in ext3: unless write barriers are used, power failures can corrupt both data and metadata stored on disk. In one experiment, a block in the file being modified was silently overwritten with garbage data. In another, a substantial amount of metadata in the file system, including the superblock, was overwritten with garbage. In the latter case, the test machine failed to reboot until the file system was manually repaired. In both cases, corruption is caused by the commit block for a transaction being written to the disk platter before all data blocks in that transaction are written to disk. Although the operating system wrote the blocks to the drive cache in the correct order, the hard drive reorders the blocks when writing them to the disk platter. After this happens, the transaction is committed during recovery even though several data blocks do not contain valid data. Effectively, this overwrites disk blocks with uninitialized data.

Our results also confirm that ext3 without write barriers writes data to disk out of order. Journaled mode alone is insufficient to provide ordering since the order of writing transactions to the disk platter may differ from the order of writing transactions to the drive cache. In contrast, ext3 provides both durability and ordering when write barriers are combined with some form of synchronous operation (either mounting the file system synchronously or calling fsync after each modification). If write barriers are not available, the equivalent behavior could also be achieved by disabling the hard drive cache.

The last row of Figure 3 shows results for xsyncfs. As expected, xsyncfs provides both durability and ordering.

4.3 The PostMark benchmark

We next ran the PostMark benchmark, which was designed to replicate the small file workloads seen in electronic mail, netnews, and web-based commerce [8]. We used PostMark version 1.5, running in a configuration that creates 10,000 files, performs 10,000 transactions consisting of file reads, writes, creates, and deletes, and then removes all files. The PostMark benchmark has a single thread of control that executes file system operations as quickly as possible. PostMark is a good test of file system throughput since it does not generate any output or perform any substantial computation.

Each bar in Figure 4 shows the time to complete the PostMark benchmark. The y-axis is logarithmic because of the substantial slowdown of synchronous I/O. The first bar shows results when ext3 is mounted asynchronously. As expected, this offers the best performance since the file system buffers data in memory up to 5 seconds before writing it to disk. The second bar shows results using xsyncfs. Despite the I/O intensive nature of PostMark, the performance of xsyncfs is within 7% of the performance of ext3 mounted asynchronously. After examining the performance of xsyncfs, we determined that the overhead of tracking causal dependencies in the kernel accounts for most of the difference.

The third bar shows performance when ext3 is mounted synchronously. In this configuration the writing process is blocked until its modifications are committed to the drive cache. Ext3 in synchronous mode is over an order of magnitude slower than xsyncfs, even though xsyncfs provides stronger durability guarantees. Throughput is limited by the size of the drive cache; once the cache fills, subsequent writes block until some data in the cache is written to the disk platter.
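The ordering condition defined in Section 4.2 (for each block present in the post-reboot file, all previously written blocks must also be present) can be checked mechanically. A sketch of that check, assuming the checker knows which sequentially-written blocks survived:

```python
def ordering_violations(blocks_present: list) -> list:
    """Return indices of blocks that violate temporal ordering: block k is
    present in the recovered file, but some block written before k is missing."""
    violations = []
    missing_earlier = False
    for k, present in enumerate(blocks_present):
        if not present:
            missing_earlier = True        # an earlier write was lost
        elif missing_earlier:
            violations.append(k)          # k survived, but an earlier write did not
    return violations
```

For example, if writes 0 through 3 were issued in order and the recovered file holds blocks 0, 2, and 3 but not block 1, then blocks 2 and 3 violate ordering: they depend on a write that was reordered past them onto the platter.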
[Bar chart: time in seconds, logarithmic scale (1 to 10000), for ext3-async, xsyncfs, ext3-sync, and ext3-barrier.]
Figure 4. The PostMark file system benchmark. This figure shows the time to run the PostMark benchmark — the y-axis is logarithmic. Each value is the mean of 5 trials — the (relatively small) error bars are 90% confidence intervals.

[Stacked bar chart: time in seconds (0 to 1000) for the Untar, Configure, Make, and Remove stages under ext3-async, xsyncfs, ext3-sync, ext3-barrier, and RAMFS.]
Figure 5. The Apache build benchmark. This figure shows the time to run the Apache build benchmark. Each value is the mean of 5 trials — the (relatively small) error bars are 90% confidence intervals.
The last bar in Figure 4 shows the time to complete the benchmark when ext3 is mounted synchronously and write barriers are used to prevent data loss when a power failure occurs. Since write barriers synchronously flush the drive cache twice for each file system transaction, ext3's performance is over two orders of magnitude slower than that of xsyncfs.

Due to the high cost of durability, high-end storage systems sometimes use specialized hardware such as a non-volatile cache to improve performance [7]. This eliminates the need for write barriers. However, even with specialized hardware, we expect that the performance of ext3 mounted synchronously would be no better than the third bar in Figure 4, which writes data to a volatile cache. Thus, use of xsyncfs should still lead to substantial performance improvements for synchronous operations even when the hard drive has a non-volatile cache of the same size as the volatile cache on our drive.

4.4 The Apache build benchmark

We next ran a benchmark in which we untar the Apache 2.0.48 source tree into a file system, run configure in an object directory within that file system, run make in the object directory, and remove all files. The Apache build benchmark requires the file system to balance throughput and latency; it displays large amounts of screen output interleaved with disk I/O and computation.

Figure 5 shows the total amount of time to run the benchmark, with shadings within each bar showing the time for each stage. Comparing the first two bars in the graph, xsyncfs performs within 3% of ext3 mounted asynchronously. Since xsyncfs releases output as soon as the data on which it depends commits, output appears promptly during the execution of the benchmark.

For comparison, the bar at the far right of the graph shows the time to execute the benchmark using a memory-only file system, RAMFS. This provides a lower bound on the performance of a local file system, and it isolates the computation requirements of the benchmark. Removing disk I/O by running the benchmark in RAMFS improves performance by only 8% over xsyncfs because the remainder of the benchmark is dominated by computation.

The third bar in Figure 5 shows that ext3 mounted in synchronous mode is 46% slower than xsyncfs. Since computation dominates I/O in this benchmark, any difference in I/O performance is a smaller part of overall performance. The fourth bar shows that ext3 mounted synchronously with write barriers is over 11 times slower than xsyncfs. If we isolate the cost of I/O by subtracting the cost of computation (calculated using the RAMFS result), ext3 mounted synchronously is 7.5 times slower than xsyncfs while ext3 mounted synchronously with write barriers is over two orders of magnitude slower than xsyncfs. These isolated results are similar to the values that we saw for the PostMark experiments.

4.5 The MySQL benchmark

We were curious to see how xsyncfs would perform with an application that implements its own group commit strategy. We therefore ran a modified version of the OSDL TPC-C benchmark [18] using MySQL version 5.0.16 and the InnoDB storage engine. Since both MySQL and the TPC-C benchmark client are
[Line graph: New Order Transactions Per Minute (0 to 4000) versus number of threads (0 to 20); the labeled series include xsyncfs, ext3-async, and ext3-barrier.]
Figure 6. The MySQL benchmark. This figure shows the New Order Transactions Per Minute when running a modified TPC-C benchmark on MySQL with varying numbers of clients. Each result is the mean of 5 trials — the error bars are 90% confidence intervals.

[Bar chart: throughput in Kb/s (0 to 300) for ext3-async, xsyncfs, ext3-sync, and ext3-barrier.]
Figure 7. Throughput in the SPECweb99 benchmark. This figure shows the mean throughput achieved when running the SPECweb99 benchmark with 50 simultaneous connections. Each result is the mean of three trials, with error bars showing the highest and lowest result.

multi-threaded, this benchmark measures the efficacy of xsyncfs's support for shared memory. TPC-C measures
the New Order Transactions Per Minute (NOTPM) a database can process for a given number of simultaneous client connections. The total number of transactions performed by TPC-C is approximately twice the number of New Order Transactions. TPC-C requires that a database provide ACID semantics, and MySQL requires either disabling the drive cache or using write barriers to provide durability. Therefore, we compare xsyncfs with ext3 mounted asynchronously using write barriers. Since the client ran on the same machine as the server, we modified the benchmark to use UNIX sockets. This allows xsyncfs to propagate commit dependencies between the client and server on the same machine. In addition, we modified the benchmark to saturate the MySQL server by removing any wait times between transactions and creating a data set that fits completely in memory.

Figure 6 shows the NOTPM achieved as the number of clients is increased from 1 to 20. With a single client, MySQL completes 3 times as many NOTPM using xsyncfs. By propagating commit dependencies to both the MySQL server and the requesting client, xsyncfs can group commit transactions from a single client, significantly improving performance. In contrast, MySQL cannot benefit from group commit with a single client because it must conservatively commit each transaction before replying to the client.

When there are multiple clients, MySQL can group the commit of transactions from different clients. As the number of clients grows, the gap between xsyncfs and ext3 mounted asynchronously with write barriers shrinks. With 20 clients, xsyncfs improves TPC-C performance by 22%. When the number of clients reaches 32, the performance of ext3 mounted asynchronously with write barriers matches the performance of xsyncfs. From these results, we conclude that even applications such as MySQL that use a custom group commit strategy can benefit from external synchrony if the number of concurrent transactions is low to moderate.

Although ext3 mounted asynchronously without write barriers does not meet the durability requirements for TPC-C, we were still curious to see how its performance compared to xsyncfs. With only 1 or 2 clients, MySQL executes 11% more NOTPM with xsyncfs than it executes with ext3 without write barriers. With 4 or more clients, the two configurations yield equivalent performance within experimental error.

4.6 The SPECweb99 benchmark

Since our previous benchmarks measured only workloads confined to a single computer, we also ran the SPECweb99 [29] benchmark to examine the impact of external synchrony on a network-intensive application. In the SPECweb99 benchmark, multiple clients issue a mix of HTTP GET and POST requests. HTTP GET requests are issued for both static and dynamic content up to 1 MB in size. A single client, emulating 50 simultaneous connections, is connected to the server, which runs Apache 2.0.48, by a 100 Mb/s Ethernet switch. As we use the default Apache settings, 50 connections are sufficient to saturate our server.

We felt that this benchmark might be especially challenging for xsyncfs since sending a network message externalizes state. Since xsyncfs only tracks causal dependencies on a single computer, it must buffer each message
Request size     ext3-async         xsyncfs
0–1 KB           0.064 (±0.025)     0.097 (±0.002)
1–10 KB          0.150 (±0.034)     0.180 (±0.001)
10–100 KB        1.084 (±0.052)     1.094 (±0.003)
100–1000 KB      10.253 (±0.098)    10.072 (±0.066)

Figure 8. SPECweb99 latency results. The figure shows the mean time (in seconds) to request a file of a particular size during three trials of the SPECweb99 benchmark with 50 simultaneous connections. 90% confidence intervals are given in parentheses.

until the file system data on which that message depends has been committed. In addition to the normal log data written by Apache, the SPECweb99 benchmark writes a log record to the file system as a result of each HTTP POST. Thus, small file writes are common during benchmark execution — a typical 45 minute run has approximately 150,000 file system transactions.

As shown in Figure 7, SPECweb99 throughput using xsyncfs is within 8% of the throughput achieved when ext3 is mounted asynchronously. In contrast to ext3, xsyncfs guarantees that the data associated with each POST request is durable before a client receives the POST response. The third bar in Figure 7 shows that SPECweb99 using ext3 mounted synchronously achieves 6% higher throughput than xsyncfs. Unlike the previous benchmarks, SPECweb99 writes little data to disk, so most writes are buffered by the drive cache. The last bar shows that xsyncfs achieves 7% better throughput than ext3 mounted synchronously with write barriers.

Figure 8 summarizes the average latency of individual HTTP requests during benchmark execution. On average, use of xsyncfs adds no more than 33 ms of delay to each request — this value is less than the commonly cited perception threshold of 50 ms for human users [5]. Thus, a user should perceive no difference in response time between xsyncfs and ext3 for HTTP requests.

4.7 Benefit of output-triggered commits

To measure the benefit of output-triggered commits, we also implemented an eager commit strategy for xsyncfs that triggers a commit whenever the file system is modified. The eager commit strategy still allows for group commit since multiple modifications are grouped into a single file system transaction while the previous transaction is committing. The next transaction will only start to commit once the commit of the previous transaction completes. The eager commit strategy attempts to minimize the latency of individual file system operations.

We executed the previous benchmarks using the eager commit strategy. Figure 9 compares results for the two strategies. The output-triggered commit strategy performs better than the eager commit strategy in every benchmark except SPECweb99, which creates so much output that the eager commit and output-triggered commit strategies perform very similarly. Since the eager commit strategy attempts to minimize the latency of a single operation, it sacrifices the opportunity to improve throughput. In contrast, the output-triggered commit strategy only minimizes latency after output has been generated that depends on a transaction; otherwise it maximizes throughput.

5 Related work

To the best of our knowledge, xsyncfs is the first local file system to provide high-performance synchronous I/O without requiring specialized hardware support or application modification. Further, xsyncfs is the first file system to use the causal relationship between file modifications and external output to decide when to commit data.

While xsyncfs takes a software-only approach to providing high-performance synchronous I/O, specialized hardware can achieve the same result. The Rio file cache [2] and the Conquest file system [31] use battery-backed main memory to make writes persistent. Durability is guaranteed only as long as the computer has power or the batteries remain charged.

Hitz et al. [7] store file system journal modifications on a battery-backed RAM drive cache, while writing file system data to disk. We expect that synchronous operations on Hitz's hybrid system would perform no better than ext3 mounted synchronously without write barriers in our experiments. Thus, xsyncfs could substantially improve the performance of such hybrid systems.

eNVy [33] is a file system that stores data on flash-based NVRAM. The designers of eNVy found that although reads from NVRAM were fast, writes were prohibitively slow. They used a battery-backed RAM write cache to achieve reasonable write performance. The write performance issues seen in eNVy are similar to those we experienced writing data to commodity hard drives. Therefore, it is likely that xsyncfs could also improve performance for flash file systems.

Xsyncfs's focus on providing both strong durability and reasonable performance contrasts sharply with the trend in commodity file systems toward relaxing durability to improve performance. Early file systems such as FFS [14] and the original UNIX file system [22] introduced the use of a main memory buffer cache to hold writes until they are asynchronously written to disk. Early file systems suffered from potential corruption when a computer lost power or an operating system crashed. Recovery often required a time consuming
Benchmark                    Eager Commits      Output-Triggered Commits    Speedup
PostMark (seconds)           9.879 (±0.056)     8.668 (±0.478)              14%
Apache (seconds)             111.41 (±0.32)     109.42 (±0.71)              2%
MySQL 1 client (NOTPM)       3323 (±60)         4498 (±73)                  35%
MySQL 20 clients (NOTPM)     3646 (±217)        4052 (±200)                 11%
SPECweb99 (Kb/s)             312 (±1)           311 (±2)                    0%

Figure 9. This figure compares the performance of output-triggered commits with an eager commit strategy. Each result shows the mean of 5 trials, except SPECweb99, which is the mean of 3 trials. 90% confidence intervals are given in parentheses.
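The two strategies compared in Figure 9 differ only in what triggers a commit. The following toy model is our own illustration of that difference, not the xsyncfs implementation: the eager policy commits on every modification, while the output-triggered policy commits only when buffered output depends on uncommitted data, falling back to the periodic 5-second ext3-style commit otherwise.

```python
class CommitPolicy:
    """Toy model: decide when to commit a batch of buffered modifications."""

    def __init__(self, eager: bool, interval: float = 5.0):
        self.eager = eager        # True: eager commits; False: output-triggered
        self.interval = interval  # periodic commit interval (seconds)
        self.last_commit = 0.0
        self.commits = 0
        self.dirty = False

    def on_modify(self, now: float) -> None:
        self.dirty = True
        if self.eager:            # eager: every modification triggers a commit
            self._commit(now)

    def on_output(self, now: float) -> None:
        # Output-triggered: commit as soon as buffered output depends on an
        # uncommitted modification, to minimize latency for the observer.
        if self.dirty:
            self._commit(now)

    def on_tick(self, now: float) -> None:
        # Fall back to the periodic ext3-style commit when no output is waiting.
        if self.dirty and now - self.last_commit >= self.interval:
            self._commit(now)

    def _commit(self, now: float) -> None:
        if self.dirty:
            self.commits += 1
            self.dirty = False
            self.last_commit = now


# A write-only workload (no output), one modification every 10 ms for ~10 s:
eager, triggered = CommitPolicy(eager=True), CommitPolicy(eager=False)
for i in range(1000):
    t = i * 0.01
    for p in (eager, triggered):
        p.on_modify(t)
        p.on_tick(t)
```

In this run the eager policy commits 1000 times while the output-triggered policy performs a single periodic commit, mirroring why output-triggered commits win on throughput-bound workloads; interleaving on_output calls would force both policies to commit promptly, matching the SPECweb99 row of Figure 9.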
examination of the entire state of the file system (e.g., running fsck). For this reason, file systems such as Cedar [6] and LFS [23] added the complexity of a write-ahead log to enable fast, consistent recovery of file system state. Yet, as was shown in our evaluation, journaling data to a write-ahead log is insufficient to prevent file system corruption if the drive cache reorders block writes. An alternative to write-ahead logging, Soft Updates [25], carefully orders disk writes to provide consistent recovery. Xsyncfs builds on this prior work since it writes data after returning control to the application and uses a write-ahead log. Thus, external synchrony could improve the performance of synchronous I/O with other journaling file systems such as JFS [1] or ReiserFS [16].

Fault tolerance researchers have long defined consistent recovery in terms of the output seen by the outside world [3, 11, 30]. For example, the output commit problem requires that, before a message is sent to the outside world, the state from which that message is sent must be preserved. In the same way, we argue that the guarantees provided by synchronous disk I/O should be defined by the output seen by the outside world, rather than by the results seen by local processes.

It is interesting to speculate why the principle of outside observability is widely known and used in fault tolerance research yet new to the domain of general purpose applications and I/O. We believe this dichotomy arises from the different scope and standard of recovery in the two domains. In fault tolerance research, the scope of recovery is the entire process; hence not using the principle of outside observability would require a synchronous disk I/O at every change in process state. In general purpose applications, the scope of recovery is only the I/O issued by the application (which can be viewed as an application-specific recovery protocol). Hence it is feasible (though still slow) to issue each I/O synchronously. In addition, the standard for recovery in fault tolerance research is well defined: a recovery system should lose no visible output. In contrast, the standard for recovery in general purpose systems is looser: asynchronous I/O is common, and even synchronous I/O is usually committed synchronously only to the volatile hard drive cache.

Our implementation of external synchrony draws upon two other techniques from the fault tolerance literature. First, buffering output until the commit is similar to deferring message sends until commit [12]. Second, tracking causal dependencies to identify what and when to commit is similar to causal tracking in message logging protocols [4]. We use these techniques in isolation to improve performance and maintain the appearance of synchronous I/O. We also use these techniques in combination via output-triggered commits, which automatically balance throughput and latency.

Transactions, provided by operating systems such as QuickSilver [24], TABS [28], and Locus [32], and by transactional file systems [10, 19], also give the strong durability and ordering guarantees that are provided by xsyncfs. In addition, transactions provide atomicity for a set of file system operations. However, transactional systems typically require that applications be modified to specify transaction boundaries. In contrast, use of xsyncfs requires no such modification.

6 Conclusion

It is challenging to develop simple and reliable software systems if the foundations upon which those systems are built are unreliable. Asynchronous I/O is a prime example of one such unreliable foundation. OS crashes and power failures can lead to loss of data, file system corruption, and out-of-order modifications. Nevertheless, current file systems present an asynchronous I/O interface by default because the performance penalty of synchronous I/O is assumed to be too large.

In this paper, we have proposed a new abstraction, external synchrony, that preserves the simplicity and reliability of a synchronous I/O interface, yet performs approximately as well as an asynchronous I/O interface. Based on these results, we believe that externally synchronous file systems such as xsyncfs can provide a better foundation for the construction of reliable software systems.
Acknowledgments

We thank Manish Anand, Evan Cooke, Anthony Nicholson, Dan Peek, Sushant Sinha, Ya-Yunn Su, our shepherd, Rob Pike, and the anonymous reviewers for feedback on this paper. The work has been supported by the National Science Foundation under award CNS-0509093. Jason Flinn is supported by NSF CAREER award CNS-0346686, and Ed Nightingale is supported by a Microsoft Research Student Fellowship. Intel Corp. has provided additional support. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of NSF, Intel, Microsoft, the University of Michigan, or the U.S. government.

References

[1] Best, S. JFS overview. Tech. rep., IBM, http://www-128.ibm.com/developerworks/linux/library/l-jfs.html, 2000.

[2] Chen, P. M., Ng, W. T., Chandra, S., Aycock, C., Rajamani, G., and Lowell, D. The Rio file cache: Surviving operating system crashes. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (Cambridge, MA, October 1996), pp. 74–83.

[3] Elnozahy, E. N., Alvisi, L., Wang, Y.-M., and Johnson, D. B. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys 34, 3 (September 2002), 375–408.

[4] Elnozahy, E. N., and Zwaenepoel, W. Manetho: Transparent rollback-recovery with low overhead, limited rollback, and fast output commit. IEEE Transactions on Computers C-41, 5 (May 1992), 526–531.

[5] Flautner, K., and Mudge, T. Vertigo: Automatic performance-setting for Linux. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (Boston, MA, December 2002), pp. 105–116.

[6] Hagmann, R. Reimplementing the Cedar file system using logging and group commit. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (Austin, TX, 1987), pp. 155–162.

[7] Hitz, D., Lau, J., and Malcolm, M. File system design for an NFS file server appliance. In Proceedings of the Winter 1994 USENIX Technical Conference (1994).

[8] Katcher, J. PostMark: A new file system benchmark. Tech. Rep. TR3022, Network Appliance, 1997.

[9] Lamport, L. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21, 7 (1978), 558–565.

[10] Liskov, B., and Rodrigues, R. Transactional file systems can be fast. In Proceedings of the 11th SIGOPS European Workshop (Leuven, Belgium, September 2004).

[11] Lowell, D. E., Chandra, S., and Chen, P. M. Exploring failure transparency and the limits of generic recovery. In Pro-

[17] Nightingale, E. B., Chen, P. M., and Flinn, J. Speculative execution in a distributed file system. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (Brighton, United Kingdom, October 2005), pp. 191–205.

[18] OSDL. OSDL Database Test 2. http://www.osdl.org/.

[19] Paxton, W. H. A client-based transaction system to maintain data integrity. In Proceedings of the 7th ACM Symposium on Operating Systems Principles (1979), pp. 18–23.

[20] Prabhakaran, V., Bairavasundaram, L. N., Agrawal, N., Gunawi, H. S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. IRON file systems. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (Brighton, United Kingdom, October 2005), pp. 206–220.

[21] Qin, F., Tucek, J., Sundaresan, J., and Zhou, Y. Rx: Treating bugs as allergies—a safe method to survive software failures. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (Brighton, United Kingdom, October 2005), pp. 235–248.

[22] Ritchie, D. M., and Thompson, K. The UNIX time-sharing system. Communications of the ACM 17, 7 (1974), 365–375.

[23] Rosenblum, M., and Ousterhout, J. K. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems 10, 1 (February 1992), 26–52.

[24] Schmuck, F., and Wylie, J. Experience with transactions in QuickSilver. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (October 1991), pp. 239–253.

[25] Seltzer, M. I., Ganger, G. R., McKusick, M. K., Smith, K. A., Soules, C. A. N., and Stein, C. A. Journaling versus soft updates: Asynchronous meta-data protection in file systems. In USENIX Annual Technical Conference (San Diego, CA, June 2000), pp. 18–23.

[26] Silberschatz, A., and Galvin, P. B. Operating System Concepts (5th Edition). Addison Wesley, February 1998. p. 27.

[27] Slashdot. Your Hard Drive Lies to You. http://hardware.slashdot.org/article.pl?sid=05/05/13/0529252.

[28] Spector, A. Z., Daniels, D., Duchamp, D., Eppinger, J. L., and Pausch, R. Distributed transactions for reliable systems. In Proceedings of the 10th ACM Symposium on Operating Systems Principles (Orcas Island, WA, December 1985), pp. 127–146.

[29] Standard Performance Evaluation Corporation. SPECweb99. http://www.spec.org/web99.

[30] Strom, R. E., and Yemini, S. Optimistic recovery in distributed systems. ACM Transactions on Computer Systems 3, 3 (August 1985), 204–226.

[31] Wang, A.-I. A., Reiher, P., Popek, G. J., and Kuenning, G. H. Conquest: Better performance through a disk/persistent-RAM hybrid file system. In Proceedings of the 2002 USENIX Annual Technical Conference (Monterey, CA, June 2002).

[32] Weinstein, M. J., Thomas W. Page, J., Livezey, B. K., and Popek, G. J. Transactions and synchronization in a dis-
ceedings of the 4th Symposium on Operating Systems Design and tributed operating system. In Proceedings of the 10th ACM Sym-
Implementation (San Diego, CA, October 2000). posium on Operating Systems Principles (Orcas Island, WA, De-
[12] L OWELL , D. E., AND C HEN , P. M. Persistent messages in local cember 1985), pp. 115–126.
transactions. In Proceedings of the 1998 Symposium on Princi- [33] W U , M., AND Z WAENEPOEL , W. eNVy: A non-volatile,
ples of Distributed Computing (June 1998), pp. 219–226. main memory storage system. In Proceedings of the 6th Inter-
[13] M C K USICK , M. K. Disks from the perspective of a file system. national Conference on Architectural Support for Programming
;login: 31, 3 (June 2006), 18–19. Languages and Operating Systems (San Jose, CA, 1994), pp. 86–
97.
[14] M C K USICK , M. K., J OY, W. N., L EFFLER , S. J., AND FABRY,
R. S. A fast file system for unix. ACM Transactions on Computer
Systems (TOCS) 2, 3 (August 1984), 181–197.
[15] M Y SQL AB. MySQL Reference Manual. http://dev.mysql.com/.
[16] N AMESYS . ReiserFS. http://www.namesys.com/.