The Zettabyte File System
Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum
on the next boot. Besides FFS-style file systems that depend on repair from fsck[13], metadata logging file systems require a log replay[20], and soft updates leave behind leaked blocks and inodes that must be reclaimed[12]. Log-structured file systems periodically create self-consistent checkpoints, but the process of creating a checkpoint is too expensive to happen frequently[16, 17].

To see why repair after booting isn't an acceptable approach to consistency, consider the common case in which a bootloader reads the root file system in order to find the files it needs to boot the kernel. If log replay is needed in order to make the file system consistent enough to find those files, then all the recovery code must also be in the bootloader. File systems using soft updates don't have this problem, but they still require repair activity after boot to clean up leaked inodes and blocks[12]. Soft updates are also non-trivial to implement, requiring "detailed information about the relationship between cached pieces of data,"[18] roll-back and roll-forward of parts of metadata, and careful analysis of each file system operation to determine the order that updates should appear on disk[12, 18].

The best way to avoid file system corruption due to system panic or power loss is to keep the data on the disk self-consistent at all times, as WAFL[8] does. To do so, the file system needs a simple way to transition from one consistent on-disk state to another without any window of time when the system could crash and leave the on-disk data in an inconsistent state. The implementation of this needs to be relatively foolproof and general, so that programmers can add new features and fix bugs without having to think too hard about maintaining consistency.

2.5 Immense capacity

Many file systems in use today have 32-bit block addresses, and are usually limited to a few terabytes. As we noted in the introduction, a commodity personal computer can easily hold several terabytes of storage, so 32-bit block addresses are clearly already too small. Some recently designed file systems use 64-bit addresses, which will limit them to around 16 exabytes (2^64 bytes = 16 EB). This seems like a lot, but consider that one petabyte (2^50 bytes) datasets are plausible today[7], and that storage capacity is currently doubling approximately every 9–12 months[21]. Assuming that the rate of growth remains the same, it takes only 14 more doublings to get from 2^50 bytes to 2^64 bytes, so 16 EB datasets might appear in only 10.5 years. The lifetime of an average file system implementation is measured in decades, so we decided to use 128-bit block addresses.

File systems that want to handle 16 EB of data need more than bigger block addresses, they also need scalable algorithms for directory lookup, metadata allocation, block allocation, I/O scheduling, and all other routine^3 operations. The on-disk format deserves special attention to make sure it won't preclude scaling in some fundamental way.

It may seem too obvious to mention, but our new file system shouldn't depend on fsck to maintain on-disk consistency. fsck, the file system checker that scans the entire file system[10], has already fallen out of favor in the research community[8, 9, 16, 20], but many production file systems still rely on fsck at some point in normal usage.^4 O(data) operations to repair the file system must be avoided except in the face of data corruption caused by unrecoverable physical media errors, administrator error, or other sources of unexpected external havoc.

2.6 Error detection and correction

In the ideal world, disks never get corrupted, hardware RAID never has bugs, and reads always return the same data that was written. In the real world, firmware has bugs too. Bugs in disk controller firmware can result in a variety of errors, including misdirected reads, misdirected writes, and phantom writes.^5 In addition to hardware failures, file system corruption can be caused by software or administrative errors, such as bugs in the disk driver or turning the wrong partition into a swap device. Validation at the block interface level can only catch a subset of the causes of file system corruption.

Footnote 3: "Routine" includes operations to recover from an unexpected system power cycle, or any other activity that occurs on boot.
Footnote 4: For example, RedHat 8.0 supports ext3, which logs metadata operations and theoretically doesn't need fsck, but when booting a RedHat 8.0 system after an unexpected crash, the init scripts offer the option of running fsck on the ext3 file systems anyway. File systems using soft updates also still require fsck to be run in the background to reclaim leaked inodes and blocks[12].
Footnote 5: Phantom writes are when the disk reports that it wrote the block but didn't actually write it.
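To spell out the arithmetic behind the 10.5-year estimate in Section 2.5, using the faster end of the quoted 9–12 month doubling period:

    2^64 bytes / 2^50 bytes = 2^14, i.e. 14 more doublings
    14 doublings x 9 months per doubling = 126 months ≈ 10.5 years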
Traditionally, file systems have trusted the data read in from disk. But if the file system doesn't validate data read in from disk, the consequences of these errors can be anything from returning corrupted data to a system panic. The file system should validate data read in from disk in order to detect many kinds of file system corruption (unfortunately, it can't detect those caused by file system bugs). The file system should also automatically correct corruption, if possible, by writing the correct block back to the disk.

2.7 Integration of the volume manager

The traditional way to add features like mirroring is to write a volume manager that exports a logical block device that looks exactly like a physical block device. The benefit of this approach is that any file system can use any volume manager and no file system code has to be changed. However, emulating a regular block device has serious drawbacks: the block interface destroys all semantic information, so the volume manager ends up managing on-disk consistency much more conservatively than it needs to since it doesn't know what the dependencies between blocks are. It also doesn't know which blocks are allocated and which are free, so it must assume that all blocks are in use and need to be kept consistent and up-to-date. In general, the volume manager can't make any optimizations based on knowledge of higher-level semantics.

Many file systems already come with their own volume managers (VxFS and VxVM, XFS and XLV, UFS and SVM). The performance and efficiency of the entire storage software stack should be improved by changing the interface between the file system and volume manager to something more useful than the block device interface. The resulting solution should be lightweight enough that it imposes virtually no performance penalty in the case of a storage pool containing a single plain device.

2.8 Excellent performance

Finally, performance should be excellent. Performance and features are not mutually exclusive. By necessity, we had to start from a clean slate with ZFS, which allowed us to redesign or eliminate crufty old interfaces accumulated over the last few decades. One key aspect of performance is the observation that file system performance is increasingly dominated by write performance[16, 8]. In general, block allocation algorithms should favor writes over reads, and individual small writes should be grouped together into large sequential writes rather than scattered over the disk.

3 The Zettabyte File System

The Zettabyte File System (ZFS) is a general purpose file system based on the principles described in the last section. ZFS is implemented on the Solaris operating system and is intended for use on everything from desktops to database servers. In this section we give a high level overview of the ZFS architecture. As we describe ZFS, we show how our design decisions relate to the principles we outlined in the last section.

3.1 Storage model

The most radical change introduced by ZFS is a re-division of labor among the various parts of system software. The traditional file system block diagram looks something like the left side of Figure 3. The device driver exports a block device interface to the volume manager, the volume manager exports another block device interface to the file system, and the file system exports vnode operations^6 to the system call layer.

The ZFS block diagram is the right side of Figure 3. Starting from the bottom, the device driver exports a block device to the Storage Pool Allocator (SPA). The SPA handles block allocation and I/O, and exports virtually addressed, explicitly allocated and freed blocks to the Data Management Unit (DMU). The DMU turns the virtually addressed blocks from the SPA into a transactional object interface for the ZFS POSIX Layer (ZPL). Finally, the ZPL implements a POSIX file system on top of DMU objects, and exports vnode operations to the system call layer.

Footnote 6: Vnode operations are part of the VFS (Virtual File System) interface, a generic interface between the operating system and the file system. An example vnode operation is the vop_create() operation, which creates a new file.
Figure 3: Traditional file system block diagram (left), vs. the ZFS block diagram (right). (In the traditional stack, the system call layer sits on the file system, which speaks to a <logical device, offset> block device interface, which in turn sits on a <physical device, offset> block device interface. In the ZFS stack, the system call layer sits on an object transaction interface addressed by <dataset, object, offset>, which sits on a data virtual addressing layer addressed by <data virtual address>, which sits on the <physical device, offset> block device interface.)
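As a rough illustration of the two stacks in Figure 3, the C declarations below contrast the anonymous block device interface with the DMU's transactional object interface. The names and signatures are invented for this sketch and are not the actual Solaris or ZFS headers:

#include <stdint.h>
#include <stddef.h>

/* Traditional stack: everything below the file system speaks
 * <device, offset>, so no semantic information crosses the boundary. */
int blkdev_read(int device, uint64_t offset, void *buf, size_t len);
int blkdev_write(int device, uint64_t offset, const void *buf, size_t len);

/* ZFS stack: the DMU exports objects addressed by <dataset, object, offset>,
 * and every modification is tagged with the transaction it belongs to. */
typedef struct dmu_tx dmu_tx_t;     /* opaque transaction handle */

int dmu_object_read(uint64_t dataset, uint64_t object, uint64_t offset,
                    void *buf, size_t len);
int dmu_object_write(uint64_t dataset, uint64_t object, uint64_t offset,
                     const void *buf, size_t len, dmu_tx_t *tx);

The point of the contrast is that the upper interface carries the information the lower one destroys: which object a write belongs to, and which other changes it must commit with.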
The block diagrams are arranged so that roughly equivalent functional blocks in the two models are lined up with each other. Note that we have separated the functionality of the file system component into two distinct parts, the ZPL and the DMU. We also replaced the block device interface between the rough equivalents of the file system and the volume manager with a virtually addressed block interface.

3.2 The Storage Pool Allocator

The Storage Pool Allocator (SPA) allocates blocks from all the devices in a storage pool. One system can have multiple storage pools, although most systems will only need one pool. Unlike a volume manager, the SPA does not present itself as a logical block device. Instead, it presents itself as an interface to allocate and free virtually addressed blocks — basically, malloc() and free() for disk space. We call the virtual addresses of disk blocks data virtual addresses (DVAs). Using virtually addressed blocks makes it easy to implement several of our design principles. First, it allows dynamic addition and removal of devices from the storage pool without interrupting service. None of the code above the SPA layer knows where a particular block is physically located, so when a new device is added, the SPA can immediately start allocating new blocks from it without involving the rest of the file system code. Likewise, when the user requests the removal of a device, the SPA can move allocated blocks off the disk by copying them to a new location and changing its translation for the blocks' DVAs without notifying anyone else.

The SPA also simplifies administration. System administrators no longer have to create logical devices or partition the storage, they just tell the SPA which devices to use. By default, each file system can use as much storage as it needs from its storage pool. If necessary, the administrator can set quotas and reservations on file systems or groups of file systems to control the maximum and minimum amount of storage available to each.

The SPA has no limits that will be reached in the next few decades. It uses 128-bit block addresses, so each storage pool can address up to 256 billion billion billion billion blocks and contain hundreds of thousands of file systems.^7 From the current state of knowledge in physics, we feel confident that 128-bit addresses will be sufficient for at least a few more decades.^8

Footnote 7: The number of file systems is actually constrained by the design of the operating system (e.g., the 32-bit dev_t in the stat structure returned by the stat(2) family of system calls) rather than any limit in ZFS itself.
Footnote 8: Using quantum mechanics, Lloyd[11] calculates that a device operating at sub-nuclear energy levels (i.e., it's still in the form of atoms) could store a maximum of 10^25 bits/kg. The minimum mass of a device capable of storing 2^128 bytes would be about 272 trillion kg; for comparison, the Empire
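To make the "malloc() and free() for disk space" analogy concrete, here is roughly what the SPA looks like to its callers. These declarations are invented for this sketch rather than taken from the real code; the important point is that callers hold only DVAs, never physical locations:

#include <stdint.h>
#include <stddef.h>

typedef struct spa spa_t;                          /* opaque storage pool handle */
typedef struct { uint64_t vdev, offset; } dva_t;   /* data virtual address       */

/* Allocate a block of the requested size somewhere in the pool and return
 * its DVA; the caller never learns which physical device it landed on. */
int spa_alloc(spa_t *spa, size_t size, dva_t *dva_out);

/* Read or write the block named by a DVA.  The SPA resolves the DVA to a
 * physical location at I/O time, so it is free to move the block later
 * (for example, to evacuate a device being removed). */
int spa_read(spa_t *spa, dva_t dva, void *buf, size_t size);
int spa_write(spa_t *spa, dva_t dva, const void *buf, size_t size);

/* Return a block to the pool, like free(). */
int spa_free(spa_t *spa, dva_t dva);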
Figure 4: ZFS stores checksums in parent indirect blocks; the root of the tree stores its checksum in itself.

3.2.1 Error detection and correction

To protect against data corruption, each block is checksummed before it is written to disk. A block's checksum is stored in its parent indirect block (see Figure 4). As we describe in more detail in Section 3.3, all on-disk data and metadata is stored in a tree of blocks, rooted at the überblock. The überblock is the only block that stores its checksum in itself. Keeping checksums in the parent of a block separates the data from its checksum on disk and makes simultaneous corruption of data and checksum less likely. It makes the checksums self-validating because each checksum is itself checksummed by its parent block. Another benefit is that the checksum doesn't need to be read in from a separate block, since the indirect block was read in to get to the data in the first place. The checksum function is a pluggable module; by default the SPA uses 64-bit Fletcher checksums[6]. Checksums are verified whenever a block is read in from disk and updated whenever a block is written out to disk. Since all data in the pool, including all metadata, is in the tree of blocks, everything ZFS ever writes to disk is checksummed.

Checksums also allow data to be self-healing in some circumstances. When the SPA reads in a block from disk, it has a high probability of detecting any data corruption. If the storage pool is mirrored, the SPA can read the good copy of the data and repair the damaged copy automatically (presuming the storage media hasn't totally malfunctioned).

3.2.2 Virtual devices

The SPA also implements the usual services of a volume manager: mirroring, striping, concatenation, etc. We wanted to come up with a simple, modular, and lightweight way of implementing arbitrarily complex arrangements of mirrors, stripes, concatenations, and whatever else we might think of. Our solution was a building block approach: small modular virtual device drivers called vdevs. A vdev is a small set of routines implementing a particular feature, like mirroring or striping. A vdev has one or more children, which may be other vdevs or normal device drivers. For example, a mirror vdev takes a write request and sends it to all of its children, but it sends a read request to only one (randomly selected) child. Similarly, a stripe vdev takes an I/O request, figures out which of its children contains that particular block, and sends the request to that child only. Most vdevs take only about 100 lines of code to implement; this is because on-disk consistency is maintained by the DMU, rather than at the block allocation level.

Each storage pool contains one or more top-level vdevs, each of which is a tree of vdevs of arbitrary depth. Each top-level vdev is created with a single command using a simple nested description language. The syntax is best described with an example: To make a pool containing a single vdev that is a two-way mirror of /dev/dsk/a and /dev/dsk/b, run the command "zpool create mirror(/dev/dsk/a,/dev/dsk/b)". For readability, we also allow a more relaxed form of the syntax when there is no ambiguity, e.g., "zpool create mirror /dev/dsk/a /dev/dsk/b". Figure 5 shows an example of a possible vdev tree constructed by an administrator who was forced to cobble together 100 GB of mirrored storage out of two 50 GB disks and one 100 GB disk — hopefully an uncommon situation.
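To illustrate the building-block approach in code (and the vdev tree that the upcoming Figure 5 describes), here is a hypothetical sketch of a vdev and a mirror's read and write routines. The structures and names are invented for this sketch, not taken from the real vdev code:

#include <stdint.h>
#include <stdlib.h>

/* A vdev is a small set of routines plus zero or more children, which may
 * themselves be vdevs or plain device drivers. */
typedef struct vdev vdev_t;
struct vdev {
    int      (*read)(vdev_t *vd, uint64_t offset, void *buf, size_t len);
    int      (*write)(vdev_t *vd, uint64_t offset, const void *buf, size_t len);
    vdev_t  **children;
    int       nchildren;
};

/* A mirror vdev sends a write request to all of its children... */
static int mirror_write(vdev_t *vd, uint64_t offset, const void *buf, size_t len)
{
    int err = 0;
    for (int i = 0; i < vd->nchildren; i++)
        err |= vd->children[i]->write(vd->children[i], offset, buf, len);
    return err;
}

/* ...but sends a read request to only one, randomly selected, child. */
static int mirror_read(vdev_t *vd, uint64_t offset, void *buf, size_t len)
{
    vdev_t *child = vd->children[rand() % vd->nchildren];
    return child->read(child, offset, buf, len);
}

In this picture, the tree mirror(concat(/dev/dsk/a,/dev/dsk/b),/dev/dsk/c) is simply a mirror vdev whose first child is a concat vdev over the two 50 GB disks and whose second child is the 100 GB disk.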
Figure 5: Example vdev with the description mirror(concat(/dev/dsk/a,/dev/dsk/b),/dev/dsk/c) where disks a and b are the 50 GB disks and disk c is the 100 GB disk.

3.2.3 Block allocation strategy

The SPA allocates blocks in a round-robin fashion from the top-level vdevs. A storage pool with multiple top-level vdevs allows the SPA to use dynamic striping^9 to increase disk bandwidth. Since a new block may be allocated from any of the top-level vdevs, the SPA implements dynamic striping by spreading out writes across all available top-level vdevs at whatever granularity is convenient (remember, blocks are virtually addressed, so the SPA doesn't need a fixed stripe width to calculate where a block is located). As a result, reads also tend to spread out across all top-level vdevs. When a new device is added to the storage pool, the SPA immediately begins allocating blocks from it, increasing the total disk bandwidth without any further intervention (such as creating a new stripe group) from the administrator.

The SPA uses a derivative of the slab allocator[2] to allocate blocks. Storage is divided up into metaslabs, which are in turn divided into blocks of a particular size. We chose to use different sizes of blocks rather than extents in part because extents aren't amenable to copy-on-write techniques and because the performance benefits of extents are achievable with a block-based file system[14]. For good performance, a copy-on-write file system needs to find big chunks of contiguous free space to write new blocks to, and the slab allocator already has a proven track record of efficiently preventing fragmentation of memory in the face of variable sized allocations without requiring a defragmentation thread. By contrast, log-structured file systems require the creation of contiguous 512KB or 1MB segments of free disk space[16, 17], and the overhead of the segment cleaner that produced these segments turned out to be quite significant in some cases[19] (although recent work[22] has reduced the cleaning load). Using the slab allocator allows us much more freedom with our block allocation strategy.

3.3 The Data Management Unit

The next component of ZFS is the Data Management Unit (DMU). The DMU consumes blocks from the SPA and exports objects (flat files). Objects live within the context of a particular dataset. A dataset provides a private namespace for the objects contained by the dataset. Objects are identified by 64-bit numbers, contain up to 2^64 bytes of data, and can be created, destroyed, read, and written. Each write to (or creation of or destruction of) a DMU object is assigned to a particular transaction^10 by the caller.

The DMU keeps the on-disk data consistent at all times by treating all blocks as copy-on-write. All data in the pool is part of a tree of indirect blocks, with the data blocks as the leaves of the tree. The block at the root of the tree is called the überblock. Whenever any part of a block is written, a new block is allocated and the entire modified block is copied into it. Since the indirect block must be written in order to record the new location of the data block, it must also be copied to a new block. Newly written indirect blocks "ripple" all the way up the tree to the überblock. See Figures 6–8.

When the DMU gets to the überblock at the root of the tree, it rewrites it in place, in effect switching atomically from the old tree of blocks to the new tree of blocks. In case the rewrite does not complete correctly, the überblock has an embedded checksum that detects this form of failure, and the DMU will read a backup überblock from another location. Transactions are implemented by writing out all the blocks involved and then rewriting the überblock once for the entire transaction. For efficiency, the DMU groups many transactions together, so the überblock and other indirect blocks are only rewritten once for many data block writes. The threshold for deciding when to write out a group of transactions is based on both time (each transaction waits a maximum of a few seconds) and amount of changes built up.

The DMU fulfills several more of our design principles. The transactional object-based interface provides a simple and generic way for any file system

Footnote 9: The administrator may also configure "static" striping if
Footnote 10: A transaction is a group of changes with the guarantee that either all of the changes will complete (be visible on-disk) or none of the changes will complete, even in the face of hardware failure or system crash.
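A simplified sketch of the copy-on-write commit described in Section 3.3, with invented types and helpers; the real DMU batches many transactions into each such tree update:

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t vdev, offset; } dva_t;

typedef struct block block_t;
struct block {
    block_t *parent;    /* parent indirect block; NULL for the überblock */
};

/* Assumed helpers (not shown): write new data into a freshly allocated block,
 * write a modified copy of an indirect block whose child has moved, and
 * rewrite the überblock in place. */
dva_t spa_alloc_and_write(const void *data, size_t len);
dva_t write_indirect_copy(block_t *indirect, dva_t new_child_dva);
void  uberblock_rewrite(dva_t new_root_child);

void cow_commit(block_t *leaf, const void *new_data, size_t len)
{
    /* Never overwrite: the new version of the data goes into a new block. */
    dva_t new_dva = spa_alloc_and_write(new_data, len);

    /* Ripple upward: every indirect block on the path is itself copied so
     * that it records the new location of its child. */
    for (block_t *b = leaf->parent; b != NULL && b->parent != NULL; b = b->parent)
        new_dva = write_indirect_copy(b, new_dva);

    /* The überblock is the only block rewritten in place; this single write
     * switches atomically from the old tree of blocks to the new one. */
    uberblock_rewrite(new_dva);
}

Until that final überblock write lands, every block of the old tree is still intact on disk, which is why a crash at any point leaves a consistent image.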
the on-disk inode, and free the object containing the file's data, in any order, and (3) commit the transaction.^11 In ZFS, on-disk consistency is implemented exactly once for all file system operations.
the event of a crash.

Instead of using the mkfs program to create file systems, the ZPL creates each new file system itself. To do so, the ZPL creates and populates a few DMU objects with the necessary information (e.g., the inode for the root directory). This is a constant-time operation, no matter how large the file system will eventually grow. Overall, creating a new file system is about as complicated and resource-consuming as creating a new directory. In ZFS, file systems are cheap.

The attentive reader will have realized that our implementation of transactions will lose the last few seconds of writes if the system unexpectedly crashes. For applications where those seconds matter (such as NFS servers), the ZPL includes an intent log to record the activity since the last group of transactions was committed. The intent log is not necessary for on-disk consistency, only to recover uncommitted writes and provide synchronous semantics. Placing the intent log at the level of the ZPL allows us to record entries in the more compact form of "create the file 'foo' in the directory whose inode is 27" rather than "set these blocks to these values." The intent log can log either to disk or to NVRAM.

4 ZFS in action

All of this high-level architectural stuff is great, but what does ZFS actually look like in practice? In this section, we'll use a transcript (slightly edited for two-column format) of ZFS in action to demonstrate three of the benefits we claim for ZFS: simplified administration, virtualization of storage, and detection and correction of data corruption. First, we'll create a storage pool and several ZFS file systems. Next, we'll add more storage to the pool dynamically and show that the file systems start using the new space immediately. Then we'll deliberately scribble garbage on one side of a mirror while it's in active use, and show that ZFS automatically detects and corrects the resulting data corruption.

First, let's create a storage pool named "home" using a mirror of two disks:

# zpool create home mirror /dev/dsk/c3t0d0s0 \
    /dev/dsk/c5t0d0s0

Now, look at the pool we just created:

# zpool info home
Pool         size    used   avail  capacity
home          80G    409M     80G        1%

Let's create and mount several file systems, one for each user's home directory, using the "-c" or "create new file system" option:

# zfs mount -c home/user1 /export/home/user1
# zfs mount -c home/user2 /export/home/user2
# zfs mount -c home/user3 /export/home/user3

Now, verify that they've been created and mounted:

# df -h -F zfs
Filesystem    size   used  avail  use%  Mounted on
home/user1     80G     4K    80G    1%  /export/home/user1
home/user2     80G     4K    80G    1%  /export/home/user2
home/user3     80G     4K    80G    1%  /export/home/user3

Notice that we just created several file systems without making a partition for each one or running newfs. Let's add some new disks to the storage pool:

# zpool add home mirror /dev/dsk/c3t1d0s0 \
    /dev/dsk/c5t1d0s0
# df -h -F zfs
Filesystem    size   used  avail  use%  Mounted on
home/user1    160G     4K   160G    1%  /export/home/user1
home/user2    160G     4K   160G    1%  /export/home/user2
home/user3    160G     4K   160G    1%  /export/home/user3

As you can see, the new space is now available to all of the file systems.

Let's copy some files into one of the file systems, then simulate data corruption by writing random garbage to one of the disks:

# cp /usr/bin/v* /export/home/user1
# dd if=/dev/urandom of=/dev/rdsk/c3t0d0s0 \
    count=10000
10000+0 records in
10000+0 records out

Now, diff^12 all of the files with the originals in /usr/bin:

# diff /export/home/user1/vacation /usr/bin/vacation
[...]
# diff /export/home/user1/vsig /usr/bin/vsig

Footnote 12: Using a version of diff that works on binary files, of course.

No corruption! Let's look at some statistics to see how many errors ZFS detected and repaired:
# zpool info -v home
                              I/O per sec    I/O errors
vdev  description             read  write   found  fixed
1     mirror(2,3)                0      0      82     82
2     /dev/dsk/c3t0d0s0          0      0      82      0
3     /dev/dsk/c5t0d0s0          0      0       0      0
4     mirror(5,6)                0      0       0      0
5     /dev/dsk/c3t1d0s0          0      0       0      0
6     /dev/dsk/c5t1d0s0          0      0       0      0

As you can see, 82 errors were originally found in the first child disk. The child disk passed the error up to the parent mirror vdev, which then reissued the read request to its second child. The checksums on the data returned by the second child were correct, so the mirror vdev returned the good data and also rewrote the bad copy of the data on the first child (the "fixed" column in the output would be better named "fixed by this vdev").

their disks,^13 ZFS allows them to select cheaper checksum functions or turn them off entirely. Even the highest quality disks can't detect administration or software errors, so we still recommend checksums to protect against errors such as accidentally configuring the wrong disk as a swap device.

Checksums actually speed up ZFS in some cases. They allow us to cheaply validate data read from one side of a mirror without reading the other side. When two mirror devices are out of sync, ZFS knows which side is correct and can repair the bad side of the mirror. Since ZFS knows which blocks of a mirror are in use, it can add new sides to a mirror by duplicating only the data in the allocated blocks, rather than copying gigabytes of garbage from one disk to another.
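The behavior just described (read one side of the mirror, verify the checksum stored in the parent block, fall back to another side, and rewrite the bad copy) can be sketched roughly as below. The names are invented for this illustration, and a real mirror would start from a randomly selected child rather than child 0:

#include <stdint.h>
#include <stddef.h>

typedef struct vdev vdev_t;
struct vdev {
    int      (*read)(vdev_t *vd, uint64_t offset, void *buf, size_t len);
    int      (*write)(vdev_t *vd, uint64_t offset, const void *buf, size_t len);
    vdev_t  **children;
    int       nchildren;
};

/* Assumed helper: recompute the block's checksum and compare it with the
 * copy stored in the parent indirect block. */
int checksum_ok(const void *buf, size_t len, const uint64_t expected[2]);

/* Self-healing mirror read: return the first copy whose checksum verifies,
 * and repair any bad copies found along the way. */
static int mirror_read_selfheal(vdev_t *mirror, uint64_t offset,
                                void *buf, size_t len,
                                const uint64_t expected_cksum[2])
{
    int bad[16];            /* children that returned bad data (sketch limit) */
    int nbad = 0;

    for (int i = 0; i < mirror->nchildren; i++) {
        vdev_t *child = mirror->children[i];

        if (child->read(child, offset, buf, len) == 0 &&
            checksum_ok(buf, len, expected_cksum)) {
            /* Good copy found: rewrite it over the copies that failed. */
            for (int j = 0; j < nbad; j++) {
                vdev_t *victim = mirror->children[bad[j]];
                victim->write(victim, offset, buf, len);
            }
            return 0;
        }
        if (nbad < 16)
            bad[nbad++] = i;
    }
    return -1;   /* every copy was bad: report unrecoverable corruption */
}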
server appliances. WAFL, which stands for Write Anywhere File Layout, was the first commercial file system to use the copy-on-write tree of blocks approach to file system consistency. Both WAFL and Episode[4] store metadata in files. WAFL also logs operations at the file system level rather than the block level. ZFS differs from WAFL in its use of pooled storage and the storage pool allocator, which allows file systems to share storage without knowing anything about the underlying layout of storage. WAFL uses a checksum file to hold checksums for all blocks, whereas ZFS's checksums are in the indirect blocks, making checksums self-validating and eliminating an extra block read. Finally, ZFS is a general purpose UNIX file system, while WAFL is currently only used inside network appliances.

a matter of days. Encryption at the block level will be trivial to implement, once we have a solution to the more difficult problem of a suitable key management infrastructure. Now that file systems are cheap and easy to create, they appear to be a logical administrative unit for permissions, ACLs, encryption, compression and more. ZFS supports multiple block sizes, but our algorithm for choosing a block size is in its infancy. We would like to automatically detect the block size used by applications (such as databases) and use the same block size internally, in order to avoid read-modify-write of file system blocks. Block allocation in general is an area with many possibilities to explore.
References

[1] Steve Best. How the journaled file system cuts system restart times to the quick. http://www-106.ibm.com/developerworks/linux/library/l-jfs.html, January 2000.
[2] Jeff Bonwick. The slab allocator: An object-caching kernel memory allocator. In Proceedings of the 1994 USENIX Summer Technical Conference, 1994.
[3] Aaron Brown and David A. Patterson. To err is human. In Proceedings of the First Workshop on Evaluating and Architecting System dependabilitY (EASY '01), 2001.
[4] Sailesh Chutani, Owen T. Anderson, Michael L. Kazar, Bruce W. Leverett, W. Anthony Mason, and Robert N. Sidebotham. The Episode file system. In Proceedings of the 1992 USENIX Winter Technical Conference, 1992.
[5] John R. Douceur and William J. Bolosky. A large-scale study of file-system contents. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, 1999.
[6] John G. Fletcher. An arithmetic checksum for serial transmissions. IEEE Transactions on Communications, COM-30(1):247–252, 1982.
[7] Andrew Hanushevsky and Marcia Nowark. Pursuit of a scalable high performance multi-petabyte database. In IEEE Symposium on Mass Storage Systems, 1999.
[8] Dave Hitz, James Lau, and Michael Malcolm. File system design for an NFS file server appliance. In Proceedings of the 1994 USENIX Winter Technical Conference, 1994.
[9] Michael L. Kazar, Bruce W. Leverett, Owen T. Anderson, Vasilis Apostolides, Beth A. Bottos, Sailesh Chutani, Craig F. Everhart, W. Anthony Mason, Shu-Tsui Tu, and Edward R. Zayas. DEcorum file system architectural overview. In Proceedings of the 1990 USENIX Summer Technical Conference, 1990.
[10] T. J. Kowalski and Marshall K. McKusick. Fsck - the UNIX file system check program. Technical report, Bell Laboratories, March 1978.
[11] Seth Lloyd. Ultimate physical limits to computation. Nature, 406:1047–1054, 2000.
[12] M. Kirk McKusick and Gregory R. Ganger. Soft updates: A technique for eliminating most synchronous writes in the fast filesystem. In Proceedings of the 1999 USENIX Technical Conference - Freenix Track, 1999.
[13] M. Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX. Computer Systems, 2(3):181–197, 1984.
[14] Larry W. McVoy and Steve R. Kleiman. Extent-like performance from a UNIX file system. In Proceedings of the 1991 USENIX Winter Technical Conference, 1991.
[15] David A. Patterson, A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, P. Enriquez, A. Fox, E. Kiciman, M. Merzbacher, D. Oppenheimer, N. Sastry, W. Tetzlaff, J. Traupman, and N. Treuhaft. Recovery-oriented computing (ROC): Motivation, definition, techniques, and case studies. Technical Report UCB//CSD-02-1175, UC Berkeley Computer Science Technical Report, 2002.
[16] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, 1992.
[17] Margo Seltzer, Keith Bostic, Marshall K. McKusick, and Carl Staelin. An implementation of a log-structured file system for UNIX. In Proceedings of the 1993 USENIX Winter Technical Conference, 1993.
[18] Margo Seltzer, Greg Ganger, M. Kirk McKusick, Keith Smith, Craig Soules, and Christopher Stein. Journaling versus soft updates: Asynchronous meta-data protection in file systems. In Proceedings of the 2000 USENIX Technical Conference, 2000.
[19] Margo Seltzer, Keith Smith, Hari Balakrishnan, Jacqueline Chang, Sara McMains, and Venkata N. Padmanabhan. File system logging versus clustering: A performance comparison. In Proceedings of the 1995 USENIX Winter Technical Conference, 1995.
[20] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Michael Nishimoto, and Geoff Peck. Scalability in the XFS file system. In Proceedings of the 1996 USENIX Technical Conference, 1996.
[21] Jon William Toigo. Avoiding a data crunch. Scientific American, May 2000.
[22] Jun Wang and Yiming Hu. WOLF - a novel reordering write buffer to boost the performance of log-structured file systems. In Proceedings of the 2002 USENIX File Systems and Storage Technologies Conference, 2002.