
The Zettabyte File System

Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum


Abstract

In this paper we describe a new file system that provides strong data integrity guarantees, simple administration, and immense capacity. We show that with a few changes to the traditional high-level file system architecture — including a redesign of the interface between the file system and volume manager, pooled storage, a transactional copy-on-write model, and self-validating checksums — we can eliminate many drawbacks of traditional file systems. We describe a general-purpose production-quality file system based on our new architecture, the Zettabyte File System (ZFS). Our new architecture reduces implementation complexity, allows new performance improvements, and provides several useful new features almost as a side effect.

1 Introduction

Upon hearing about our work on ZFS, some people appear to be genuinely surprised and ask, "Aren't local file systems a solved problem?" From this question, we can deduce that the speaker has probably never lost important files, run out of space on a partition, attempted to boot with a damaged root file system, needed to repartition a disk, struggled with a volume manager, spent a weekend adding new storage to a file server, tried to grow or shrink a file system, mistyped something in /etc/fstab, experienced silent data corruption, or waited for fsck to finish. Some people are lucky enough to never encounter these problems because they are handled behind the scenes by system administrators. Others accept such inconveniences as inevitable in any file system. While the last few decades of file system research have resulted in a great deal of progress in performance and recoverability, much room for improvement remains in the areas of data integrity, availability, ease of administration, and scalability.

Today's storage environment is radically different from that of the 1980s, yet many file systems continue to adhere to design decisions made when 20 MB disks were considered big. Today, even a personal computer can easily accommodate 2 terabytes of storage — that's one ATA PCI card and eight 250 GB IDE disks, for a total cost of about $2500 (according to http://www.pricewatch.com) in late 2002 prices. Disk workloads have changed as a result of aggressive caching of disk blocks in memory; since reads can be satisfied by the cache but writes must still go out to stable storage, write performance now dominates overall performance[8, 16]. These are only two of the changes that have occurred over the last few decades, but they alone warrant a reexamination of file system architecture from the ground up.

We began the design of ZFS with the goals of strong data integrity, simple administration, and immense capacity. We decided to implement ZFS from scratch, subject only to the constraint of POSIX compliance. The resulting design includes pooled storage, checksumming of all on-disk data, transactional copy-on-write update of all blocks, an object-based storage model, and a new division of labor between the file system and volume manager. ZFS turned out to be simpler to implement than many recent file systems, which is a hopeful sign for the long-term viability of our design.

In this paper we examine current and upcoming problems for file systems, describe our high level design goals, and present the Zettabyte File System. In Section 2 we explain the design principles ZFS is based on. Section 3 gives an overview of the implementation of ZFS. Section 4 demonstrates ZFS in action. Section 5 discusses the design tradeoffs we made. Section 6 reviews related work. Section 7 describes future avenues for research, and Section 8 summarizes the current status of ZFS.
2 Design principles

In this section we describe the design principles we used to design ZFS, based on our goals of strong data integrity, simple administration, and immense capacity.

2.1 Simple administration

On most systems, partitioning a disk, creating a logical device, and creating a new file system are painful and slow operations. There isn't much pressure to simplify and speed up these kinds of administrative tasks because they are relatively uncommon and only performed by system administrators. At the same time though, mistakes or accidents during these sorts of administrative tasks are fairly common, easy to make, and can destroy lots of data quickly[3]. The fact that these tasks are relatively uncommon is actually an argument for making them easier — almost no one is an expert at these tasks precisely because they are so uncommon. As more and more people become their own system administrators, file systems programmers can no longer assume a qualified professional is at the keyboard (which is never an excuse for a painful user interface in the first place).

Administration of storage should be simplified and automated to a much greater degree. Manual configuration of disk space should be unnecessary. If manual configuration is desired, it should be easy, intuitive, and quick. The administrator should be able to add more storage space to an existing file system without unmounting, locking, or otherwise interrupting service on that file system. Removing storage should be just as easy. Administering storage space should be simple, fast, and difficult to screw up.

The overall goal is to allow the administrator to state his or her intent ("make a new file system") rather than the details to implement it (find unused disk, partition it, write the on-disk format, etc.). Adding more layers of automation, such as GUIs, over existing file systems won't solve the problem because the file system is the unit of administration. Hiding the division of files into file systems by covering it up with more layers of user interface won't solve underlying problems like files that are too large for their partitions, static allocation of disk space to individual file systems or unavailable data during file system repair.

2.2 Pooled storage

One of the most striking design principles in modern file systems is the one-to-one association between a file system and a particular storage device (or portion thereof). Volume managers do virtualize the underlying storage to some degree, but in the end, a file system is still assigned to some particular range of blocks of the logical storage device. This is counterintuitive because a file system is intended to virtualize physical storage, and yet there remains a fixed binding between a logical namespace and a specific device (logical or physical, they both look the same to the user).

To make the problem clearer, let's look at the analogous problem for main memory. Imagine that every time an administrator added more memory to a system, he or she had to run the "formatmem" command to partition the new memory into chunks and assign those chunks to applications. Administrators don't have to do this because virtual addressing and dynamic memory allocators take care of it automatically.

Similarly, file systems should be decoupled from physical storage in much the same way that virtual memory decouples address spaces from memory banks. Multiple file systems should be able to share one pool of storage. Allocation should be moved out of the file system and into a storage space allocator that parcels out permanent storage space from a pool of storage devices as file systems request it. Contrast the traditional volume approach in Figure 1 with the pooled storage approach in Figure 2. Logical volumes are a small step in this direction, but they still look like a range of bytes that must be partitioned before use.

The interface between the file system and the volume manager puts allocation on the wrong side of the interface, making it difficult to grow and shrink file systems, share space, or migrate live data.
2.3 Dynamic file system size

If a file system can only use space from its partition, the system administrator must then predict (i.e., guess) the maximum future size of each file system at the time of its creation. Some file systems solve this problem by providing programs to grow and shrink file systems, but they only work under certain circumstances, are slow, and must be run by hand (rather than occurring automatically). They also require a logical volume manager with the capability to grow and shrink logical partitions.

Figure 1: Storage divided into volumes. (Diagram: three file systems (FS), each bound one-to-one to its own volume (virtual disk); naming and storage are tightly bound, and no space is shared.)

Figure 2: Pooled storage. (Diagram: three file systems (FS) drawing from a single storage pool; naming and storage are decoupled, and all space is shared.)

Besides the out-of-space scenario, this also prevents the creation of a new file system on a fully partitioned disk.¹ Most of the disk space may be unused, but if no room is left for a new partition, no new file systems may be created. While the growth of disk space has made this limitation less of a concern — users can afford to waste lots of disk space because many systems aren't using it anyway[5] — partitioning of disk space remains an administrative headache.

¹ Some might wonder why multiple file systems sharing pooled storage is preferable to one large growable file system. Briefly, the file system is a useful point of administration (for backup, mount, etc.) and provides fault isolation.

Once the storage used by a file system can grow and shrink dynamically through the addition and removal of devices, the next step is a file system that can grow and shrink dynamically as users add or remove data. This process should be entirely automatic and should not require administrator intervention. If desired, the administrator should be able to set quotas and reservations on file systems or groups of file systems in order to prevent unfair usage of the storage pool.

Since file systems have no fixed size, it no longer makes sense to statically allocate file system metadata at file system creation time. In any case, administrators shouldn't be burdened with a task that the file system can perform automatically. Instead, structures such as inodes² should be allocated and freed dynamically as files are created and deleted.

² An inode is the structure used by a file system to store metadata associated with a file, such as its owner.
2.4 Always consistent on-disk data

Most file systems today still allow the on-disk data to be inconsistent in some way for varying periods of time. If an unexpected crash or power cycle happens while the on-disk state is inconsistent, the file system will require some form of repair on the next boot. Besides FFS-style file systems that depend on repair from fsck[13], metadata logging file systems require a log replay[20], and soft updates leave behind leaked blocks and inodes that must be reclaimed[12]. Log-structured file systems periodically create self-consistent checkpoints, but the process of creating a checkpoint is too expensive to happen frequently[16, 17].

To see why repair after booting isn't an acceptable approach to consistency, consider the common case in which a bootloader reads the root file system in order to find the files it needs to boot the kernel. If log replay is needed in order to make the file system consistent enough to find those files, then all the recovery code must also be in the bootloader. File systems using soft updates don't have this problem, but they still require repair activity after boot to clean up leaked inodes and blocks[12]. Soft updates are also non-trivial to implement, requiring "detailed information about the relationship between cached pieces of data,"[18] roll-back and roll-forward of parts of metadata, and careful analysis of each file system operation to determine the order that updates should appear on disk[12, 18].

The best way to avoid file system corruption due to system panic or power loss is to keep the data on the disk self-consistent at all times, as WAFL[8] does. To do so, the file system needs a simple way to transition from one consistent on-disk state to another without any window of time when the system could crash and leave the on-disk data in an inconsistent state. The implementation of this needs to be relatively foolproof and general, so that programmers can add new features and fix bugs without having to think too hard about maintaining consistency.

2.5 Immense capacity

Many file systems in use today have 32-bit block addresses, and are usually limited to a few terabytes. As we noted in the introduction, a commodity personal computer can easily hold several terabytes of storage, so 32-bit block addresses are clearly already too small. Some recently designed file systems use 64-bit addresses, which will limit them to around 16 exabytes (2^64 bytes = 16 EB). This seems like a lot, but consider that one petabyte (2^50 bytes) datasets are plausible today[7], and that storage capacity is currently doubling approximately every 9–12 months[21]. Assuming that the rate of growth remains the same, it takes only 14 more doublings to get from 2^50 bytes to 2^64 bytes, so 16 EB datasets might appear in only 10.5 years. The lifetime of an average file system implementation is measured in decades, so we decided to use 128-bit block addresses.
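Unpacking the arithmetic behind the 10.5-year figure (this simply restates the numbers just quoted, using the optimistic 9-month doubling period; at 12 months per doubling the same 14 doublings would take 14 years):

    \[
      \frac{2^{64}\ \text{bytes}}{2^{50}\ \text{bytes}} = 2^{14}
      \qquad
      14\ \text{doublings} \times 9\ \text{months} = 126\ \text{months} \approx 10.5\ \text{years}.
    \]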
File systems that want to handle 16 EB of data need more than bigger block addresses, they also need scalable algorithms for directory lookup, metadata allocation, block allocation, I/O scheduling, and all other routine³ operations. The on-disk format deserves special attention to make sure it won't preclude scaling in some fundamental way.

³ "Routine" includes operations to recover from an unexpected system power cycle, or any other activity that occurs on boot.

It may seem too obvious to mention, but our new file system shouldn't depend on fsck to maintain on-disk consistency. fsck, the file system checker that scans the entire file system[10], has already fallen out of favor in the research community[8, 9, 16, 20], but many production file systems still rely on fsck at some point in normal usage.⁴ O(data) operations to repair the file system must be avoided except in the face of data corruption caused by unrecoverable physical media errors, administrator error, or other sources of unexpected external havoc.

⁴ For example, RedHat 8.0 supports ext3, which logs metadata operations and theoretically doesn't need fsck, but when booting a RedHat 8.0 system after an unexpected crash, the init scripts offer the option of running fsck on the ext3 file systems anyway. File systems using soft updates also still require fsck to be run in the background to reclaim leaked inodes and blocks[12].

2.6 Error detection and correction

In the ideal world, disks never get corrupted, hardware RAID never has bugs, and reads always return the same data that was written. In the real world, firmware has bugs too. Bugs in disk controller firmware can result in a variety of errors, including misdirected reads, misdirected writes, and phantom writes.⁵ In addition to hardware failures, file system corruption can be caused by software or administrative errors, such as bugs in the disk driver or turning the wrong partition into a swap device. Validation at the block interface level can only catch a subset of the causes of file system corruption.

⁵ Phantom writes are when the disk reports that it wrote the block but didn't actually write it.
Traditionally, file systems have trusted the data read in from disk. But if the file system doesn't validate data read in from disk, the consequences of these errors can be anything from returning corrupted data to a system panic. The file system should validate data read in from disk in order to detect many kinds of file system corruption (unfortunately, it can't detect those caused by file system bugs). The file system should also automatically correct corruption, if possible, by writing the correct block back to the disk.

2.7 Integration of the volume manager

The traditional way to add features like mirroring is to write a volume manager that exports a logical block device that looks exactly like a physical block device. The benefit of this approach is that any file system can use any volume manager and no file system code has to be changed. However, emulating a regular block device has serious drawbacks: the block interface destroys all semantic information, so the volume manager ends up managing on-disk consistency much more conservatively than it needs to since it doesn't know what the dependencies between blocks are. It also doesn't know which blocks are allocated and which are free, so it must assume that all blocks are in use and need to be kept consistent and up-to-date. In general, the volume manager can't make any optimizations based on knowledge of higher-level semantics.

Many file systems already come with their own volume managers (VxFS and VxVM, XFS and XLV, UFS and SVM). The performance and efficiency of the entire storage software stack should be improved by changing the interface between the file system and volume manager to something more useful than the block device interface. The resulting solution should be lightweight enough that it imposes virtually no performance penalty in the case of a storage pool containing a single plain device.

2.8 Excellent performance

Finally, performance should be excellent. Performance and features are not mutually exclusive. By necessity, we had to start from a clean slate with ZFS, which allowed us to redesign or eliminate crufty old interfaces accumulated over the last few decades. One key aspect of performance is the observation that file system performance is increasingly dominated by write performance[16, 8]. In general, block allocation algorithms should favor writes over reads, and individual small writes should be grouped together into large sequential writes rather than scattered over the disk.

3 The Zettabyte File System

The Zettabyte File System (ZFS) is a general purpose file system based on the principles described in the last section. ZFS is implemented on the Solaris operating system and is intended for use on everything from desktops to database servers. In this section we give a high level overview of the ZFS architecture. As we describe ZFS, we show how our design decisions relate to the principles we outlined in the last section.

3.1 Storage model

The most radical change introduced by ZFS is a re-division of labor among the various parts of system software. The traditional file system block diagram looks something like the left side of Figure 3. The device driver exports a block device interface to the volume manager, the volume manager exports another block device interface to the file system, and the file system exports vnode operations⁶ to the system call layer.

⁶ Vnode operations are part of the VFS (Virtual File System interface), a generic interface between the operating system and the file system. An example vnode operation is the vop_create() operation, which creates a new file.

The ZFS block diagram is the right side of Figure 3. Starting from the bottom, the device driver exports a block device to the Storage Pool Allocator (SPA). The SPA handles block allocation and I/O, and exports virtually addressed, explicitly allocated and freed blocks to the Data Management Unit (DMU). The DMU turns the virtually addressed blocks from the SPA into a transactional object interface for the ZFS POSIX Layer (ZPL). Finally, the ZPL implements a POSIX file system on top of DMU objects, and exports vnode operations to the system call layer.

Figure 3: Traditional file system block diagram (left), vs. the ZFS block diagram (right). (Diagram: the traditional stack runs System Call, Vnode Interface (VOP_MUMBLE()), File System, Block Device Interface <logical device, offset>, Volume Manager, Block Device Interface <physical device, offset>, Device Driver. The ZFS stack runs System Call, Vnode Interface (VOP_MUMBLE()), ZFS POSIX Layer (ZPL), Object Transaction Interface <dataset, object, offset>, Data Management Unit (DMU), Data Virtual Addressing <data virtual address>, Storage Pool Allocator (SPA), Block Device Interface <physical device, offset>, Device Driver.)

The block diagrams are arranged so that roughly equivalent functional blocks in the two models are lined up with each other. Note that we have separated the functionality of the file system component into two distinct parts, the ZPL and the DMU. We also replaced the block device interface between the rough equivalents of the file system and the volume manager with a virtually addressed block interface.
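To make this division of labor concrete, the sketch below shows one plausible C rendering of the interfaces each layer exports. The type and function names here (bdev_*, spa_*, dmu_*, zpl_*) are invented for illustration and are not the actual Solaris or ZFS interfaces; only the shape of each interface (what each layer names storage by) follows the description above.

    /*
     * Hypothetical sketch of the interfaces each layer exports, following
     * the description above and Figure 3.  These names and signatures are
     * invented for illustration; they are not the actual Solaris or ZFS
     * interfaces.
     */
    #include <stdint.h>
    #include <stddef.h>

    /* Device driver: a block device is addressed by <device, offset>. */
    int bdev_read(int dev, uint64_t offset, void *buf, size_t len);
    int bdev_write(int dev, uint64_t offset, const void *buf, size_t len);

    /* SPA: virtually addressed, explicitly allocated and freed blocks --
     * "malloc() and free() for disk space".  A data virtual address (DVA)
     * names a block without saying which device it lives on. */
    typedef struct dva { uint64_t dva_word[2]; } dva_t;   /* 128-bit address */
    int spa_alloc(size_t size, dva_t *dvap);
    int spa_free(dva_t dva);
    int spa_read(dva_t dva, void *buf, size_t size);
    int spa_write(dva_t dva, const void *buf, size_t size);

    /* DMU: transactional objects named by <dataset, object, offset>. */
    typedef struct dmu_tx dmu_tx_t;
    dmu_tx_t *dmu_tx_create(uint64_t dataset);
    int dmu_read(uint64_t dataset, uint64_t object, uint64_t offset,
                 void *buf, size_t len);
    int dmu_write(dmu_tx_t *tx, uint64_t object, uint64_t offset,
                  const void *buf, size_t len);
    int dmu_tx_commit(dmu_tx_t *tx);    /* all of the changes or none */

    /* ZPL: exports vnode operations to the VFS layer (e.g., creating a
     * file), implemented entirely in terms of DMU objects. */
    struct vnode;
    int zpl_vop_create(struct vnode *dir, const char *name, struct vnode **vpp);

In this picture the volume manager's <logical device, offset> layer disappears entirely: everything above the SPA speaks either in DVAs or in <dataset, object, offset> triples.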
3.2 The Storage Pool Allocator

The Storage Pool Allocator (SPA) allocates blocks from all the devices in a storage pool. One system can have multiple storage pools, although most systems will only need one pool. Unlike a volume manager, the SPA does not present itself as a logical block device. Instead, it presents itself as an interface to allocate and free virtually addressed blocks — basically, malloc() and free() for disk space. We call the virtual addresses of disk blocks data virtual addresses (DVAs). Using virtually addressed blocks makes it easy to implement several of our design principles. First, it allows dynamic addition and removal of devices from the storage pool without interrupting service. None of the code above the SPA layer knows where a particular block is physically located, so when a new device is added, the SPA can immediately start allocating new blocks from it without involving the rest of the file system code. Likewise, when the user requests the removal of a device, the SPA can move allocated blocks off the disk by copying them to a new location and changing its translation for the blocks' DVAs without notifying anyone else.
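A minimal sketch of that last idea, assuming a simple per-block translation table (the SPA's actual DVA representation and remapping machinery are not described in this paper and certainly differ):

    /*
     * Illustrative sketch only (not ZFS's data structures): one way a pool
     * allocator could keep a translation from data virtual addresses (DVAs)
     * to physical locations, so that blocks can be migrated off a device by
     * copying them and updating the translation, while the layers above the
     * SPA never see a change.
     */
    #include <stdint.h>

    typedef uint64_t dva_t;            /* simplified here to a flat index */

    struct phys_loc {
        int      device;               /* which disk in the pool */
        uint64_t offset;               /* byte offset on that disk */
    };

    struct spa_map {
        struct phys_loc *slots;        /* indexed by DVA */
        uint64_t         nslots;
    };

    /* Where does this virtually addressed block currently live? */
    static struct phys_loc spa_translate(const struct spa_map *m, dva_t dva)
    {
        return m->slots[dva];
    }

    /* Evacuate a device: copy each block that lives on it somewhere else
     * and repoint the DVA.  Callers holding DVAs never notice. */
    static void spa_evacuate(struct spa_map *m, int dying_device,
                             struct phys_loc (*alloc_elsewhere)(void),
                             void (*copy_block)(struct phys_loc from,
                                                struct phys_loc to))
    {
        for (dva_t dva = 0; dva < m->nslots; dva++) {
            if (m->slots[dva].device != dying_device)
                continue;
            struct phys_loc to = alloc_elsewhere(); /* pick a healthy disk */
            copy_block(m->slots[dva], to);          /* move the data       */
            m->slots[dva] = to;                     /* update translation  */
        }
    }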
The SPA also simplifies administration. System administrators no longer have to create logical devices or partition the storage, they just tell the SPA which devices to use. By default, each file system can use as much storage as it needs from its storage pool. If necessary, the administrator can set quotas and reservations on file systems or groups of file systems to control the maximum and minimum amount of storage available to each.

The SPA has no limits that will be reached in the next few decades. It uses 128-bit block addresses, so each storage pool can address up to 256 billion billion billion billion blocks and contain hundreds of thousands of file systems.⁷ From the current state of knowledge in physics, we feel confident that 128-bit addresses will be sufficient for at least a few more decades.⁸

⁷ The number of file systems is actually constrained by the design of the operating system (e.g., the 32-bit dev_t in the stat structure returned by the stat(2) family of system calls) rather than any limit in ZFS itself.

⁸ Using quantum mechanics, Lloyd[11] calculates that a device operating at sub-nuclear energy levels (i.e., it's still in the form of atoms) could store a maximum of 10^25 bits/kg. The minimum mass of a device capable of storing 2^128 bytes would be about 272 trillion kg; for comparison, the Empire State Building masses about 500 billion kg.

3.2.1 Error detection and correction

To protect against data corruption, each block is checksummed before it is written to disk. A block's checksum is stored in its parent indirect block (see Figure 4). As we describe in more detail in Section 3.3, all on-disk data and metadata is stored in a tree of blocks, rooted at the überblock. The überblock is the only block that stores its checksum in itself. Keeping checksums in the parent of a block separates the data from its checksum on disk and makes simultaneous corruption of data and checksum less likely. It makes the checksums self-validating because each checksum is itself checksummed by its parent block. Another benefit is that the checksum doesn't need to be read in from a separate block, since the indirect block was read in to get to the data in the first place. The checksum function is a pluggable module; by default the SPA uses 64-bit Fletcher checksums[6]. Checksums are verified whenever a block is read in from disk and updated whenever a block is written out to disk. Since all data in the pool, including all metadata, is in the tree of blocks, everything ZFS ever writes to disk is checksummed.

Figure 4: ZFS stores checksums in parent indirect blocks; the root of the tree stores its checksum in itself.

Checksums also allow data to be self-healing in some circumstances. When the SPA reads in a block from disk, it has a high probability of detecting any data corruption. If the storage pool is mirrored, the SPA can read the good copy of the data and repair the damaged copy automatically (presuming the storage media hasn't totally malfunctioned).
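A minimal sketch of the read path this design implies: the checksum travels in the parent block pointer, is recomputed on every read, and a mirrored copy is used for repair on a mismatch. The structures and the simplified Fletcher-style sum below are invented for illustration; they are not ZFS's on-disk format or its actual checksum implementation.

    /*
     * Sketch of "checksum in the parent" with self-healing reads.
     * Invented structures, not ZFS code.
     */
    #include <stdint.h>
    #include <stddef.h>

    /* Simplified Fletcher-style checksum over 32-bit words (two running
     * 64-bit sums).  ZFS's default is a 64-bit Fletcher checksum [6]. */
    static void fletcher_sketch(const void *buf, size_t size, uint64_t cksum[2])
    {
        const uint32_t *ip = buf;
        const uint32_t *end = ip + size / sizeof (uint32_t);
        uint64_t a = 0, b = 0;

        for (; ip < end; ip++) {
            a += *ip;
            b += a;
        }
        cksum[0] = a;
        cksum[1] = b;
    }

    /* A parent indirect block stores, for each child, the child's location
     * (one DVA per mirrored copy) and the child's checksum. */
    struct blkptr_sketch {
        uint64_t dva[2];        /* up to two copies (a two-way mirror) */
        int      ncopies;
        uint64_t cksum[2];      /* checksum of the child's contents    */
    };

    /* Provided by lower layers in this sketch: block I/O by DVA. */
    extern int  disk_read(uint64_t dva, void *buf, size_t size);
    extern void disk_write(uint64_t dva, const void *buf, size_t size);

    /*
     * Read a child block through its parent's block pointer.  The first
     * copy whose contents match the checksum stored in the parent wins;
     * it is then used to rewrite the remaining copies (a real system would
     * first check which copies are actually bad).
     */
    static int read_verified(const struct blkptr_sketch *bp, void *buf,
                             size_t size)
    {
        for (int i = 0; i < bp->ncopies; i++) {
            uint64_t c[2];

            if (disk_read(bp->dva[i], buf, size) != 0)
                continue;
            fletcher_sketch(buf, size, c);
            if (c[0] == bp->cksum[0] && c[1] == bp->cksum[1]) {
                for (int j = 0; j < bp->ncopies; j++)
                    if (j != i)
                        disk_write(bp->dva[j], buf, size);  /* self-heal */
                return 0;
            }
        }
        return -1;      /* every copy failed its checksum */
    }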
3.2.2 Virtual devices

The SPA also implements the usual services of a volume manager: mirroring, striping, concatenation, etc. We wanted to come up with a simple, modular, and lightweight way of implementing arbitrarily complex arrangements of mirrors, stripes, concatenations, and whatever else we might think of. Our solution was a building block approach: small modular virtual device drivers called vdevs. A vdev is a small set of routines implementing a particular feature, like mirroring or striping. A vdev has one or more children, which may be other vdevs or normal device drivers. For example, a mirror vdev takes a write request and sends it to all of its children, but it sends a read request to only one (randomly selected) child. Similarly, a stripe vdev takes an I/O request, figures out which of its children contains that particular block, and sends the request to that child only. Most vdevs take only about 100 lines of code to implement; this is because on-disk consistency is maintained by the DMU, rather than at the block allocation level.

Each storage pool contains one or more top-level vdevs, each of which is a tree of vdevs of arbitrary depth. Each top-level vdev is created with a single command using a simple nested description language. The syntax is best described with an example: To make a pool containing a single vdev that is a two-way mirror of /dev/dsk/a and /dev/dsk/b, run the command "zpool create mirror(/dev/dsk/a,/dev/dsk/b)". For readability, we also allow a more relaxed form of the syntax when there is no ambiguity, e.g., "zpool create mirror /dev/dsk/a /dev/dsk/b". Figure 5 shows an example of a possible vdev tree constructed by an administrator who was forced to cobble together 100 GB of mirrored storage out of two 50 GB disks and one 100 GB disk — hopefully an uncommon situation.

Figure 5: Example vdev with the description mirror(concat(/dev/dsk/a,/dev/dsk/b),/dev/dsk/c), where disks a and b are the 50 GB disks and disk c is the 100 GB disk. (Diagram: a mirror vdev whose children are a concat vdev over the two 50 GB disks and the single 100 GB disk.)
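A sketch of a mirror vdev in this building-block style, assuming an invented vdev structure (the real vdev interface differs): writes fan out to every child, while reads go to one randomly selected child.

    /*
     * Sketch of a mirror vdev.  The vdev structure and entry points are
     * invented for illustration, not taken from the ZFS source.
     */
    #include <stdint.h>
    #include <stdlib.h>

    struct vdev;

    struct vdev_ops {
        int (*vdev_read)(struct vdev *vd, uint64_t off, void *buf, size_t len);
        int (*vdev_write)(struct vdev *vd, uint64_t off, const void *buf,
                          size_t len);
    };

    struct vdev {
        const struct vdev_ops *ops;
        struct vdev          **children;   /* other vdevs or leaf devices */
        int                    nchildren;
    };

    /* A mirror sends a write to every child... */
    static int mirror_write(struct vdev *vd, uint64_t off, const void *buf,
                            size_t len)
    {
        int err = 0;

        for (int i = 0; i < vd->nchildren; i++) {
            struct vdev *child = vd->children[i];
            if (child->ops->vdev_write(child, off, buf, len) != 0)
                err = -1;   /* a real vdev would track per-child errors */
        }
        return err;
    }

    /* ...but sends a read to only one randomly selected child. */
    static int mirror_read(struct vdev *vd, uint64_t off, void *buf, size_t len)
    {
        struct vdev *child = vd->children[rand() % vd->nchildren];

        return child->ops->vdev_read(child, off, buf, len);
    }

    static const struct vdev_ops mirror_ops = {
        .vdev_read  = mirror_read,
        .vdev_write = mirror_write,
    };

Because a child may itself be another vdev, arrangements like the mirror over a concatenation shown in Figure 5 fall out of composing such small drivers.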
3.2.3 Block allocation strategy

The SPA allocates blocks in a round-robin fashion from the top-level vdevs. A storage pool with multiple top-level vdevs allows the SPA to use dynamic striping⁹ to increase disk bandwidth. Since a new block may be allocated from any of the top-level vdevs, the SPA implements dynamic striping by spreading out writes across all available top-level vdevs at whatever granularity is convenient (remember, blocks are virtually addressed, so the SPA doesn't need a fixed stripe width to calculate where a block is located). As a result, reads also tend to spread out across all top-level vdevs. When a new device is added to the storage pool, the SPA immediately begins allocating blocks from it, increasing the total disk bandwidth without any further intervention (such as creating a new stripe group) from the administrator.

⁹ The administrator may also configure "static" striping if desired.

The SPA uses a derivative of the slab allocator[2] to allocate blocks. Storage is divided up into metaslabs, which are in turn divided into blocks of a particular size. We chose to use different sizes of blocks rather than extents in part because extents aren't amenable to copy-on-write techniques and because the performance benefits of extents are achievable with a block-based file system[14]. For good performance, a copy-on-write file system needs to find big chunks of contiguous free space to write new blocks to, and the slab allocator already has a proven track record of efficiently preventing fragmentation of memory in the face of variable sized allocations without requiring a defragmentation thread. By contrast, log-structured file systems require the creation of contiguous 512KB or 1MB segments of free disk space[16, 17], and the overhead of the segment cleaner that produced these segments turned out to be quite significant in some cases[19] (although recent work[22] has reduced the cleaning load). Using the slab allocator allows us much more freedom with our block allocation strategy.

3.3 The Data Management Unit

The next component of ZFS is the Data Management Unit (DMU). The DMU consumes blocks from the SPA and exports objects (flat files). Objects live within the context of a particular dataset. A dataset provides a private namespace for the objects contained by the dataset. Objects are identified by 64-bit numbers, contain up to 2^64 bytes of data, and can be created, destroyed, read, and written. Each write to (or creation of or destruction of) a DMU object is assigned to a particular transaction¹⁰ by the caller.

¹⁰ A transaction is a group of changes with the guarantee that either all of the changes will complete (be visible on-disk) or none of the changes will complete, even in the face of hardware failure or system crash.

The DMU keeps the on-disk data consistent at all times by treating all blocks as copy-on-write. All data in the pool is part of a tree of indirect blocks, with the data blocks as the leaves of the tree. The block at the root of the tree is called the überblock. Whenever any part of a block is written, a new block is allocated and the entire modified block is copied into it. Since the indirect block must be written in order to record the new location of the data block, it must also be copied to a new block. Newly written indirect blocks "ripple" all the way up the tree to the überblock. See Figures 6–8.

Figure 6: Copy-on-write the data block.

Figure 7: Copy-on-write the indirect blocks.

Figure 8: Rewrite the überblock in place.

When the DMU gets to the überblock at the root of the tree, it rewrites it in place, in effect switching atomically from the old tree of blocks to the new tree of blocks. In case the rewrite does not complete correctly, the überblock has an embedded checksum that detects this form of failure, and the DMU will read a backup überblock from another location. Transactions are implemented by writing out all the blocks involved and then rewriting the überblock once for the entire transaction. For efficiency, the DMU groups many transactions together, so the überblock and other indirect blocks are only rewritten once for many data block writes. The threshold for deciding when to write out a group of transactions is based on both time (each transaction waits a maximum of a few seconds) and amount of changes built up.
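A minimal in-memory sketch of the copy-on-write ripple just described (and pictured in Figures 6-8), with the tree simplified to pointers so the shape of the update is easy to see. These structures are invented for illustration; real DMU blocks live on disk, are named by DVAs, and are checksummed as described in Section 3.2.1.

    #include <stdlib.h>
    #include <string.h>

    #define NCHILD 4

    struct block {
        struct block *child[NCHILD];    /* NULL in leaf (data) blocks */
        char          data[512];
    };

    struct uberblock {
        struct block *root;             /* the only block updated in place */
    };

    /* Copy-on-write an existing block: never modify it, always copy it. */
    static struct block *block_cow(const struct block *old)
    {
        struct block *new = malloc(sizeof (*new));

        memcpy(new, old, sizeof (*new));
        return new;
    }

    /*
     * Write new data into the leaf reached by following 'path' (one child
     * index per level, len <= 512 assumed).  Every block along the path is
     * copied; the new subtree only becomes visible when the überblock is
     * repointed at the new root.  (Old blocks would be freed later, once
     * nothing references them.)
     */
    static void cow_write(struct uberblock *ub, const int *path, int depth,
                          const char *newdata, size_t len)
    {
        struct block *newroot = block_cow(ub->root);
        struct block *parent = newroot;

        for (int level = 0; level < depth; level++) {
            struct block *copy = block_cow(parent->child[path[level]]);
            parent->child[path[level]] = copy;   /* ripple the new pointer */
            parent = copy;
        }
        memcpy(parent->data, newdata, len);      /* modify only the copy */

        /* Atomic switch from the old tree to the new tree.  In ZFS this is
         * the in-place, checksummed überblock rewrite, done once per group
         * of transactions rather than once per write. */
        ub->root = newroot;
    }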
The DMU fulfills several more of our design principles. The transactional object-based interface provides a simple and generic way for any file system implemented on top of it to keep its on-disk data self-consistent. For example, if a file system needs to delete a file, it would (1) start a transaction, (2) remove the directory entry by removing it from the directory's object, free the object containing the on-disk inode, and free the object containing the file's data, in any order, and (3) commit the transaction.¹¹ In ZFS, on-disk consistency is implemented exactly once for all file system operations.

¹¹ This is, of course, only one way to implement a file system using DMU objects. For example, both the inode and the data could be stored in the same DMU object.
9
The DMU's generic object interface also makes dynamic allocation of metadata much easier for a file system. When the file system needs a new inode, it simply allocates a new object and writes the inode data to the object. When it no longer needs that inode, it destroys the object containing it. The same goes for user data as well. The DMU allocates and frees blocks from the SPA as necessary to store the amount of data contained in any particular object.

The DMU also helps simplify administration by making file systems easy to create and destroy. Each of the DMU's objects lives in the context of a dataset that can be created and destroyed independently of other datasets. The ZPL implements an individual file system as a collection of objects within a dataset which is part of a storage pool. Many file systems can share the same storage, without requiring the storage to be divided up piecemeal beforehand or even requiring any file system to know about any other file system, since each gets its own private object number namespace.

3.4 The ZFS POSIX Layer

The ZFS POSIX Layer (ZPL) makes DMU objects look like a POSIX file system with permissions and all the other trimmings. Naturally, it implements what has become the standard feature set for POSIX-style file systems: mmap(), access control lists, extended attributes, etc. The ZPL uses the DMU's object-based transactional interface to store all of its data and metadata. Every change to the on-disk data is carried out entirely in terms of creating, destroying, and writing objects. Changes to individual objects are grouped together into transactions by the ZPL in such a way that the on-disk structures are consistent when a transaction is completed. Updates are atomic; as long as the ZPL groups related changes into transactions correctly, it will never see an inconsistent on-disk state (e.g., an inode with the wrong reference count), even in the event of a crash.

Instead of using the mkfs program to create file systems, the ZPL creates each new file system itself. To do so, the ZPL creates and populates a few DMU objects with the necessary information (e.g., the inode for the root directory). This is a constant-time operation, no matter how large the file system will eventually grow. Overall, creating a new file system is about as complicated and resource-consuming as creating a new directory. In ZFS, file systems are cheap.

The attentive reader will have realized that our implementation of transactions will lose the last few seconds of writes if the system unexpectedly crashes. For applications where those seconds matter (such as NFS servers), the ZPL includes an intent log to record the activity since the last group of transactions was committed. The intent log is not necessary for on-disk consistency, only to recover uncommitted writes and provide synchronous semantics. Placing the intent log at the level of the ZPL allows us to record entries in the more compact form of "create the file 'foo' in the directory whose inode is 27" rather than "set these blocks to these values." The intent log can log either to disk or to NVRAM.
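A sketch of why a ZPL-level intent log record is compact, using invented record layouts: the logical record carries an operation code, a directory inode number, and a name, while a block-level record would have to carry entire block images for every block the operation touches.

    #include <stdint.h>

    /* ZPL-level record: "create the file 'foo' in the directory whose
     * inode is 27" fits in a few dozen bytes. */
    struct intent_create {
        uint32_t op;              /* e.g., OP_CREATE                  */
        uint64_t dir_inode;       /* 27 in the example above          */
        uint32_t mode;            /* permissions for the new file     */
        char     name[256];       /* "foo"                            */
    };

    /* Block-level record: "set these blocks to these values" must carry
     * every modified block -- directory block, inode block, and so on. */
    struct block_update {
        uint64_t blkno;
        uint8_t  contents[4096];  /* an entire block image per change */
    };

On replay after a crash, the ZPL would re-apply the logical operations recorded since the last committed group of transactions.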
4 ZFS in action

All of this high-level architectural stuff is great, but what does ZFS actually look like in practice? In this section, we'll use a transcript (slightly edited for two-column format) of ZFS in action to demonstrate three of the benefits we claim for ZFS: simplified administration, virtualization of storage, and detection and correction of data corruption. First, we'll create a storage pool and several ZFS file systems. Next, we'll add more storage to the pool dynamically and show that the file systems start using the new space immediately. Then we'll deliberately scribble garbage on one side of a mirror while it's in active use, and show that ZFS automatically detects and corrects the resulting data corruption.

First, let's create a storage pool named "home" using a mirror of two disks:

# zpool create home mirror /dev/dsk/c3t0d0s0 \
    /dev/dsk/c5t0d0s0

Now, look at the pool we just created:

# zpool info home
Pool          size    used   avail  capacity
home           80G    409M     80G        1%

Let's create and mount several file systems, one for each user's home directory, using the "-c" or "create new file system" option:

# zfs mount -c home/user1 /export/home/user1
# zfs mount -c home/user2 /export/home/user2
# zfs mount -c home/user3 /export/home/user3

Now, verify that they've been created and mounted:

# df -h -F zfs
Filesystem    size  used  avail  use%  Mounted on
home/user1     80G    4K    80G    1%  /export/home/user1
home/user2     80G    4K    80G    1%  /export/home/user2
home/user3     80G    4K    80G    1%  /export/home/user3

Notice that we just created several file systems without making a partition for each one or running newfs. Let's add some new disks to the storage pool:

# zpool add home mirror /dev/dsk/c3t1d0s0 \
    /dev/dsk/c5t1d0s0
# df -h -F zfs
Filesystem    size  used  avail  use%  Mounted on
home/user1    160G    4K   160G    1%  /export/home/user1
home/user2    160G    4K   160G    1%  /export/home/user2
home/user3    160G    4K   160G    1%  /export/home/user3

As you can see, the new space is now available to all of the file systems.

Let's copy some files into one of the file systems, then simulate data corruption by writing random garbage to one of the disks:

# cp /usr/bin/v* /export/home/user1
# dd if=/dev/urandom of=/dev/rdsk/c3t0d0s0 \
    count=10000
10000+0 records in
10000+0 records out

Now, diff¹² all of the files with the originals in /usr/bin:

# diff /export/home/user1/vacation /usr/bin/vacation
[...]
# diff /export/home/user1/vsig /usr/bin/vsig

¹² Using a version of diff that works on binary files, of course.

No corruption! Let's look at some statistics to see how many errors ZFS detected and repaired:

# zpool info -v home
                            I/O per sec      I/O errors
vdev  description           read   write   found   fixed
   1  mirror(2,3)              0       0      82      82
   2  /dev/dsk/c3t0d0s0        0       0      82       0
   3  /dev/dsk/c5t0d0s0        0       0       0       0
   4  mirror(5,6)              0       0       0       0
   5  /dev/dsk/c3t1d0s0        0       0       0       0
   6  /dev/dsk/c5t1d0s0        0       0       0       0

As you can see, 82 errors were originally found in the first child disk. The child disk passed the error up to the parent mirror vdev, which then reissued the read request to its second child. The checksums on the data returned by the second child were correct, so the mirror vdev returned the good data and also rewrote the bad copy of the data on the first child (the "fixed" column in the output would be better named "fixed by this vdev").

5 Design tradeoffs

We chose to focus our design on data integrity, recoverability and ease of administration, following the lead of other systems researchers[15] calling for a widening of focus in systems research from performance alone to overall system availability. The last two decades of file system research have taken file systems from 4% disk bandwidth utilization[13] to 90–95% of raw performance on an array of striped disks[20]. For file systems, good performance is no longer a feature, it's a requirement. As such, this paper does not focus on the performance of ZFS, but we will briefly review the effect of our design decisions on performance-related issues as part of our discussion on design tradeoffs in general.

Copy-on-write of every block provides always consistent on-disk data, but it requires a much smarter block allocation algorithm and may cause nonintuitive out-of-space errors when the disk is nearly full. At the same time, it allows us to write to any unallocated block on disk, which gives us room for many performance optimizations, including coalescing many small random writes into one large sequential write.

We traded some amount of performance in order to checksum all on-disk data. This should be mitigated by today's fast processors and by increasingly common hardware support for encryption or checksumming, which ZFS can easily take advantage of. We believe that the advantages of checksums far outweigh the costs; however, for users who trust their disks,¹³ ZFS allows them to select cheaper checksum functions or turn them off entirely. Even the highest quality disks can't detect administration or software errors, so we still recommend checksums to protect against errors such as accidentally configuring the wrong disk as a swap device.

¹³ And their cables, I/O bridges, I/O busses, and device drivers.

Checksums actually speed up ZFS in some cases. They allow us to cheaply validate data read from one side of a mirror without reading the other side. When two mirror devices are out of sync, ZFS knows which side is correct and can repair the bad side of the mirror. Since ZFS knows which blocks of a mirror are in use, it can add new sides to a mirror by duplicating only the data in the allocated blocks, rather than copying gigabytes of garbage from one disk to another.

One tradeoff we did not make: sacrificing simplicity of implementation for features. Because we were starting from scratch, we could design in features from the beginning rather than bolting them on later. As a result, ZFS is simpler, rather than more complex, than many of its predecessors. Any comparison of ZFS with other file systems should take into account the fact that ZFS includes the functionality of a volume manager and reduces the amount of user utility code necessary (through the elimination of fsck,¹⁴ mkfs, and similar utilities).

¹⁴ We will provide a tool for recovering ZFS file systems that have suffered damage, but it won't resemble fsck except in the most superficial way.

As of this writing, ZFS is about 25,000 lines of kernel code and 2,000 lines of user code, while Solaris's UFS and SVM (Solaris Volume Manager) together are about 90,000 lines of kernel code and 105,000 lines of user code. ZFS provides more functionality than UFS and SVM with about 1/7th of the total lines of code. For comparison with another file system which provides comparable scalability and capacity, XFS (without its volume manager or user utility code) was over 50,000 lines of code in 1996[20].
6 Related work

The file system that has come closest to our design principles, other than ZFS itself, is WAFL[8], the file system used internally by Network Appliance's NFS server appliances. WAFL, which stands for Write Anywhere File Layout, was the first commercial file system to use the copy-on-write tree of blocks approach to file system consistency. Both WAFL and Episode[4] store metadata in files. WAFL also logs operations at the file system level rather than the block level. ZFS differs from WAFL in its use of pooled storage and the storage pool allocator, which allows file systems to share storage without knowing anything about the underlying layout of storage. WAFL uses a checksum file to hold checksums for all blocks, whereas ZFS's checksums are in the indirect blocks, making checksums self-validating and eliminating an extra block read. Finally, ZFS is a general purpose UNIX file system, while WAFL is currently only used inside network appliances.

XFS and JFS dynamically allocate inodes[1, 20], but don't provide a generic object interface to dynamically allocate other kinds of data. Episode appears¹⁵ to allow multiple file systems (or "filesets" or "volumes") to share one partition (an "aggregate"), which is one step towards fully shared, dynamically resizeable pooled storage like ZFS provides. Although it doesn't use true pooled storage, WAFL deserves recognition for its simple and reliable method of growing file systems, since it only needs to grow its inode and block allocation map files when more space is added.

¹⁵ Changes of terminology and seemingly conflicting descriptions make us unsure we have interpreted the literature correctly.

As with any other file system, ZFS borrows many techniques from databases, cryptography, and other research areas. ZFS's real contribution is integrating all these techniques together in an actual implementation of a general purpose POSIX file system using pooled storage, transactional copy-on-write of all blocks, an object-based storage model, and self-validating checksums.

7 Future work

ZFS opens up a great number of possibilities. Databases or NFS servers could use the DMU's transactional object interface directly. File systems of all sorts are easily built on top of the DMU — two of our team members implemented the first usable prototype of the ZPL in only six weeks. The zvol driver, a sparse volume emulator that uses a DMU object as backing store, was implemented in a matter of days. Encryption at the block level will be trivial to implement, once we have a solution to the more difficult problem of a suitable key management infrastructure. Now that file systems are cheap and easy to create, they appear to be a logical administrative unit for permissions, ACLs, encryption, compression and more. ZFS supports multiple block sizes, but our algorithm for choosing a block size is in its infancy. We would like to automatically detect the block size used by applications (such as databases) and use the same block size internally, in order to avoid read-modify-write of file system blocks. Block allocation in general is an area with many possibilities to explore.

8 Status

The implementation of ZFS began in January of 2002. The transcript in Section 4 was created in October 2002, showing that we implemented some of what appear to be the most extravagant features of ZFS very early on. As of this writing, ZFS implements all of the features described in Sections 3.1–3.4 with the exception of removal of devices from the storage pool, quotas and reservations, and the intent log. We plan to deliver all these features plus snapshots in our first release of ZFS, sometime in 2004.

9 Conclusion

Current file systems still suffer from a variety of problems: complicated administration, poor data integrity guarantees, and limited capacity. The key architectural elements of ZFS that solve these problems are pooled storage, the movement of block allocation out of the file system and into the storage pool allocator, an object-based storage model, checksumming of all on-disk blocks, transactional copy-on-write of all blocks, and 128-bit virtual block addresses. Our implementation was relatively simple, especially considering that we incorporate all the functionality of a volume manager (mirroring, etc.). ZFS sets an entirely new standard for data integrity, capacity, and ease of administration in file systems.
References

[1] Steve Best. How the journaled file system cuts system restart times to the quick. http://www-106.ibm.com/developerworks/linux/library/l-jfs.html, January 2000.

[2] Jeff Bonwick. The slab allocator: An object-caching kernel memory allocator. In Proceedings of the 1994 USENIX Summer Technical Conference, 1994.

[3] Aaron Brown and David A. Patterson. To err is human. In Proceedings of the First Workshop on Evaluating and Architecting System dependabilitY (EASY '01), 2001.

[4] Sailesh Chutani, Owen T. Anderson, Michael L. Kazar, Bruce W. Leverett, W. Anthony Mason, and Robert N. Sidebotham. The Episode file system. In Proceedings of the 1992 USENIX Winter Technical Conference, 1992.

[5] John R. Douceur and William J. Bolosky. A large-scale study of file-system contents. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, 1999.

[6] John G. Fletcher. An arithmetic checksum for serial transmissions. IEEE Transactions on Communications, COM-30(1):247–252, 1982.

[7] Andrew Hanushevsky and Marcia Nowark. Pursuit of a scalable high performance multi-petabyte database. In IEEE Symposium on Mass Storage Systems, 1999.

[8] Dave Hitz, James Lau, and Michael Malcolm. File system design for an NFS file server appliance. In Proceedings of the 1994 USENIX Winter Technical Conference, 1994.

[9] Michael L. Kazar, Bruce W. Leverett, Owen T. Anderson, Vasilis Apostolides, Beth A. Bottos, Sailesh Chutani, Craig F. Everhart, W. Anthony Mason, Shu-Tsui Tu, and Edward R. Zayas. DEcorum file system architectural overview. In Proceedings of the 1990 USENIX Summer Technical Conference, 1990.

[10] T. J. Kowalski and Marshall K. McKusick. Fsck - the UNIX file system check program. Technical report, Bell Laboratories, March 1978.

[11] Seth Lloyd. Ultimate physical limits to computation. Nature, 406:1047–1054, 2000.

[12] M. Kirk McKusick and Gregory R. Ganger. Soft updates: A technique for eliminating most synchronous writes in the fast filesystem. In Proceedings of the 1999 USENIX Technical Conference - Freenix Track, 1999.

[13] M. Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX. Computer Systems, 2(3):181–197, 1984.

[14] Larry W. McVoy and Steve R. Kleiman. Extent-like performance from a UNIX file system. In Proceedings of the 1991 USENIX Winter Technical Conference, 1991.

[15] David A. Patterson, A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, P. Enriquez, A. Fox, E. Kiciman, M. Merzbacher, D. Oppenheimer, N. Sastry, W. Tetzlaff, J. Traupman, and N. Treuhaft. Recovery-oriented computing (ROC): Motivation, definition, techniques, and case studies. Technical Report UCB//CSD-02-1175, UC Berkeley Computer Science Technical Report, 2002.

[16] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, 1992.

[17] Margo Seltzer, Keith Bostic, Marshall K. McKusick, and Carl Staelin. An implementation of a log-structured file system for UNIX. In Proceedings of the 1993 USENIX Winter Technical Conference, 1993.

[18] Margo Seltzer, Greg Ganger, M. Kirk McKusick, Keith Smith, Craig Soules, and Christopher Stein. Journaling versus soft updates: Asynchronous meta-data protection in file systems. In Proceedings of the 2000 USENIX Technical Conference, 2000.

[19] Margo Seltzer, Keith Smith, Hari Balakrishnan, Jacqueline Chang, Sara McMains, and Venkata N. Padmanabhan. File system logging versus clustering: A performance comparison. In Proceedings of the 1995 USENIX Winter Technical Conference, 1995.

[20] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Michael Nishimoto, and Geoff Peck. Scalability in the XFS file system. In Proceedings of the 1996 USENIX Technical Conference, 1996.

[21] Jon William Toigo. Avoiding a data crunch. Scientific American, May 2000.

[22] Jun Wang and Yiming Hu. WOLF—a novel reordering write buffer to boost the performance of log-structured file systems. In Proceedings of the 2002 USENIX File Systems and Storage Technologies Conference, 2002.
