Journaling the Linux ext2fs Filesystem
Stephen C. Tweedie
[email protected]
Abstract
This paper describes work in progress on the design and implementation of a transactional metadata journal for the Linux ext2fs filesystem. We review the problem of recovering filesystems after a crash, and describe a design intended to increase the speed and reliability of ext2fs's crash recovery by adding a transactional journal to the filesystem.
Introduction
Filesystems are central parts of any modern operating system, and are expected to be both fast and exceedingly reliable. However, problems still occur,
and machines can go down unexpectedly, due to
hardware, software or power failures.
However, there are many internal aspects of a filesystem which are not so constrained, and which a filesystem implementor can design with a certain amount of freedom. The layout of data on disk (or, alternatively, its network protocol, if the filesystem is not local), the details of internal caching, and the algorithms used to schedule disk IO: these are all things which can be changed without necessarily violating the specification of the filesystem's application interface.
What's in a filesystem?
What functionality do we require of any filesystem?
There are obvious requirements which are dictated
by the operating system which the filesystem is
serving. The way that the filesystem appears to applications is one: operating systems typically require that filenames adhere to certain conventions and that files possess certain attributes which are interpreted in specific ways.
Performance should not suffer seriously as a result of using the new filesystem.
Filesystem Reliability
There are a number of issues at stake when we talk
about filesystem reliability. For the purpose of this
particular project, we are interested primarily in the
reliability with which we can recover the contents of
a crashed filesystem, and we can identify several
aspects of this:
Preservation: data which was stable on disk before the crash must never be damaged. Obviously, files which were being written out at the time of the crash cannot be guaranteed to be perfectly intact, but any files which were already safe on disk must not be touched by the recovery system.
Predictability: the failure modes from which we
have to recover should be predictable in order for us
to recover reliably.
Atomicity: many filesystem operations require a significant number of separate IOs to complete. A good
example is the renaming of a file from one directory
to another. Recovery is atomic if such filesystem
operations are either fully completed on disk or fully
undone after recovery finishes. (For the rename example, recovery should leave either the old or the
new filename committed to disk after a crash, but
not both.)
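To make the rename example concrete, the following minimal sketch (all structures and helper names invented for illustration, not taken from the ext2fs source) shows the two separate block writes involved in a cross-directory rename; a crash between them leaves exactly the kind of intermediate state that atomic recovery must clean up.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical directory block: a tiny fixed table of names. */
    struct dir_block {
        char names[8][32];
        uint32_t inodes[8];
    };

    /* Stand-in for submitting one block write to the disk driver. */
    static void write_block(const char *label, struct dir_block *b)
    {
        (void)b;
        printf("disk write: %s block\n", label);
    }

    static void dir_add_entry(struct dir_block *b, const char *name, uint32_t ino)
    {
        for (int i = 0; i < 8; i++)
            if (b->names[i][0] == '\0') {
                strncpy(b->names[i], name, 31);
                b->inodes[i] = ino;
                return;
            }
    }

    static void dir_del_entry(struct dir_block *b, const char *name)
    {
        for (int i = 0; i < 8; i++)
            if (strcmp(b->names[i], name) == 0)
                b->names[i][0] = '\0';
    }

    /*
     * A cross-directory rename touches two separate disk blocks.  A crash
     * between the two write_block() calls leaves both names on disk (with
     * this ordering); atomic recovery must collapse that back to one.
     */
    static void rename_sketch(struct dir_block *src, struct dir_block *dst,
                              const char *name, uint32_t ino)
    {
        dir_add_entry(dst, name, ino);     /* step 1: create the new entry */
        write_block("destination", dst);
        dir_del_entry(src, name);          /* step 2: remove the old entry */
        write_block("source", src);
    }

    int main(void)
    {
        struct dir_block a = {0}, b = {0};
        dir_add_entry(&a, "paper.tex", 42);
        rename_sketch(&a, &b, "paper.tex", 42);
        return 0;
    }

Note that writing the new entry before deleting the old one means a crash leaves a duplicate name rather than a lost file; choosing such orderings deliberately is the subject of the next section.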
Existing implementations
The Linux ext2fs filesystem offers preserving recovery, but it is non-atomic and unpredictable. Predictability is in fact a much more complex property than it appears at first sight. In order to mop up predictably after a crash, the recovery phase must be able to work out what the filesystem was trying to do at the time whenever it comes across an inconsistency representing an incomplete operation on disk. In general, this requires that the filesystem make its writes to disk in a predictable order whenever a single update operation changes multiple blocks on disk.
There are many ways of achieving this ordering between disk writes. The simplest is to wait for the first writes to complete before submitting the next ones to the device driver: the synchronous metadata update approach. This is the approach taken by the BSD Fast File System [1], which appeared in 4.2BSD and which has inspired many of the Unix filesystems that followed, including ext2fs.
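The following minimal sketch (buffer and driver interfaces invented for illustration) shows the synchronous approach: a dependent metadata write is simply not submitted until its predecessor is known to be complete.

    #include <stdio.h>

    /* Stand-ins for the buffer cache and the disk driver. */
    struct buffer { const char *label; };

    static void submit_write(struct buffer *b)
    {
        printf("submitted: %s\n", b->label);
    }

    static void wait_on_buffer(struct buffer *b)
    {
        /* In a real kernel this sleeps until the IO completes. */
        printf("completed: %s\n", b->label);
    }

    /*
     * Synchronous metadata update: ordering is enforced by refusing to
     * submit a dependent write until its predecessor is known to be on
     * disk.  The wait is exactly where the performance cost lives.
     */
    static void ordered_update(struct buffer *first, struct buffer *second)
    {
        submit_write(first);
        wait_on_buffer(first);   /* stall: nothing else may proceed */
        submit_write(second);
        wait_on_buffer(second);
    }

    int main(void)
    {
        struct buffer inode = { "inode block" }, dir = { "directory block" };
        ordered_update(&inode, &dir);
        return 0;
    }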
However, the big drawback of synchronous metadata update is its performance. If filesystem operations require that we wait for disk IO to complete, then we cannot batch multiple filesystem updates into a single disk write. For example, if we create a dozen directory entries in the same directory block on disk, then synchronous updates require us to write that block back to disk a dozen separate times.
There are ways around this performance problem. One way to keep the ordering of disk writes without actually waiting for the IOs to complete is to maintain an ordering between the disk buffers in memory, and to ensure that when we do eventually go to write back the data, we never write a block until all of its predecessors are safely on disk: the deferred ordered write technique.
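One plausible shape for this bookkeeping is sketched below, with invented structures: each dirty buffer carries a list of its predecessors, and the writeback pass skips any buffer whose predecessors have not yet reached disk.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_DEPS 4

    /* A cached disk block plus the blocks that must precede it to disk. */
    struct buffer {
        const char *label;
        bool on_disk;
        struct buffer *deps[MAX_DEPS];
        int ndeps;
    };

    /* A buffer may be written only when every predecessor is stable. */
    static bool writable(const struct buffer *b)
    {
        for (int i = 0; i < b->ndeps; i++)
            if (!b->deps[i]->on_disk)
                return false;
        return true;
    }

    /* One pass of the writeback daemon over the dirty list. */
    static void try_writeback(struct buffer *b)
    {
        if (writable(b)) {
            printf("writing %s\n", b->label);
            b->on_disk = true;
        } else {
            printf("deferring %s\n", b->label);  /* retried on a later pass */
        }
    }

    int main(void)
    {
        struct buffer bitmap = { "bitmap", false, {0}, 0 };
        struct buffer inode  = { "inode",  false, {&bitmap}, 1 };
        try_writeback(&inode);   /* deferred: bitmap not yet on disk */
        try_writeback(&bitmap);  /* written */
        try_writeback(&inode);   /* now writable */
        return 0;
    }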
One complication of deferred ordered writes is that
it is easy to get into a situation where there are cyclic dependencies between cached buffers. For example, if we try to rename a file between two directories and at the same time rename another file from
the second directory into the first, then we end up
with a situation where both directory blocks depend
on each other: neither can be written until the other
one is on disk.
Ganger's soft updates mechanism [2] neatly sidesteps this problem by selectively rolling back specific updates within a buffer if those updates still
have outstanding dependencies when we first try to
write that buffer out to disk. The missing update will
be restored later once all of its own dependencies
are satisfied. This allows us to write out buffers in
any order we choose when there are circular dependencies. The soft update mechanism has been
adopted by FreeBSD and will be available as part of
their next major kernel version.
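The sketch below illustrates the rollback idea; the update-record format is invented here for illustration, and the real soft updates implementation tracks dependencies in far more detail. Before a buffer is written, each individual update with an unmet dependency is undone in a shadow copy, so the disk never sees a premature update.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 64
    #define MAX_UPDATES 4

    /* One tracked update inside a cached block, with its undo data. */
    struct update {
        size_t off, len;
        unsigned char old[16];   /* pre-update bytes, bounded for the sketch */
        bool dependency_met;     /* has the block we depend on reached disk? */
    };

    struct buffer {
        unsigned char data[BLOCK_SIZE];
        struct update updates[MAX_UPDATES];
        int nupdates;
    };

    static void submit_write(const unsigned char *img)
    {
        printf("disk image starts with: %.8s\n", (const char *)img);
    }

    /*
     * Soft updates: we may write the buffer at any time by rolling back,
     * in a shadow copy only, each update whose dependency is unmet.  The
     * disk never sees a premature update; memory keeps the full state.
     */
    static void write_with_rollback(struct buffer *b)
    {
        unsigned char shadow[BLOCK_SIZE];
        memcpy(shadow, b->data, BLOCK_SIZE);

        for (int i = 0; i < b->nupdates; i++) {
            struct update *u = &b->updates[i];
            if (!u->dependency_met)
                memcpy(shadow + u->off, u->old, u->len); /* undo just this one */
        }
        submit_write(shadow);
    }

    int main(void)
    {
        struct buffer b = {0};
        memcpy(b.data, "newentry", 8);             /* update already applied */
        b.updates[0] = (struct update){ 0, 8, "freeslot", false };
        b.nupdates = 1;
        write_with_rollback(&b);  /* disk sees "freeslot", not "newentry" */
        return 0;
    }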
All of these approaches share a common problem, however. Although they ensure that the disk is in a predictable state all the way through the course of a filesystem operation, the recovery process still has to scan the entire disk in order to find and repair any uncompleted operations. Recovery becomes more reliable, but not necessarily any faster.
It is, however, possible to make filesystem recovery fast without sacrificing reliability and predictability. This is typically done by filesystems which guarantee atomic completion of filesystem updates (a single filesystem update is usually referred to as a transaction in such systems). The basic principle behind such journaled filesystems is to record every pending update in a compact, contiguous area of disk, the journal, so that after a crash, recovery need only scan the journal, replaying updates which were fully committed and discarding those which were not.
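The recovery half of that principle might look like the sketch below, with a toy journal-record format invented for illustration: recovery walks only the journal, replaying committed records and ignoring the rest.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define JBLOCK 16

    /* Invented journal record: "block #blocknr should contain <data>". */
    struct journal_record {
        uint32_t blocknr;
        unsigned char data[JBLOCK];
        bool committed;          /* covered by a commit record? */
    };

    /* A toy journal: in reality this is a contiguous region on disk. */
    static const struct journal_record journal[] = {
        { 7, "inode v2",  true  },
        { 9, "bitmap v2", true  },
        { 3, "inode v3",  false },   /* crash hit before this commit */
    };

    static void write_block(uint32_t nr, const unsigned char *data)
    {
        printf("replay block %u: %s\n", (unsigned)nr, (const char *)data);
    }

    /*
     * Recovery scans only the small journal, not the whole disk:
     * committed records are replayed, uncommitted ones discarded,
     * which is what makes each whole transaction atomic.
     */
    static void recover(void)
    {
        for (size_t i = 0; i < sizeof journal / sizeof journal[0]; i++)
            if (journal[i].committed)
                write_block(journal[i].blocknr, journal[i].data);
    }

    int main(void)
    {
        recover();
        return 0;
    }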
Anatomy of a transaction
A central concept when considering a journaled filesystem is the transaction, corresponding to a single
update of the filesystem. Exactly one transaction
results from any single filesystem request made by
an application, and contains all of the changed metadata resulting from that request. For example, a
write to a file will result in an update to the modification timestamp in the file's inode on disk, and
may also update the length information and the
block mapping information if the file is extended by
the write. Quota information, free disk space and
used block bitmaps will all have to be updated if
new blocks are allocated to the file, and all this must
be recorded in the transaction.
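The sketch below (block names hypothetical) shows the shape of such a transaction for a write which extends a file: four distinct metadata blocks join a single transaction, and commit must be all-or-nothing across all of them.

    #include <stdio.h>

    #define MAX_TBLOCKS 16

    /* A cached copy of one metadata block (contents elided). */
    struct buffer { const char *label; };

    /* All of the metadata modified by one filesystem request. */
    struct transaction {
        struct buffer *blocks[MAX_TBLOCKS];
        int nblocks;
    };

    static void txn_add(struct transaction *t, struct buffer *b)
    {
        if (t->nblocks < MAX_TBLOCKS)
            t->blocks[t->nblocks++] = b;
    }

    /*
     * Hypothetical metadata gathered by a write() that extends a file:
     * every one of these blocks must commit, or none of them may.
     */
    static void file_extend_txn(struct transaction *t)
    {
        static struct buffer inode  = { "inode (mtime, size, mapping)" };
        static struct buffer bitmap = { "block allocation bitmap" };
        static struct buffer quota  = { "owner's quota record" };
        static struct buffer group  = { "free-space counts" };

        txn_add(t, &inode);
        txn_add(t, &bitmap);
        txn_add(t, &quota);
        txn_add(t, &group);
    }

    int main(void)
    {
        struct transaction t = { {0}, 0 };
        file_extend_txn(&t);
        for (int i = 0; i < t.nblocks; i++)
            printf("in transaction: %s\n", t.blocks[i]->label);
        return 0;
    }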
There is another hidden operation in a transaction which we have to be aware of. Transactions also
involve reading the existing contents of the filesystem, and that imposes an ordering between transactions. A transaction which modifies a block on disk must commit before any transaction which reads the new contents of that block and then updates the disk based on what it read. The dependency exists even if the two transactions never try to write back the same blocks: imagine one transaction deleting a filename from one block in a directory and another transaction inserting the same filename into a different block. The two operations may not overlap in the blocks which they write, but the second operation is only valid after the first one succeeds (violating this would result in duplicate directory entries).
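One simple scheme which captures this read dependency, invented here for illustration rather than taken from the design, is to stamp each transaction with a sequence number, record on every cached block the transaction which last wrote it, and forbid a transaction from committing until everything it has read has committed.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Each cached block remembers which transaction last wrote it. */
    struct buffer { uint64_t written_by; };

    struct transaction {
        uint64_t tid;
        uint64_t must_wait_for;  /* highest tid among everything we read */
    };

    /* Record a read: we now depend on whoever wrote this block. */
    static void txn_note_read(struct transaction *t, const struct buffer *b)
    {
        if (b->written_by > t->must_wait_for)
            t->must_wait_for = b->written_by;
    }

    /* Commit is legal only once everything we read has committed. */
    static bool txn_may_commit(const struct transaction *t,
                               uint64_t last_committed_tid)
    {
        return t->must_wait_for <= last_committed_tid;
    }

    int main(void)
    {
        struct buffer dir_block = { .written_by = 5 };  /* written by txn 5 */
        struct transaction t = { .tid = 6, .must_wait_for = 0 };

        txn_note_read(&t, &dir_block);
        printf("commit after tid 4? %d\n", txn_may_commit(&t, 4));  /* no  */
        printf("commit after tid 5? %d\n", txn_may_commit(&t, 5));  /* yes */
        return 0;
    }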
Finally, there is one ordering requirement which
goes beyond ordering between metadata updates.
Before we can commit a transaction which allocates
new blocks to a file, we have to make absolutely
sure that all of the data blocks being created by the
transaction have in fact been written to disk (we
term these data blocks dependent data). Omitting this requirement would not actually damage the integrity of the filesystem's metadata, but it could potentially lead to a new file still containing a previous file's contents after crash recovery, which is a security risk as well as a consistency problem.
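A sketch of how a commit path might honour dependent data (helper names hypothetical): flush and wait on every newly allocated data block before the commit record is allowed out.

    #include <stdio.h>
    #include <stddef.h>

    struct buffer { const char *label; };

    static void submit_write(struct buffer *b)   { printf("submit %s\n", b->label); }
    static void wait_on_buffer(struct buffer *b) { printf("stable %s\n", b->label); }
    static void journal_commit(void)             { printf("commit record written\n"); }

    /*
     * Dependent data: data blocks newly allocated by this transaction
     * must be stable on disk before the metadata mapping them commits,
     * or recovery could expose a previous file's contents.
     */
    static void commit_with_dependent_data(struct buffer **data, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            submit_write(data[i]);      /* batch all the data IO first  */
        for (size_t i = 0; i < n; i++)
            wait_on_buffer(data[i]);    /* ...and wait for all of it    */
        journal_commit();               /* only now may the txn commit  */
    }

    int main(void)
    {
        struct buffer d1 = { "data block 1" }, d2 = { "data block 2" };
        struct buffer *data[] = { &d1, &d2 };
        commit_with_dependent_data(data, 2);
        return 0;
    }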
Merging transactions
Much of the terminology and technology used in a
journaled filesystem comes from the database
world, where journaling is a standard mechanism
for ensuring atomic commits of complex transactions. However, there are many differences between the environment of a database and that of a filesystem, and some of them allow the filesystem's use of transactions to be simplified considerably: in particular, many independent filesystem updates can be merged into a single compound transaction which is committed to the journal as a unit.
On-disk representation
The layout of the journaled ext2fs filesystem on disk
will be entirely compatible with existing ext2fs kernels. Traditional UNIX filesystems store data on
disk by associating each file with a unique numbered inode on the disk, and the ext2fs design already includes a number of reserved inode numbers.
We use one of these reserved inodes to store the
filesystem journal, and in all other respects the filesystem will be compatible with existing Linux kernels. The existing ext2fs design includes a set of
compatibility bitmaps, in which bits can be set to
indicate that the filesystem uses certain extensions.
By allocating a new compatibility bit for the journaling extension, we can ensure that even though
old kernels will be able to successfully mount a
new, journaled ext2fs filesystem, they will not be
permitted to write to the filesystem in any way.
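The sketch below shows how such a compatibility bit might gate write access at mount time; the field and flag names are invented for illustration, and the real ext2fs feature-flag layout differs in its details.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Invented slice of an ext2-like superblock. */
    struct superblock {
        uint32_t ro_compat;   /* "safe to read, unsafe to write" feature bits */
    };

    #define RO_COMPAT_JOURNAL 0x0001u  /* hypothetical journaling bit       */
    #define KNOWN_RO_COMPAT   0x0000u  /* an old kernel recognises no bits  */

    /*
     * An old kernel seeing an unknown read-only-compatible feature may
     * still mount the filesystem, but must refuse to mount it writable.
     */
    static bool must_mount_read_only(const struct superblock *sb)
    {
        return (sb->ro_compat & ~KNOWN_RO_COMPAT) != 0;
    }

    int main(void)
    {
        struct superblock sb = { .ro_compat = RO_COMPAT_JOURNAL };
        printf("force read-only mount: %s\n",
               must_mount_read_only(&sb) ? "yes" : "no");
        return 0;
    }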
Conclusions
The filesystem design outlined in this paper should
offer significant advantages over the existing ext2fs
filesystem on Linux. It should offer increased availability and reliability by making the filesystem recover more predictably and more quickly after a
crash, and should not cause much, if any, performance penalty during normal operations.
The most significant impact on day-to-day performance will be that newly created files will have to be synced to disk rapidly in order to commit the creations to the journal, rather than being allowed the deferred writeback of data normally supported by the kernel.
This may make the journaling filesystem unsuitable
for use on /tmp filesystems.
References
[1]
A Fast File System for UNIX. McKusick, Joy, Leffler and Fabry. ACM Transactions on Computer Systems, vol. 2, 1984.
[2]
Soft Updates: A Solution to the Metadata Update Problem in File Systems. Ganger and
Patt. Technical report CSE-TR-254-95, Computer Science and Engineering Division, University of Michigan, 1995.
[3]
The Design and Implementation of a Log-Structured File System. Rosenblum and Ousterhout. Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, 1991.