Why NFS Sucks
Olaf Kirch
SUSE/Novell, Inc.
[email protected]
One of the shortcomings of NFSv2 was its lack of cache consistency. NFS makes no guarantees that all clients looking at a file or directory see exactly the same content at any given moment. Instead, each client sees a snapshot of a file's state from a hopefully not too distant past. NFS attempts to keep this snapshot in sync with the server, but if two clients operate on a single file simultaneously, changes made by one client usually do not become visible immediately on the other client.

The design of NFSv4 was meant to facilitate deployment over wide area networks (making it an "Internet Filesystem"), to make the underlying file system model less Unix-centric, and to provide better interoperability with Microsoft Windows. It is not an accident that the NFSv4 working group formed at about the same time as Microsoft started rebranding their SMB file sharing protocol as the "Common Internet File System."
[…] These handles went very well with the statelessness paradigm, as they remained valid across server reboots.

Unfortunately, this sort of mechanism does not work very well for all file systems; in fact, it is a fairly Unix-centric thing to assume that files can be accessed by some static identifier, without going through their file system path name. Not all file systems have a notion of a file independent of its path (the DOS and early Windows file systems kept the "inode" information inside the directory entry), and not all operating systems will operate on a disconnected inode. Also, the assumption that an inode number is sufficient to locate a file on disk was true for these older file systems, but it is no longer valid for more recent designs.

These assumptions can be worked around to some degree, but the workarounds do not come for free, and they carry their own set of problems with them.

The easiest to fix is the inode number assumption—current Linux kernels allow file systems to specify a pair of functions that return a file handle for a given inode, and vice versa. This allows XFS, reiserfs and ext3 to have their own file handle representation without adding loads of ugly special-case code to nfsd.
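To make the idea concrete, here is a minimal sketch of such a pair of functions. This is illustrative only: the types and names (my_inode, example_fh, my_inode_lookup) are made up and do not reflect the actual in-kernel interface, whose hooks and signatures have changed over the years.

    #include <stdint.h>

    /* Illustrative only: hypothetical types standing in for a file
     * system's in-memory inode.  This is NOT the actual kernel API. */
    struct my_inode {
        uint32_t i_ino;
        uint32_t i_generation;   /* bumped when an inode number is reused */
        uint32_t i_parent_ino;
    };

    /* What goes inside the opaque NFS file handle. */
    struct example_fh {
        uint32_t ino;
        uint32_t generation;
        uint32_t parent_ino;     /* needed for non-directories, see below */
    };

    /* inode -> file handle: just pack the identifying information. */
    static void example_encode_fh(const struct my_inode *inode,
                                  struct example_fh *fh)
    {
        fh->ino        = inode->i_ino;
        fh->generation = inode->i_generation;
        fh->parent_ino = inode->i_parent_ino;
    }

    /* file handle -> inode: look the inode up again and check that it
     * is still the same file.  my_inode_lookup() is a placeholder for
     * whatever the file system uses to find an inode by number. */
    extern struct my_inode *my_inode_lookup(uint32_t ino);

    static struct my_inode *example_decode_fh(const struct example_fh *fh)
    {
        struct my_inode *inode = my_inode_lookup(fh->ino);

        if (inode == NULL || inode->i_generation != fh->generation)
            return NULL;         /* stale file handle -> ESTALE */
        return inode;
    }

The generation number is the piece that keeps the handle safe: if a file is deleted and its inode number reused, the old handle no longer matches and the client gets ESTALE instead of someone else's data.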
There is a second problem, though, which proved much harder to solve. Some time in the 1.2 kernel or so, Linux introduced the concept of the directory cache, aka the dcache. An entry in the dcache is called a dentry, and represents the relation between a directory and one of its entries. The semantics of the dcache do not allow disconnected inode data to float around in the kernel: there must always be a valid chain of dentries going from the root of the file system to the inode, and virtually all functions in the VFS layer expect a dentry as an argument instead of (or in addition to) the inode object they used to take.

This made things interesting for the NFS server, because the inode information is no longer sufficient to create something that the VFS layer is willing to operate on—now we also need a little bit of path information to reconstruct the dentry chain. For directories this is not hard, because each directory of a Unixish file system has an entry named ".." that takes you to the parent directory. The NFS server simply has to walk up that chain until it hits the file system root. But for any other file system object, including regular files, there is no such thing, and thus the file handle needs to include an identifier for the parent directory as well.
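The ".." walk is easy to demonstrate from user space. The little program below is only a sketch of the principle—nfsd does the equivalent on dentries inside the kernel—and it climbs from a directory to the root of the file system it lives on, recognizing the root by the fact that ".." no longer leads anywhere new:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Walk from the directory named on the command line up to the root
     * of the file system it lives on, printing each inode on the way. */
    int main(int argc, char **argv)
    {
        struct stat here, parent;

        if (argc != 2 || chdir(argv[1]) < 0) {
            perror("chdir");
            return 1;
        }

        for (;;) {
            if (stat(".", &here) < 0 || stat("..", &parent) < 0) {
                perror("stat");
                return 1;
            }
            printf("inode %lu on device %lu\n",
                   (unsigned long) here.st_ino,
                   (unsigned long) here.st_dev);

            /* At the file system root, ".." either leaves the device
             * (we are at a mount point) or is the same as "." (at /). */
            if (parent.st_dev != here.st_dev || parent.st_ino == here.st_ino)
                break;

            if (chdir("..") < 0) {
                perror("chdir ..");
                return 1;
            }
        }
        return 0;
    }

Compile it with something like cc -o walkup walkup.c and run it on any directory; the point is simply that this trick only exists for directories, which is why plain files need their parent recorded in the file handle.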
This creates another interesting dilemma, which is that a file hard-linked into several directories may be represented by different file handles, depending on the path it is accessed by. This is called aliasing and, depending on how well a client implementation handles it, may lead to inconsistencies in a client's attribute or data cache. Even worse, a rename operation moving a file from one directory to another will invalidate the old file handle.

As an interesting twist, NFSv4 introduces the concept of volatile file handles. For these file handles, the server makes no promises about how long they will be good. At any time, the server may return an error code indicating to the client that it has to re-look up the handle. It is not clear yet how well various NFSv4 implementations are actually able to cope with this.

3 Write Operations: As Slow as It Gets

Another problem with statelessness is how to prevent data loss or inconsistencies when the server crashes hard. For instance, a server may have acknowledged a client operation such as
the creation of a file. If the server crashes before that change has been committed to disk, the client will never know, and it is in no position to replay the operation.

The way NFS solved this problem was to mandate that the server commit every change to disk before replying to the client. This is not much of a problem for operations that usually happen infrequently, such as file creation or deletion. However, the requirement quickly becomes a major nuisance when writing large files, because each block sent to the server is written to disk separately, with the server waiting for the disk to do its job before it responds to the client.

Over the years, different ways to take the edge off this problem were devised. Several companies sold so-called "NFS accelerators"—basically a card with a lot of RAM and a battery on it, acting as an additional, persistent cache between the VFS layer and the disk. Other approaches involved trying to flush several parallel writes in one go (also called write gathering). None of these solutions was entirely satisfactory, and therefore virtually all NFS implementations provide an option for the administrator to turn off stable writes, trading performance for a (small) risk of data corruption or loss.

NFSv3 tries to improve on this by introducing a new writing strategy, where clients send a large number of write operations that are not written to disk directly, followed by a "commit" call that flushes all pending writes to disk. This does afford a noticeable performance improvement, but unfortunately, it does not solve all problems.

For one, NFS clients are required to keep all dirty pages around until the server has acknowledged the commit operation, because if the server is rebooted, they need to replay all these write operations. This means commit calls need to happen relatively frequently (once every few megabytes). Second, a commit operation can become fairly costly—RAIDs usually like writes that cover one or more stripes, and it helps if the client is smart enough to align its writes in clusters of 128K or more. Third, some journaling file systems can have fairly big delays in sync operations. If there is a lot of write traffic, it is not uncommon for the NFS server to stall completely for several seconds because all of its threads are servicing commit requests.

What's more, some of the performance gain in using write/commit is owed to the fact that modern disk drives have internal write buffers, so that flushing data to the disk device really just sends data to the disk's internal buffers, which is not sufficient for the type of guarantee NFS is trying to give. Forcing the block device to actually flush its internal write cache to disk incurs an additional delay.
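The difference between the two models can be felt even in an ordinary local program. The sketch below is only a rough analogy, not NFS wire traffic: O_SYNC writes stand in for NFSv2-style stable writes (every write waits for stable storage), while buffered writes followed by a single fsync() stand in for the NFSv3 WRITE/COMMIT pattern.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK   8192
    #define NCHUNKS 1024            /* 8 MB total */

    /* Rough analogy only: O_SYNC writes behave like NFSv2 stable writes
     * (every write waits for stable storage), while buffered writes
     * followed by one fsync() resemble the NFSv3 WRITE/COMMIT pattern. */
    static void write_file(const char *path, int extra_flags, int commit_at_end)
    {
        char buf[CHUNK];
        int  i, fd;

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | extra_flags, 0644);
        if (fd < 0) {
            perror(path);
            return;
        }

        memset(buf, 'x', sizeof(buf));
        for (i = 0; i < NCHUNKS; i++)
            if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                perror("write");

        if (commit_at_end && fsync(fd) < 0)     /* the "commit" call */
            perror("fsync");
        close(fd);
    }

    int main(void)
    {
        write_file("stable.dat",   O_SYNC, 0);  /* slow: every write is synchronous */
        write_file("unstable.dat", 0,      1);  /* fast: one flush at the very end  */
        return 0;
    }

On most systems the second variant is dramatically faster, for exactly the reasons described above—and it loses data in exactly the same way if the machine dies before the final flush.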
4 NFS over UDP—Fragmentation

Another "interesting" feature of NFSv2 was that the original implementations supported only UDP as the transport protocol. NFS over TCP did not come into widespread use until the late 1990s.

There have been various issues with the use of UDP for NFS over the years. At one point, some operating system shipped with UDP checksums turned off by default, presumably for performance reasons. That is a rather bad thing to do if you're doing NFS over UDP, because you can easily end up with silent data corruption that you will not notice until it is way too late, after the last backup tape with a correct version of your precious file has been overwritten.
A more recent problem with UDP has to do with fragmentation. The lower bound for an NFS packet size that makes sense for reads and writes is given by the client's page size, which is 4096 bytes on most architectures Linux runs on, and 8192 bytes is a rather common choice these days. Unless you're using jumbo frames (i.e. Ethernet frames of up to 9000 bytes), these packets get fragmented.

For those not familiar with IP fragmentation, here it is in a nutshell: if the sending system (or, in IPv4, any intermediate router) notices that an IP packet is too large for the network interface it needs to send it out on, it will break the packet up into several smaller pieces, each with a copy of the original IP header. So that the receiving system can tell which fragments go together, the sending system assigns each packet a 16-bit identifier, the IPID. The receiver will lump all packets with matching source address, destination address and IPID into one fragment chain, and when it finds it has received all the pieces, it will stitch them together and hand them to the network stack for further processing. In case a fragment gets lost, there is a so-called reassembly timeout, defaulting to 30 seconds. If the fragment chain is not completed during that interval, it will simply be discarded.

The bad thing is that on today's network hardware it is no big deal to send more than 65535 packets in under 30 seconds; in fact, it is not uncommon for the IPID counter to wrap around in 5 seconds or less. Assume a packet A, containing an NFS READ reply, is fragmented as, say, A1, A2, A3, and fragment A2 is lost. Then a few seconds later another NFS READ reply is transmitted, which receives the same IPID and is fragmented as B1, B2, B3. The receiver will discard fragment B1, because it already has a fragment chain for that IPID, and the part of the packet represented by B1 is already there. Then it will receive B2, which is exactly the piece of the puzzle that is missing, so it considers the fragment chain complete and reassembles a packet out of A1, B2, A3.
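A back-of-the-envelope calculation shows how quickly the 16-bit IPID space runs out. The numbers below are assumptions (link speed, an 8K transfer plus a rough allowance for RPC/NFS headers), but the order of magnitude is what matters:

    #include <stdio.h>

    /* How long does it take to wrap the 16-bit IPID counter if every
     * NFS READ/WRITE becomes one IP datagram (which then gets
     * fragmented)?  Purely illustrative; plug in your own numbers. */
    int main(void)
    {
        const double link_mbit[]   = { 100.0, 1000.0 };
        const double datagram_size = 8192.0 + 200.0;  /* payload + rough RPC/NFS overhead */
        int i;

        for (i = 0; i < 2; i++) {
            double bytes_per_sec     = link_mbit[i] * 1e6 / 8.0;
            double datagrams_per_sec = bytes_per_sec / datagram_size;
            double wrap_seconds      = 65536.0 / datagrams_per_sec;

            printf("%6.0f Mbit/s: ~%.0f datagrams/s, IPID wraps every ~%.1f s\n",
                   link_mbit[i], datagrams_per_sec, wrap_seconds);
        }
        return 0;
    }

At 100 Mbit/s this works out to a wraparound every 40-odd seconds; on gigabit Ethernet it is down to roughly four or five seconds—comfortably within the 30-second reassembly timeout, which is exactly the problem.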
Fortunately, the UDP checksum will usually catch these botched reassemblies. But not all of them—it is just another 16-bit quantity, so if the above happens a few thousand times, the probability of a matching checksum is decidedly non-zero. Depending on your hardware and test case, it is possible to reproduce silent data corruption within a few days or even a few hours.

Starting with kernel version 2.6.16, Linux has some code to protect against the ill side effects of IPID wraparound, by introducing a sort of sliding window of valid IPIDs. But that is really more of a band-aid than a real solution. The better approach is to use TCP instead, which avoids the problem entirely by not fragmenting at all.

5 Retransmitted Requests

As UDP is an unreliable protocol by design, NFS (or, more specifically, the RPC layer) needs to deal with packet loss. This creates all sorts of interesting problems, because we basically need to do all the things a reliable transport protocol does: retransmit lost packets, perform flow control (if the NFS implementation supports sending several requests in parallel), and avoid congestion. If you look at the RPC implementation in the Linux kernel, you will find a lot of things you may be familiar with from a TCP context, such as slow start, or round-trip time estimators for more accurate timeouts.

One of the less widely known problems with NFS over UDP, however, affects file system semantics. Consider a request to remove
a directory, which the server dutifully performed and acknowledged. If the server's reply gets lost, the client will retransmit the request, which will fail unexpectedly because the directory it is supposed to remove no longer exists!

Requests that will fail if retransmitted are called non-idempotent. To prevent these from failing, a request replay cache was introduced in the NFS server, where replies to the most recent non-idempotent requests are cached. The NFS server identifies a retransmitted request by checking the reply cache for an entry with the same source address and port, and the same RPC transaction ID (also known as the XID, a 32-bit counter).
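The heart of such a reply cache is just a lookup keyed on the (client address, port, XID) triple. The sketch below is deliberately simplistic—fixed size, linear scan, no insertion or eviction logic shown—and is not the data structure Linux's nfsd actually uses:

    #include <netinet/in.h>
    #include <stddef.h>
    #include <stdint.h>

    #define DRC_SLOTS 1024

    /* One cached reply to a recent non-idempotent request. */
    struct drc_entry {
        struct in_addr client;      /* client IP address    */
        uint16_t       port;        /* client source port   */
        uint32_t       xid;         /* RPC transaction ID   */
        int            in_use;
        size_t         reply_len;
        unsigned char  reply[512];  /* the cached RPC reply */
    };

    static struct drc_entry cache[DRC_SLOTS];

    /* Find a cached reply for a (possibly retransmitted) request. */
    static struct drc_entry *drc_lookup(struct in_addr client,
                                        uint16_t port, uint32_t xid)
    {
        int i;

        for (i = 0; i < DRC_SLOTS; i++) {
            struct drc_entry *e = &cache[i];

            if (e->in_use && e->xid == xid && e->port == port &&
                e->client.s_addr == client.s_addr)
                return e;           /* retransmission: resend this reply */
        }
        return NULL;                /* new request: execute it normally  */
    }

If the lookup hits, the server simply resends the cached reply instead of executing the non-idempotent operation a second time.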
This provides reasonable protection for NFS over UDP as long as the cache is big enough to hold replies for the client's maximum retransmit timeout. As of the 2.6.16 kernel, the Linux server's reply cache is rather too small, but there is work underway to rewrite it.

Interestingly, the reply cache is also useful when using TCP. TCP is not impacted the same way UDP is, since retransmissions are handled by the network transport layer. Still, TCP connections may break for various reasons, and the server may find the client retransmitting a request after reconnecting.

There is a little twist to this story. The TCP protocol specification requires that the host breaking the connection not reuse the same port number for a certain time (twice the maximum segment lifetime); this is also referred to as the TIME_WAIT state. But usually you do not want to wait that long before reconnecting. That means the new TCP connection will originate from a different port, and the server will fail to find the retransmitted request in its cache.

The sunrpc code in recent Linux kernels works around this by using a little-known method for disconnecting a TCP socket without going into TIME_WAIT, which allows it to reuse the same port immediately.

Strictly speaking, this is in violation of the TCP specification. While it avoids the problem with the reply cache, it remains to be seen whether it entails any negative side effects—for instance, how gracefully intermediate firewalls deal with seeing SYN packets for a connection that they think ought to be in TIME_WAIT.
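The text does not say which method the kernel uses, but one well-known way to get this effect from user space—and a reasonable guess at the idea—is an abortive close: with a zero-timeout SO_LINGER, close() sends an RST instead of going through the normal FIN handshake, so the socket never enters TIME_WAIT and the port can be bound again right away.

    #include <sys/socket.h>
    #include <unistd.h>

    /* Close a connected TCP socket abortively: a zero-timeout SO_LINGER
     * makes close() send an RST instead of a FIN, so the connection never
     * enters TIME_WAIT and the local port can be reused immediately.
     * (Whether sunrpc does exactly this is not stated in the text.) */
    static int abortive_close(int sock)
    {
        struct linger lin;

        lin.l_onoff  = 1;    /* linger on close...           */
        lin.l_linger = 0;    /* ...for zero seconds -> reset */

        if (setsockopt(sock, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin)) < 0)
            return -1;
        return close(sock);
    }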
6 Cache Consistency

As mentioned in the first section, NFS makes no guarantees that all clients see exactly the same data at all times.

Of course, during normal operation, accessing a file will show you the content that is actually there, not some random gibberish. However, if two or more clients read and write the same file simultaneously, NFS makes no effort to propagate all changes to all clients immediately.

An NFS client is permitted to cache changes locally and send them to the server whenever it sees fit. This sort of lazy write-back greatly helps write performance, but the flip side is that everyone else will be blissfully unaware of these changes before they hit the server. To make things just a little harder, there is also no requirement for a client to transmit its cached writes in any particular fashion, so dirty pages can (and often will) be written out in random order.

And even once the modified data arrives at the NFS server, not all clients will see the change immediately. This is because the NFS server does not keep track of who has a file open for reading and who does not (remember, we're stateless), so even if it wanted to, it cannot notify
clients of such a change. Therefore, it is the client's job to check regularly whether its cached data is still valid.

So a client that has read the file once may continue to use its cached copy of the file until the next time it decides to check for a change. If that check reveals the file has changed, the client is required to discard any cached data and retrieve the current copy from the server.

The way an NFS client detects changes to a file is peculiar as well. Again, as NFS is stateless, there is no easy way to attach a monotonic counter or any other kind of versioning information to a file or directory. Instead, NFS clients usually store the file's modification time and size along with the other cache details. At regular intervals (usually somewhere between 3 and 60 seconds), the client performs a so-called cache revalidation: it retrieves the current set of file attributes from the server and compares the stored values to the current ones. If they match, it assumes the file has not changed and the cached data is still valid. If there is a mismatch, all cached data is discarded, and dirty pages are flushed to the server.

Unfortunately, most file systems store time stamps with second granularity, so clients will fail to detect subsequent changes to a file if they happen within the same wall-clock second as their last revalidation. To compound the problem, NFS clients usually hold on to the data they have cached for as long as they see fit. So once the cache is out of sync with the server, it will continue to show this invalid information until the data is evicted from the cache to make room, or until the file's modification time changes again and forces the client to invalidate its cache.
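Reduced to pseudo-C, the revalidation step is nothing more than comparing two attribute snapshots. In the sketch below, a local stat() stands in for the GETATTR request a real NFS client would send; the one-second timestamp granularity problem described above is exactly the st_mtime comparison:

    #include <stdbool.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <time.h>

    /* What the client remembers about a cached file. */
    struct cached_attrs {
        time_t mtime;    /* last known modification time (1 s granularity!) */
        off_t  size;     /* last known file size                            */
    };

    /* Revalidate the cache for 'path'.  Returns true if the cached data
     * may still be used, false if it has to be thrown away.  stat()
     * stands in for the GETATTR an NFS client would send to the server. */
    static bool cache_still_valid(const char *path, struct cached_attrs *cached)
    {
        struct stat st;

        if (stat(path, &st) < 0)
            return false;               /* play it safe: drop the cache */

        if (st.st_mtime == cached->mtime && st.st_size == cached->size)
            return true;                /* unchanged... or changed within
                                           the same wall-clock second    */

        cached->mtime = st.st_mtime;    /* remember the new attributes */
        cached->size  = st.st_size;
        return false;                   /* discard cached pages */
    }

A writer that modifies the file again within the same second as a revalidation, without changing its size, slips through this check unnoticed.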
The only consistency guarantee made by NFS is called close-to-open consistency, which means that any changes made by you are flushed to the server on closing the file, and a cache revalidation occurs when you re-open it.

One can hardly fail to notice that there is a lot of handwaving in this sort of cache management. The model is adequate for environments where there is no concurrent read/write access to the same file by different clients, such as when exporting users' home directories, or a set of read-only data.

However, it fails badly when applications try to use NFS files concurrently, as some databases are known to do. This is simply not within the scope of the NFS standards, and while NFSv3 and NFSv4 do improve some aspects of cache consistency, these changes merely allow the client to cache more aggressively, not necessarily more correctly. For instance, NFSv4 introduces the concept of delegations, which is basically a promise that the server will notify the client if some other host opens the file for writing. Provided the server is willing and able to issue a delegation to the client, this allows the client to cache all writes for as long as it holds that delegation. But after the server revokes it, everyone just falls back to the old NFSv3 behavior of mtime-based cache revalidation.

There is no really good solution to this problem; all solutions so far either involve turning off caching to a fairly large degree, or extending the NFS protocol significantly.

Some documents recommend turning off caching entirely, by mounting the file system with the noac option, but this is really a desperate measure, because it kills performance completely.

Starting with the 2.6 kernel, the Linux NFS client supports O_DIRECT mode for file I/O, which turns off all read and write caching on a file descriptor. This is slightly better than using noac, as it still allows the caching of file
attributes, but it means applications need to be modified and recompiled to use it. Its primary use is in the area of databases.
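For completeness, this is roughly what O_DIRECT looks like from the application's side (hypothetical file name; the aligned buffer is the part people usually trip over, since O_DIRECT generally insists on suitably aligned buffers and offsets):

    #define _GNU_SOURCE              /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read one block of a (hypothetical) database file with the page
     * cache bypassed.  O_DIRECT generally wants buffer, offset and
     * length aligned; 4096 bytes is a safe choice in most setups. */
    int main(void)
    {
        const size_t blk = 4096;
        void        *buf;
        int          fd;

        if (posix_memalign(&buf, blk, blk) != 0)
            return 1;

        fd = open("data.db", O_RDONLY | O_DIRECT);
        if (fd < 0)
            return 1;

        if (read(fd, buf, blk) < 0)   /* goes to the server, not the cache */
            return 1;

        close(fd);
        free(buf);
        return 0;
    }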
Another approach to force a file to show a consistent view across different clients is to use NFS file locking, because taking and releasing a lock acts as a cache synchronization point. In fact, in the Linux NFS client, the file unlock operation actually implies a cache invalidation—so this kind of synchronization is not exactly free of cost either.
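As a sketch of that idiom (hypothetical helper, error handling kept to a minimum): take a POSIX lock over the file, do the I/O, drop the lock. On an NFS mount, acquiring the lock revalidates the cache and releasing it flushes and invalidates, so the read in between sees reasonably fresh data.

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Read the first 'len' bytes of a file under a POSIX advisory lock.
     * On an NFS mount, the lock/unlock pair doubles as a cache
     * synchronization point, as described above. */
    static ssize_t locked_read(const char *path, char *buf, size_t len)
    {
        struct flock fl;
        ssize_t      n  = -1;
        int          fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;

        fl.l_type   = F_RDLCK;                  /* shared lock...          */
        fl.l_whence = SEEK_SET;
        fl.l_start  = 0;
        fl.l_len    = 0;                        /* ...over the whole file  */

        if (fcntl(fd, F_SETLKW, &fl) == 0) {    /* take lock: revalidate   */
            n = read(fd, buf, len);

            fl.l_type = F_UNLCK;
            fcntl(fd, F_SETLK, &fl);            /* drop lock: invalidate   */
        }
        close(fd);
        return n;
    }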
Solutions involving changes to the NFS protocol include Spritely NFS and NQNFS, but these should probably be considered mostly research. It is questionable whether this gap in the NFS design will ever be addressed, or whether this is left for others to solve, such as OCFS2, GFS or Lustre.

[…] cp -p does not preserve the time stamp when copying files to NFS. The reason is the NFS write cache, which usually does not get flushed until the file is closed. The way cp -p does its job is by creating the output file and writing all data first; then it calls utimes to set the modification time stamp, and then closes the file. The close would then see that there are still pending writes, and flush them out to the server, clobbering the file's mtime as a result. The only viable fix for this is to make sure the NFS client flushes all dirty pages before performing the utimes update—in other words, utimes acts like fsync.
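The sequence is easy to reproduce. The sketch below (with a made-up file name and time stamps) performs the same create/write/utimes/close dance as cp -p—exactly the order that gets the mtime clobbered unless the client flushes dirty pages before applying the new time stamps:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* Mimic what cp -p does: write the data, set the time stamps, then
     * close.  On NFS the final close() may flush dirty pages and clobber
     * the mtime we just set, unless utimes() forces the flush first. */
    int main(void)
    {
        const char     *msg = "hello, preserved mtime?\n";
        struct timeval  times[2];
        int             fd;

        fd = open("copy.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;

        write(fd, msg, strlen(msg));     /* 1. write all of the data         */

        times[0].tv_sec  = 1000000000;   /* 2. copy the source's time stamps */
        times[0].tv_usec = 0;            /*    (made-up values here)         */
        times[1] = times[0];
        utimes("copy.out", times);

        close(fd);                       /* 3. close -- may flush and bump mtime */
        return 0;
    }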
Some other cases are a bit stranger. One such case is the ability to write to an open but unlinked file. POSIX says an application can open a file for reading and writing, unlink it, and continue to do I/O on it. The file is not supposed to go away until the last application closes it.
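A minimal demonstration of that guarantee (it works as shown on a local POSIX file system; on NFS this is the case the client's .nfsXXX rename trick exists for):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* POSIX allows I/O on a file that has already been unlinked; the
     * data only goes away when the last descriptor is closed. */
    int main(void)
    {
        char buf[32] = { 0 };
        int  fd = open("scratch.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
            return 1;

        unlink("scratch.tmp");            /* the name is gone...          */

        write(fd, "still here", 10);      /* ...but I/O keeps working     */
        lseek(fd, 0, SEEK_SET);
        read(fd, buf, sizeof(buf) - 1);
        printf("read back: %s\n", buf);

        close(fd);                        /* now the blocks are released  */
        return 0;
    }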
[…] work across different clients. But that should not come as a surprise given the lack of cache consistency.

Things get outright weird, though, if you consider what happens when someone tries to unlink such a .nfsXXX file. The Linux client does not allow this, in order to maintain POSIX semantics as much as possible. The undesirable side effect of this is that an rm -rf call will fail to remove a directory if it contains a file that is currently open to some application.

But the weirdest part of POSIX conformance is probably the handling of access control lists, and as such it deserves a section of its own.

[…]

However, people who use ACLs usually want to be able to view and modify them, too, without having to log on to the server machine. NFS protocol versions 2 and 3 do not provide any mechanisms for querying or updating ACLs at all, so different vendors devised their own side-band protocols that added this functionality. These are usually implemented as additional RPC programs available on the same port as the NFS server itself. According to various sources, there were at least four different ACL protocols, all of them mutually incompatible. So an SGI NFS client could do ACLs when talking to an SGI NFS server, or a Solaris client could do the same when talking to a Solaris server.
[…] NFSv3 or locally on the server machines, these ACLs are ignored.

The ironic part of the story is that Sun, which was one of the driving forces behind the NFSv4 standard, added an NFSv4 version to their ACL side-band protocol which allows querying and updating POSIX ACLs without having to translate them to NFSv4 ACLs and back.

9 NFS Security

[…] to accept these as legitimate—opening the door to replay attacks.

A few years ago, a new RPC authentication flavor based on GSSAPI was defined and standardized; it provides different levels of security, ranging from the old-style sort of authentication restricted to the RPC header, to integrity and/or privacy. And since GSSAPI is agnostic of the underlying security system, this authentication mechanism can be used to integrate NFS security with any security system that provides a GSSAPI binding.
[…]

Third, lockd does not only have to run on the server; it must be active on the client as well. That is because when a client has blocked on a lock request and the lock can later be granted, the server is supposed to send a callback to the client—so lockd must be listening there, too. This creates all kinds of headaches when doing NFS through firewalls.

File locking is inherently a stateful operation, which does not go well with the statelessness paradigm of the NFS protocol. In order to address this, mechanisms for lock reclaim were added to NLM—if an NFS server reboots, there is a so-called grace period during which clients can re-register all the locks they were holding with the server.

Obviously, in order to make this work, clients need to be notified when a server reboots. For this, yet another side-band protocol was designed, called the Network Status Monitor, or NSM.

[…]

AFS, the Andrew File System, was originally developed jointly by Carnegie Mellon University and IBM. It was probably never a huge success outside academia and research installations, despite the fact that the Open Group made it the basis of the distributed file system for DCE (and charged an arm and a leg for it). Late in its life cycle, it was released by IBM under an open source license, which managed to breathe a little life back into it.

AFS is a very featureful distributed file system. Among other things, it provides good security through the use of Kerberos 4, location-independent naming, and support for migration and replication.

On the down side, it comes with its own server-side storage file system, so you cannot simply export your favorite journaling file system over AFS. Code portability, especially to 64-bit platforms, and the sort of #ifdef accretion
that can occur over the course of 20 years is also an issue.

12 CIFS

CIFS, the Common Internet File System, is what was colloquially referred to as SMBfs some time ago. Microsoft's distributed file system is session-based, and sticks closely to the file system semantics of Windows file systems.

Samba, and the Linux smbfs and cifs clients, have demonstrated that it is possible for Unix platforms to interoperate with Windows machines using CIFS, but some things from the POSIX world remain hard to map to their Windows equivalents and vice versa, with Access Control Lists (ACLs) being the most notorious example.

CIFS provides some cache consistency through the use of op-locks. It is a stateful protocol, and crash recovery is usually the job of the application (we're probably all familiar with Abort/Retry/Ignore dialog boxes).

[…] Windows machines. However, CIFS could be serious competition to NFS in the Linux world, too—the biggest obstacle in this arena is not a technical one, but the fact that it is controlled entirely by Microsoft, who like to spring the occasional surprise or two on the open source world.

13 Cluster Filesystems

Another important area of development in the world of distributed file systems is clustered file systems such as Lustre, GFS and OCFS2. The latter in particular looks very interesting, as its kernel component is relatively small and seems to be well designed.

Cluster file systems are currently no replacement for file systems such as NFS or CIFS, because they usually require a lot more in terms of infrastructure. Most of them do not scale very well beyond a few hundred nodes either.