
Why NFS Sucks

Olaf Kirch
SUSE/Novell, Inc.
[email protected]

Abstract

NFS is really the distributed file system in the Unix world—and at the same time it is probably also one of its most reviled components. For just about every SUSE release, there's a bug in our bugzilla with a summary line of "NFS sucks." NFS even has a whole chapter of its own in the Unix Haters' Handbook. And having hacked quite a bit of NFS code over the course of 8 years, the author cannot help agreeing that NFS as a whole does have a number of warts.

This presentation is an attempt at answering why this is so. It will take a long look at some of the stranger features of NFS, why they came into existence, and how they affect stability, performance and POSIX conformance of the file system. The talk will also present some historical background, and compare NFS to other distributed file systems.

The author feels compelled to mention that this is not a complaint about the quality of the Linux NFS implementation, which is in fact pretty good.

1 History

One of the earliest networked file systems was RFS, the Remote File System included in SVR3. It was based on a concept called "Remote System Calls," where each system call was mapped directly to a call to the server system. This worked reasonably well, but was largely limited to SVR3 because of the SVR3 system call semantics.

Another problem with RFS was that it did not tolerate server crashes or reboots very well. Due to the design, the server had to keep a lot of state for every client, and, in fact, for every file opened by a client. This state could not be recovered after a server reboot, so when an RFS server went down, it usually took all its clients with it.

This early experience helps to understand some of the design decisions made in the first NFS version developed by Sun in 1985. This was NFS version 2, and it was first included in SunOS 2.0. Rumor has it that there was also a version 1, but it never got released to the world outside Sun.

NFSv2 attempted to address the shortcomings of RFS by making the server entirely stateless, and by defining a minimal set of remote procedures that provided a basic set of file system operations in a way that was a lot less operating system dependent than RFS. It also tried to be agnostic of the underlying file system, to a degree that it could be adapted to different Unix file systems with relative ease (doing the same for non-Unix file systems proved harder).
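To give a sense of just how minimal that procedure set was, the sketch below enumerates the NFSv2 procedures roughly as RFC 1094 defines them; two of them were dead on arrival and never used in practice.

    /* The complete NFSv2 procedure set, roughly as defined in RFC 1094.
     * Listed only to illustrate how small the protocol surface is; ROOT
     * and WRITECACHE were defined but never actually used. */
    enum nfsv2_procedure {
        NFSPROC_NULL       = 0,
        NFSPROC_GETATTR    = 1,
        NFSPROC_SETATTR    = 2,
        NFSPROC_ROOT       = 3,   /* obsolete */
        NFSPROC_LOOKUP     = 4,
        NFSPROC_READLINK   = 5,
        NFSPROC_READ       = 6,
        NFSPROC_WRITECACHE = 7,   /* never implemented */
        NFSPROC_WRITE      = 8,
        NFSPROC_CREATE     = 9,
        NFSPROC_REMOVE     = 10,
        NFSPROC_RENAME     = 11,
        NFSPROC_LINK       = 12,
        NFSPROC_SYMLINK    = 13,
        NFSPROC_MKDIR      = 14,
        NFSPROC_RMDIR      = 15,
        NFSPROC_READDIR    = 16,
        NFSPROC_STATFS     = 17,
    };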

One of the shortcomings of NFSv2 was its lack of cache consistency. NFS makes no guarantees that all clients looking at a file or directory see exactly the same content at any given moment. Instead, each client sees a snapshot of a file's state from a hopefully not too distant past. NFS attempts to keep this snapshot in sync with the server, but if two clients operate on a single file simultaneously, changes made by one client usually do not become visible immediately on the other client.

In 1988, Spritely NFS was released, which extended the protocol to add a cache consistency mechanism to NFSv2. To achieve this, it sacrificed the server's statelessness, so that it was generally impossible for a client to recover from a server crash. Crash recovery for Spritely NFS was not added until 6 years later, in 1994.

At about the same time, NQNFS (the "Not-Quite NFS") was introduced in 4.4 BSD. NQNFS is a backward compatible protocol extension that adds the concept of leases to NFS, which is another mechanism to provide cache consistency. Unfortunately, it never gained wide acceptance outside the BSD camp.

In 1995, the specification of NFSv3 was published (written mostly by Rick Macklem, who also wrote NQNFS). NFSv3 includes several improvements over NFSv2, most of which can be categorized as performance enhancements. However, NFSv3 did not include any cache consistency mechanisms.

The year 1997 saw the publication of a standard called WebNFS, which was supposed to position NFS as an alternative to HTTP. It never gained any real following outside of Sun, and made a quiet exit after the Internet bubble burst.

The latest development in the NFS area is NFSv4; the first version of this standard was published in 2002. One of the major goals in the design of NFSv4 was to facilitate deployment over wide area networks (making it an "Internet Filesystem"), and to make the underlying file system model less Unix centric and provide better interoperability with Microsoft Windows. It is not an accident that the NFSv4 working group formed at about the same time as Microsoft started rebranding their SMB file sharing protocol as the "Common Internet File System."

2 NFS File Handles

One of the nice things about NFS is that it allows you to export very different types of file systems to the world. You're not stuck with a single file system implementation the way you are with AFS, for instance. NFS does not care if it is reiser, ext3 or XFS you export, a CD or a DVD.

A direct consequence of this is that NFS needs a fairly generic mechanism to identify the objects residing on a file system. This is what file handles are for. From the client's perspective, these are just opaque blobs of data, like a magic cookie. Only the server needs to understand the internal format of a file handle. In NFSv2, these handles were a fixed 32 bytes; NFSv3 makes them variable sized up to 64 bytes, and NFSv4 doubles that once more.

Another constraint is related to the statelessness paradigm: file handles must be persistent, i.e. when the server crashes and reboots, the file handles held by its clients must still be valid, so that the clients can continue whatever they were doing at that moment (e.g. writing to a file).

In the Unix world of the mid-80s and early 90s, a file handle merely represented an inode—and in fact, in most implementations, the file handle just contained the device and inode number of the file it represented (plus some additional export identification we will ignore here). These handles went very well with the statelessness paradigm, as they remained valid across server reboots.
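As an illustration, a classic handle of that era might have looked roughly like the sketch below. This is not the format of any particular server; it merely shows the minimum a "device plus inode" handle had to carry.

    /* Hypothetical layout of a classic "device + inode" NFS file handle.
     * Real servers pad this out to the fixed 32 bytes NFSv2 requires and
     * add export identification; the point is that nothing in here refers
     * to in-memory state, so the handle survives a server reboot. */
    struct classic_nfs_fh {
        unsigned int fsid;        /* identifies the exported device/file system */
        unsigned int ino;         /* inode number within that file system */
        unsigned int generation;  /* guards against inode number reuse */
    };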

Unfortunately, this sort of mechanism does not work very well for all file systems; in fact, it is a fairly Unix centric thing to assume that files can be accessed by some static identifier, and without going through their file system path name. Not all file systems have a notion of a file independent from its path (the DOS and early Windows file systems kept the "inode" information inside the directory entry), and not all operating systems will operate on a disconnected inode. Also, the assumption that an inode number is sufficient to locate a file on disk was true with these older file systems, but that is no longer valid with more recent designs.

These assumptions can be worked around to some degree, but these workarounds do not come for free, and carry their own set of problems with them.

The easiest to fix is the inode number assumption—current Linux kernels allow file systems to specify a pair of functions that return a file handle for a given inode, and vice versa. This allows XFS, reiser and ext3 to have their own file handle representation without adding loads of ugly special case code to nfsd.

There is a second problem though, which proved much harder to solve. Some time in the 1.2 kernel or so, Linux introduced the concept of the directory cache, aka the dcache. An entry in the dcache is called a dentry, and represents the relation between a directory and one of its entries. The semantics of the dcache do not allow disconnected inode data floating around in the kernel; it requires that there is always a valid chain of dentries going from the root of the file system to the inode; and virtually all functions in the VFS layer expect a dentry as an argument instead of (or in addition to) the inode object they used to take.

This made things interesting for the NFS server, because the inode information is no longer sufficient to create something that the VFS layer is willing to operate on—now we also need a little bit of path information to reconstruct the dentry chain. For directories this is not hard, because each directory of a Unixish file system has a file named ".." that takes you to the parent directory. The NFS server simply has to walk up that chain until it hits the file system root. But for any other file system object, including regular files, there is no such thing, and thus the file handle needs to include an identifier for the parent directory as well.

This creates another interesting dilemma, which is that a file hard linked into several directories may be represented by different file handles, depending on the path it is accessed by. This is called aliasing and, depending on how well a client implementation handles this, may lead to inconsistencies in a client's attribute or data cache. Even worse, a rename operation moving a file from one directory to another will invalidate the old file handle.

As an interesting twist, NFSv4 introduces the concept of volatile file handles. For these file handles, the server makes no promises about how long they will remain good. At any time, the server may return an error code indicating to the client that it has to re-lookup the handle. It is not clear yet how well various NFSv4 implementations are actually able to cope with this.
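To make the workaround described above a little more concrete, here is a rough sketch of such a pair of conversion functions. All names are invented for illustration and the real Linux export interface differs in detail, but the shape is the same: non-directories have to carry their parent's identity in the handle, which is exactly what gives rise to the aliasing problem for hard-linked files.

    /* Hypothetical handle layout and conversion hooks a file system might
     * provide to nfsd; not the actual kernel interface. */
    struct my_handle {
        unsigned int ino;         /* inode number of the object itself */
        unsigned int generation;  /* guards against inode number reuse */
        unsigned int parent_ino;  /* for non-directories: the parent directory,
                                   * needed to rebuild a connected dentry chain */
    };

    struct my_inode;
    struct my_dentry;
    struct my_superblock;

    /* inode -> handle: called when nfsd hands out a handle to a client */
    int my_encode_handle(const struct my_inode *inode,
                         const struct my_inode *parent,
                         struct my_handle *fh);

    /* handle -> dentry: called when a client presents a handle; has to look
     * up the object and splice it back into the dcache so the VFS has a
     * valid dentry chain to operate on */
    struct my_dentry *my_decode_handle(struct my_superblock *sb,
                                       const struct my_handle *fh);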

3 Write operations: As Slow as it Gets

Another problem with statelessness is how to prevent data loss or inconsistencies when the server crashes hard. For instance, a server may have acknowledged a client operation such as the creation of a file. If the server crashes before that change has been committed to disk, the client will never know, and it is in no position to replay the operation.

The way NFS solved this problem was to mandate that the server commits every change to disk before replying to the client. This is not that much of a problem for operations that usually happen infrequently, such as file creation or deletion. However, this requirement quickly becomes a major nuisance when writing large files, because each block sent to the server is written to disk separately, with the server waiting for the disk to do its job before it responds to the client.

Over the years, different ways to take the edge off this problem were devised. Several companies sold so-called "NFS accelerators," which were basically cards with a lot of RAM and a battery on them, acting as an additional, persistent cache between the VFS layer and the disk. Other approaches involved trying to flush several parallel writes in one go (also called write gathering). None of these solutions was entirely satisfactory, and therefore, virtually all NFS implementations provide an option for the administrator to turn off stable writes, trading performance for a (small) risk of data corruption or loss.

NFSv3 tries to improve this by introducing a new writing strategy, where clients send a large number of write operations that are not written to disk directly, followed by a "commit" call that flushes all pending writes to disk. This does afford a noticeable performance improvement, but unfortunately, it does not solve all problems.

First, NFS clients are required to keep all dirty pages around until the server has acknowledged the commit operation, because in case the server was rebooted, they need to replay all these write operations. This means commit calls need to happen relatively frequently (once every few Megabytes). Second, a commit operation can become fairly costly—RAIDs usually like writes that cover one or more stripes, and it helps if the client is smart enough to align its writes in clusters of 128K or more. Third, some journaling file systems can have fairly big delays in sync operations. If there is a lot of write traffic, it is not uncommon for the NFS server to stall completely for several seconds because all of its threads are servicing commit requests.

What's more, some of the performance gain in using write/commit is owed to the fact that modern disk drives have internal write buffers, so that flushing data to the disk device really just sends data to the disk's internal buffers, which is not sufficient for the type of guarantee NFS is trying to give. Forcing the block device to actually flush its internal write cache to disk incurs an additional delay.
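From the client's point of view, the write-back strategy can be sketched as follows. All helper names are invented; this is an illustration of the idea, not the actual Linux client code.

    #include <stddef.h>

    struct dirty_list;   /* opaque: the client's list of dirty pages */
    struct page_buf;     /* opaque: one dirty page */

    /* hypothetical helpers, declared only to make the sketch complete */
    struct page_buf *next_dirty_page(struct dirty_list *d);
    size_t page_size(const struct page_buf *p);
    void nfs3_write_unstable(struct page_buf *p);     /* server may cache in RAM */
    int  nfs3_commit(struct dirty_list *d);           /* flush to stable storage */
    void resend_all_uncommitted(struct dirty_list *d);
    void release_committed_pages(struct dirty_list *d);

    #define COMMIT_INTERVAL (4 * 1024 * 1024)   /* commit every few megabytes */

    void flush_dirty_pages(struct dirty_list *dirty)
    {
        size_t sent_since_commit = 0;
        struct page_buf *p;

        while ((p = next_dirty_page(dirty)) != NULL) {
            nfs3_write_unstable(p);
            sent_since_commit += page_size(p);

            if (sent_since_commit >= COMMIT_INTERVAL) {
                /* The pages may only be dropped once COMMIT succeeds; if the
                 * server rebooted in between (NFSv3 detects this through the
                 * write verifier it returns), everything is sent again. */
                if (nfs3_commit(dirty) != 0)
                    resend_all_uncommitted(dirty);
                else
                    release_committed_pages(dirty);
                sent_since_commit = 0;
            }
        }
    }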

4 NFS over UDP—Fragmentation

Another "interesting" feature of NFSv2 was that the original implementations supported only UDP as the transport protocol. NFS over TCP did not come into widespread use until the late 1990s.

There have been various issues with the use of UDP for NFS over the years. At one point, some operating system shipped with UDP checksums turned off by default, presumably for performance reasons. This is a rather bad thing to do if you're doing NFS over UDP, because you can easily end up with silent data corruption that you will not notice until it is way too late, and the last backup tape having a correct version of your precious file has been overwritten.

A more recent problem with UDP has to do with fragmentation. The lower bound for the NFS packet size that makes sense for reads and writes is given by the client's page size, which is 4096 for most architectures Linux runs on, and 8192 is a rather common choice these days. Unless you're using jumbograms (i.e. Ethernet frames of up to 9000 bytes), these packets get fragmented.

For those not familiar with IP fragmentation, here it is in a nutshell: if the sending system (or, in IPv4, any intermediate router) notices that an IP packet is too large for the network interface it needs to send it out on, it will break up the packet into several smaller pieces, each with a copy of the original IP header. So that the receiving system can tell which fragments go together, the sending system assigns each packet a 16bit identifier, the IPID. The receiver will lump all packets with matching source address, destination address and IPID into one fragment chain, and when it finds it has received all the pieces, it will stitch them together and hand them to the network stack for further processing. In case a fragment gets lost, there is a so-called reassembly timeout, defaulting to 30 seconds. If the fragment chain is not completed during that interval, it will simply be discarded.

The bad thing is, on today's network hardware, it is no big deal to send more than 65535 packets in under 30 seconds; in fact it is not uncommon for the IPID counter to wrap around in 5 seconds or less. Assume a packet A, containing an NFS READ reply, is fragmented as say A1, A2, A3, and fragment A2 is lost. Then a few seconds later another NFS READ reply is transmitted, which receives the same IPID, and is fragmented as B1, B2, B3. The receiver will discard fragment B1, because it already has a fragment chain for that IPID, and the part of the packet represented by B1 is already there. Then it will receive B2, which is exactly the piece of the puzzle that is missing, so it considers the fragment chain complete and reassembles a packet out of A1, B2, A3.

Fortunately, the UDP checksum check will usually catch these botched reassemblies. But not all of them—it is just another 16bit quantity, so if the above happens a few thousand times, the probability of a matching checksum is decidedly non-zero. Depending on your hardware and test case, it is possible to reproduce silent data corruption within a few days or even a few hours.

Starting with kernel version 2.6.16, Linux has some code to protect from the ill side effects of IPID wraparound, by introducing some sort of sliding window of valid IPIDs. But that is really more of a band-aid than a real solution. The better approach is to use TCP instead, which avoids the problem entirely by not fragmenting at all.

5 Retransmitted Requests

As UDP is an unreliable protocol by design, NFS (or, more specifically, the RPC layer) needs to deal with packet loss. This creates all sorts of interesting problems, because we basically need to do all the things a reliable transport protocol does: retransmitting lost packets, flow control (if the NFS implementation supports sending several requests in parallel), and congestion avoidance. If you look at the RPC implementation in the Linux kernel, you will find a lot of things you may be familiar with from a TCP context, such as slow start, or estimators for round-trip times for more accurate timeouts.

One of the less widely known problems with NFS over UDP, however, affected the file system semantics. Consider a request to remove a directory, which the server dutifully performed and acknowledged. If the server's reply gets lost, the client will retransmit the request, which will fail unexpectedly because the directory it is supposed to remove no longer exists!

Requests that will fail if retransmitted are called non-idempotent. To prevent these from failing, a request replay cache was introduced in the NFS server, where replies to the most recent non-idempotent requests are cached. The NFS server identifies a retransmitted request by checking the reply cache for an entry with the same source address and port, and the same RPC transaction ID (also known as the XID, a 32bit counter).

This provides reasonable protection for NFS over UDP as long as the cache is big enough to hold replies for the client's maximum retransmit timeout. As of the 2.6.16 kernel, the Linux server's reply cache is rather too small, but there is work underway to rewrite it.

Interestingly, the reply cache is also useful when using TCP. TCP is not impacted the same way UDP is, since retransmissions are handled by the network transport layer. Still, TCP connections may break for various reasons, and the server may find the client retransmitting a request after reconnecting.

There is a little twist to this story. The TCP protocol specification requires that the host breaking the connection does not reuse the same port number for a certain time (twice the maximum segment lifetime); this is also referred to as the TIME_WAIT state. But usually you do not want to wait that long before reconnecting. That means the new TCP connection will originate from a different port, and the server will fail to find the retransmitted request in its cache.

The sunrpc code in recent Linux kernels works around this by using a little known method for disconnecting a TCP socket without going into TIME_WAIT, which allows it to reuse the same port immediately.

Strictly speaking, this is in violation of the TCP specification. While it avoids the problem with the reply cache, it remains to be seen whether it entails any negative side effects—for instance, how gracefully intermediate firewalls may deal with seeing SYN packets for a connection that they think ought to be in TIME_WAIT.
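A minimal sketch of the reply cache lookup discussed above is shown below. The names are invented and the real nfsd structures look different, but the key really is just the client's address and port plus the XID (usually with the procedure number thrown in as a sanity check).

    #include <stddef.h>
    #include <stdint.h>
    #include <netinet/in.h>

    /* Hypothetical duplicate-reply cache entry. */
    struct reply_cache_entry {
        struct reply_cache_entry *next;
        struct sockaddr_in        client;      /* source address and port */
        uint32_t                  xid;         /* RPC transaction ID */
        uint32_t                  proc;        /* procedure, as a sanity check */
        void                     *saved_reply; /* the reply we sent last time */
        size_t                    reply_len;
    };

    /* Return the cached reply for a retransmission, or NULL if the request
     * has not been seen before and must actually be executed. */
    struct reply_cache_entry *
    lookup_reply(struct reply_cache_entry *cache,
                 const struct sockaddr_in *client, uint32_t xid, uint32_t proc)
    {
        for (struct reply_cache_entry *e = cache; e != NULL; e = e->next) {
            if (e->xid == xid && e->proc == proc &&
                e->client.sin_port == client->sin_port &&
                e->client.sin_addr.s_addr == client->sin_addr.s_addr)
                return e;   /* retransmission: resend the saved reply */
        }
        return NULL;        /* new request: execute it and cache the reply */
    }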

6 Cache Consistency

As mentioned in the first section, NFS makes no guarantees that all clients see exactly the same data at all times.

Of course, during normal operation, accessing a file will show you the content that is actually there, not some random gibberish. However, if two or more clients read and write the same file simultaneously, NFS makes no effort to propagate all changes to all clients immediately.

An NFS client is permitted to cache changes locally and send them to the server whenever it sees fit. This sort of lazy write-back greatly helps write performance, but the flip side is that everyone else will be blissfully unaware of these changes before they hit the server. To make things just a little harder, there is also no requirement for a client to transmit its cached writes in any particular fashion, so dirty pages can (and often will) be written out in random order.

And even once the modified data arrives at the NFS server, not all clients will see this change immediately. This is because the NFS server does not keep track of who has a file open for reading and who does not (remember, we're stateless), so even if it wanted to, it cannot notify clients of such a change. Therefore, it is the client's job to check regularly whether its cached data is still valid.

So a client that has read the file once may continue to use its cached copy of the file until the next time it decides to check for a change. If that check reveals the file has changed, the client is required to discard any cached data and retrieve the current copy from the server.

The way an NFS client detects changes to a file is peculiar as well. Again, as NFS is stateless, there is no easy way to attach a monotonic counter or any other kind of versioning information to a file or directory. Instead, NFS clients usually store the file's modification time and size along with the other cache details. At regular intervals (usually somewhere between 3 and 60 seconds), the client performs a so-called cache revalidation: it retrieves the current set of file attributes from the server and compares the stored values to the current ones. If they match, it assumes the file has not changed and the cached data is still valid. If there is a mismatch, all cached data is discarded, and dirty pages are flushed to the server.

Unfortunately, most file systems store time stamps with second granularity, so clients will fail to detect subsequent changes to a file if they happen within the same wall-clock second as their last revalidation. To compound the problem, NFS clients usually hold on to the data they have cached as long as they see fit. So once the cache is out of sync with the server, it will continue to show this invalid information until the data is evicted from the cache to make room, or until the file's modification time changes again and forces the client to invalidate its cache.

The only consistency guarantee made by NFS is called close-to-open consistency, which means that any changes made by you are flushed to the server on closing the file, and a cache revalidation occurs when you re-open it.

One can hardly fail to notice that there is a lot of handwaving in this sort of cache management. This model is adequate for environments where there is no concurrent read/write access by different clients to the same file, such as when exporting users' home directories, or a set of read-only data.

However, this fails badly when applications try to use NFS files concurrently, as some databases are known to do. This is simply not within the scope of the NFS standards, and while NFSv3 and NFSv4 do improve some aspects of cache consistency, these changes merely allow the client to cache more aggressively, but not necessarily more correctly. For instance, NFSv4 introduces the concept of delegations, which is basically a promise that the server will notify the client if some other host opens the file for writing. Provided the server is willing and able to issue a delegation to the client, this allows the client to cache all writes for as long as it holds that delegation. But after the server revokes it, everyone just falls back to the old NFSv3 behavior of mtime based cache revalidation.

There is no really good solution to this problem; all solutions so far either involve turning off caching to a fairly large degree, or extending the NFS protocol significantly.

Some documents recommend turning off caching entirely, by mounting the file system with the noac option, but this is really a desperate measure, because it kills performance completely.

Starting with the 2.6 kernel, the Linux NFS client supports O_DIRECT mode for file I/O, which turns off all read and write caching on a file descriptor. This is slightly better than using noac, as it still allows the caching of file attributes, but it means applications need to be modified and recompiled to use it. Its primary use is in the area of databases.
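Going back to the revalidation scheme described above, the client-side check boils down to something like the following sketch; the helper types and names are invented, and real clients track more attributes than just mtime and size.

    #include <stdbool.h>
    #include <time.h>

    /* What the client remembers about a cached file, in the simplest case. */
    struct cached_file {
        time_t mtime;   /* modification time seen at the last revalidation */
        long   size;    /* file size seen at the last revalidation */
    };

    /* Compare the cached attributes against fresh ones fetched via GETATTR.
     * Note the weakness discussed above: a change that lands within the same
     * wall-clock second and leaves the size untouched goes undetected. */
    bool cache_still_valid(const struct cached_file *cached,
                           time_t server_mtime, long server_size)
    {
        return cached->mtime == server_mtime && cached->size == server_size;
    }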

Another approach to force a file to show a consistent view across different clients is to use NFS file locking, because taking and releasing a lock acts as a cache synchronization point. In fact, in the Linux NFS client, the file unlock operation actually implies a cache invalidation—so this kind of synchronization is not exactly free of cost either.

Solutions involving changes to the NFS protocol include Spritely NFS and NQNFS; but these should probably be considered as mostly research. It is questionable whether this gap in the NFS design will ever be addressed, or whether this is left for others to solve, such as OCFS2, GFS or Lustre.

7 POSIX Conformance

People writing applications usually expect the file system to "just work," and will get slightly upset if their application behaves differently on NFS than it does on a local file system. Of course, everyone will have a slightly different idea of what "just works" really is, but the POSIX standard is a reasonable approximation.

NFS never claimed to be fully POSIX compliant, and given its rather liberal cache consistency guarantees, it never will be. But still, it attempts to conform to the standard as much as possible.

Some of the gymnastics NFS needs to go through in order to do so are just funny when you look at them. For instance, consider the utimes call, which can be used by an application to set a file's modification time stamp. On some kernels, the command cp -p would not preserve the time stamp when copying files to NFS. The reason is the NFS write cache, which usually does not get flushed until the file is closed. The way cp -p does its job is by creating the output file and writing all data first; then it calls utimes to set the modification time stamp, and then closes the file. Now close would see that there were still pending writes, and flush them out to the server, clobbering the file's mtime as a result. The only viable fix for this is to make sure the NFS client flushes all dirty pages before performing the utimes update—in other words, utimes acts like fsync.

Some other cases are a bit stranger. One such case is the ability to write to an open unlinked file. POSIX says an application can open a file for reading and writing, unlink it, and continue to do I/O on it. The file is not supposed to go away until the last application closes it.

This is difficult to do over NFS, since traditionally, the NFS server has no concept of "open" files (this was added in NFSv4, however). So when a client removes a file, it will be gone for good, and the file handle is no longer valid—any attempt to read from or write to that file will result in a "Stale file handle" error.

The way NFS traditionally kludges around this is by doing what has been dubbed a "silly rename." When the NFS client notices during an unlink call that one or more applications still hold an open file descriptor to this file, it will not send a REMOVE call to the server. Instead, it will rename the file to some temporary file name, usually .nfsXXX where XXX is some hex number. This file will stay around until the last application closes its open file descriptor, and only then will the NFS client send the final REMOVE call to the server that gets rid of this renamed file.
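In code, the client-side logic amounts to roughly the following sketch; the helper names are hypothetical and the real Linux client is of course more involved.

    #include <stdio.h>

    struct nfs_file;    /* opaque: the client's view of a file */

    /* hypothetical helpers */
    int  open_count(const struct nfs_file *f);   /* local open file descriptors */
    void nfs_remove(struct nfs_file *f);         /* send REMOVE to the server */
    void nfs_rename(struct nfs_file *f, const char *newname);
    void mark_sillyrenamed(struct nfs_file *f, const char *newname);

    /* Called when an application unlinks a file on an NFS mount. */
    void client_unlink(struct nfs_file *f)
    {
        static unsigned int counter;
        char tmpname[32];

        if (open_count(f) == 0) {
            nfs_remove(f);          /* nobody has it open: really delete it */
            return;
        }

        /* Somebody still has the file open: hide it under a temporary name
         * and postpone the REMOVE until the last descriptor is closed. */
        snprintf(tmpname, sizeof(tmpname), ".nfs%08x", counter++);
        nfs_rename(f, tmpname);
        mark_sillyrenamed(f, tmpname);
    }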

This sounds like a rather smart sleight of hand, and it is—up to a point. First off, this does not work across different clients. But that should not come as a surprise given the lack of cache consistency.

Things get outright weird though if you consider what happens when someone tries to unlink such a .nfsXXX file. The Linux client does not allow this, in order to maintain POSIX semantics as much as possible. The undesirable side effect of this is that a rm -rf call will fail to remove a directory if it contains a file that is currently open to some application.

But the weirdest part of POSIX conformance is probably the handling of access control lists, and as such it deserves a section of its own.

8 Access Control Lists

The POSIX.1e working group proposed a set of operating system primitives that were supposed to enhance the Unix security model. Their work was never finished, but they did create a legacy that kind of stuck—capabilities and access control lists (ACLs) being the prominent examples of their work.

Neither NFSv2 nor NFSv3 included support for ACLs in their design. When NFSv2 was designed, ACLs and mandatory access control were more or less an academic issue in the Unix world, so they were simply not part of the specification's scope.

When NFSv3 was designed, ACLs were already being used more or less widely, and acknowledging that fact, a new protocol operation named ACCESS was introduced, which lets the client query a user's permissions to perform a certain operation. This at least allows a client to make the correct access decisions in the presence of access control lists on the server.

However, people who use ACLs usually want to be able to view and modify them, too, without having to log on to the server machine. NFS protocol versions 2 and 3 do not provide any mechanisms for queries or updates of ACLs at all, so different vendors devised their own side-band protocols that added this functionality. These are usually implemented as additional RPC programs available on the same port as the NFS server itself. According to various sources, there were at least four different ACL protocols, all of them mutually incompatible. So an SGI NFS client could do ACLs when talking to an SGI NFS server, or a Solaris client could do the same when talking to a Solaris server.

Over the course of a few years, it seems the Solaris ACL protocol has become the prevalent standard, if just by virtue of eliminating most of the competition. The Linux ACL implementation adopted this protocol as well.

NFSv4 adds support for access control lists. But in its attempt to be a cross-platform distributed file system, it adopted not the POSIX ACL model, but invented its own ACLs, which are much closer to the Windows ACL model (which has richer semantics) than to the POSIX model. It is not entirely compatible with Windows ACLs either, though.

The result of this is that it is not really easy to do POSIX ACLs over NFSv4 either: there is a mapping of POSIX to NFSv4 ACLs, but it is not really one-to-one, and somewhat awkward. The other half of the problem is that the server cannot map NFSv4 ACLs back to POSIX ACLs, since they have much richer semantics. So it stores them in a different extended attribute, which is not evaluated by the VFS (which currently does POSIX ACLs only). As a consequence, NFSv4 ACLs will only be enforced when the file system is accessed via NFSv4 at the moment. When accessing it via NFSv3 or locally on the server machine, these ACLs are ignored.
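The shape of the two models hints at why the round trip is lossy. Roughly, and with the field names simplified rather than taken from any particular implementation:

    /* A POSIX ACL entry: a tag, an optional user or group ID, and the
     * familiar rwx permission bits. */
    struct posix_acl_entry_sketch {
        int          tag;        /* USER_OBJ, USER, GROUP_OBJ, GROUP, MASK, OTHER */
        unsigned int qualifier;  /* uid or gid, where applicable */
        unsigned int perms;      /* just three bits: read, write, execute */
    };

    /* An NFSv4 ACE: it can ALLOW or DENY, carries a much finer-grained
     * access mask (separate bits for things like appending, deleting
     * children, or changing the owner) plus inheritance flags, and names
     * principals as "user@domain" strings rather than numeric IDs.  DENY
     * entries and most of the extra mask bits have no POSIX counterpart. */
    struct nfs4_ace_sketch {
        unsigned int type;         /* ALLOW or DENY */
        unsigned int flags;        /* e.g. inheritance behavior */
        unsigned int access_mask;  /* fine-grained permission bits */
        const char  *who;          /* "alice@example.com", "OWNER@", ... */
    };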

The ironic part of the story is that Sun, which was one of the driving forces behind the NFSv4 standard, added an NFSv4 version to their ACL side band protocol which allows querying and updating of POSIX ACLs, without having to translate them to NFSv4 ACLs and back.

9 NFS Security

One of the commonly voiced complaints about NFS is the weak security model of the underlying RPC transport. And indeed, security has never been one of its strong points.

The default authentication mechanism in RPC is AUTH_SYS, also known as AUTH_UNIX because it basically conveys Unix style credentials, including user and group ID, and a list of supplementary groups the user is in. However, the server has no way to verify these credentials; it can either trust the client, or map all user and group IDs to some untrusted account (such as nobody).

Stronger security flavors for RPC have been around for a while, such as Sun's "Secure RPC," which was based on a Diffie-Hellman key management scheme and DES cryptography to validate a user's identity. Another security flavor that was used in some places relied on Kerberos 4 credentials. Both of them provided only a modicum of security however, as the credentials were not tied in any way to the packet payload, so that attackers could intercept a packet with valid credentials and massage the NFS request to do their own nefarious biddings. Moreover, the lack of high-resolution timers on average 1980s hardware meant that most clients would often generate several packets with identical time stamps, so the server had to accept these as legitimate—opening the door to replay attacks.

A few years ago, a new RPC authentication flavor based on GSSAPI was defined and standardized; it provides different levels of security, ranging from the old-style sort of authentication restricted to the RPC header, to integrity and/or privacy. And since GSSAPI is agnostic of the underlying security system, this authentication mechanism can be used to integrate NFS security with any security system that provides a GSSAPI binding.

The Linux implementation of RPCSEC_GSS was developed as part of the NFSv4 project. It currently supports Kerberos 5, but work is underway to extend it to SPKM-3 and LIPKEY.

It is worth noting that GSS authentication is not an exclusive feature of NFSv4; it can be enabled separately from NFSv4, and can be used with older versions of the protocol as well. On the other hand, there remains some doubt as to whether there is really such a huge demand for stronger NFS security, despite the vocal criticism. Secure RPC was not perfect, but it has been available for ages on many platforms, and unlike Kerberos it was rather straightforward to deploy. Still, there were not that many sites that seriously made use of it.
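For reference, the default AUTH_SYS credential that all of the above is meant to improve on carries roughly the following; this follows the structure given in the ONC RPC specification, and nothing in it is signed or otherwise verifiable.

    /* Roughly what travels in an AUTH_SYS (AUTH_UNIX) credential.  The
     * server has to take every field on faith; a client that claims uid 0
     * is, as far as this mechanism is concerned, root. */
    struct authsys_cred_sketch {
        unsigned int stamp;             /* arbitrary value chosen by the client */
        char         machinename[255];  /* the client's own idea of its hostname */
        unsigned int uid;
        unsigned int gid;
        unsigned int gids[16];          /* supplementary groups */
        unsigned int gid_count;
    };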

10 NFS File Locking

Another operation that was not in the scope of the original NFS specification is file locking. Nobody has put forth an explanation why that was so.

At some point, NFS engineers at Sun recognized that it would be very useful to be able to do distributed file locking, especially given the cache consistency semantics of the NFS protocol.

Subsequently, another side-band protocol called the Network Lock Manager (NLM for short) protocol was devised, which implements lock and unlock operations, as well as the ability to notify a client when a previously blocked lock could be granted. NLM requests are handled by the lockd service.

NLM has a number of shortcomings. Probably the most glaring one is that it was designed for POSIX locks only; BSD flock locks are not supported, since they have somewhat different semantics. It is possible to emulate these with NLM, but it is non-trivial, and so far only Linux seems to do this.

Another shortcoming is that most implementations do not bother with using any kind of RPC security with NLM requests, so that a lockd implementation has no choice but to accept unauthenticated requests, at least as long as it wants to interoperate with other operating systems.

Third, lockd does not only have to run on the server, it must be active on the client as well. That is because when a client is blocked on a lock request and the lock can later be granted, the server is supposed to send a callback to the client, so lockd must be listening there as well. This creates all kinds of headaches when doing NFS through firewalls.

File locking is inherently a stateful operation, which does not go well with the statelessness paradigm of the NFS protocol. In order to address this, mechanisms for lock reclaim were added to NLM—if an NFS server reboots, there is a so-called grace period during which clients can re-register all the locks they were holding with the server.

Obviously, in order to make this work, clients need to be notified when a server reboots. For this, yet another side-band protocol was designed, called the Network Status Monitor, or NSM. Calling it a status monitor is a bit of a misnomer, as this is purely a reboot notification service. NSM does not use any authentication either, and its specification is a bit vague on how to identify hosts—either by address, which creates issues with multi-homed hosts, or by name, which requires that all machines have proper host names configured, and proper entries in the DNS (which surprisingly often is not the case).

NFSv4 does a lot better in this area, by finally integrating file locking into the protocol, and not relying on RPC callbacks to handle blocked locks anymore. NFSv4 introduces a different kind of callback as part of the delegation process however, but at least those are optional and NFSv4 still works in the presence of firewalls.
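The reclaim dance can be sketched as follows; the helper names are invented, but the flow is the one described above: the NSM reboot notification triggers a re-registration of every lock during the server's grace period.

    struct held_lock {
        struct held_lock *next;
        const char       *path;    /* the file the lock covers */
        long              start;   /* byte range of the POSIX lock */
        long              length;
    };

    /* Hypothetical helper: send an NLM lock request with the reclaim flag
     * set; the server only honors reclaims while its grace period runs. */
    int nlm_reclaim_lock(const char *server, const struct held_lock *lock);

    /* Called when the NSM service reports that 'server' has rebooted. */
    void reclaim_locks(const char *server, struct held_lock *locks)
    {
        for (struct held_lock *l = locks; l != NULL; l = l->next)
            nlm_reclaim_lock(server, l);  /* must finish within the grace period */
    }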

11 AFS

AFS, the Andrew File System, was originally developed jointly by Carnegie Mellon University and IBM. It was probably never a huge success outside academia and research installations, despite the fact that the Open Group made it the basis of the distributed file system for DCE (and charged an arm and a leg for it). Late in its life cycle, it was released by IBM under an open source license, which managed to breathe a little life back into it.

AFS is a very featureful distributed file system. Among other things, it provides good security through the use of Kerberos 4, offers location independent naming, and supports migration and replication.

On the down side, it comes with its own server side storage file system, so that you cannot simply export your favorite journaling file system over AFS. Code portability, especially to 64bit platforms, and the sort of #ifdef accretion that can occur over the course of 20 years are also an issue.

12 CIFS

CIFS, the Common Internet File System, is what was colloquially referred to as SMBfs some time ago. Microsoft's distributed file system is session-based, and sticks closely to the file system semantics of Windows file systems. Samba, and the Linux smbfs and cifs clients, have demonstrated that it is possible for Unix platforms to interoperate with Windows machines using CIFS, but some things from the POSIX world remain hard to map to their Windows equivalents and vice versa, with Access Control Lists (ACLs) being the most notorious example.

CIFS provides some cache consistency through the use of op-locks. It is a stateful protocol, and crash recovery is usually the job of the application (we're probably all familiar with Abort/Retry/Ignore dialog boxes).

While CIFS was originally designed purely with Windows file system semantics in mind, it provides a protocol extension mechanism which can be used to implement support for some POSIX concepts that cannot be mapped onto the CIFS model. This mechanism has been used successfully by the Samba team to provide better Linux to Linux operation over CIFS.

The Linux 2.6 kernel comes with a new CIFS implementation that is well along the way to replacing the old smbfs code. As of this writing, the cifs client seems to have overcome most of its initial stability issues, and while it is still missing a few features, it looks very promising.

Without question, CIFS is the de-facto standard when it comes to interoperating with Windows machines. However, CIFS could be serious competition to NFS in the Linux world, too—the biggest obstacle in this arena is not a technical one, but the fact that it is controlled entirely by Microsoft, who like to spring the occasional surprise or two on the open source world.

13 Cluster Filesystems

Another important area of development in the world of distributed file systems is clustered file systems such as Lustre, GFS and OCFS2. OCFS2 in particular looks very interesting, as its kernel component is relatively small and seems to be well-designed.

Cluster file systems are currently no replacement for file systems such as NFS or CIFS, because they usually require a lot more in terms of infrastructure. Most of them do not scale very well beyond a few hundred nodes either.

14 Future NFS trends

The previous sections have probably made it abundantly clear that NFS is far from being the perfect distributed file system. Still, in the Linux-to-Linux networking world, it is currently the best we have, despite all its shortcomings.

It will be interesting to see if it will continue to play an important role in this area, or if it will be pushed aside by other distributed file systems.

Without doubt, NFSv4 will see wide-spread use in maybe a year from now. However, one should remain sceptical about whether it will actually meet its original goal of providing interoperability with the Windows world. Not because of any design shortcomings, but simply because CIFS is doing this already, and seems to be doing its job quite well. In the long term, it may be interesting to see if CIFS can take some bites out of the NFS pie. The Samba developers certainly think so.

There is also the question whether there is much incentive for end users to switch to NFSv4. In the operational area, semantics have not changed much; they mostly got more complex. If users get any benefits from NFSv4, it may not be from things like Windows interoperability (which may turn out to be more of a liability than a benefit). Instead, users would probably benefit a lot more from other new features of the protocol, such as support for replication and migration. It is worth noting, however, that while the NFSv4 RFC provides the hooks for informing clients about migration of a file system, it does not define the migration mechanisms themselves. Unfortunately, RFC 3010 does not talk about proxying, which would have been a real benefit.

The adoption of RPCSEC_GSS will definitely be a major benefit in terms of security. While GSS with Kerberos may not see wide deployment, simply because of the administrative overhead of running a Kerberos service, other GSS mechanisms such as LIPKEY may provide just the right trade-off between security and ease of use that makes them worthwhile for small to medium sized networks.

Other interesting areas of NFS development in Linux include the RPC transport switch, which allows the RPC layer to use transports other than UDP and TCP over IPv4. The primary goals in this area are NFS over IPv6, and using Infiniband/RDMA as a transport.

15 So how bad is it really?

This article claims to answer the question why NFS sucks. Hopefully, it has achieved this at least partly; but the question that remains is, how bad is it really, and how does NFSv4 help?

So indeed, a lot of the issues raised above are problems in NFSv2 and NFSv3, and have been addressed in NFSv4.

Still, several issues remain. The most prominent is the absence of real cache consistency. NFSv4 supports delegations, but these do not solve the problem; instead they allow the client to do more efficient caching if there are no conflicting accesses.

Another issue is NFSv4 ACLs, which are neither POSIX nor CIFS compatible, and therefore require either an elaborate and fragile mapping for Linux to take advantage of them, or a continued use of the nfsacl side band protocol. There is also no mechanism to enforce NFSv4 ACLs locally, or via NFSv3.

The third problem is the continued use of RPC. In theory, it should be possible to perform callbacks over an established TCP connection—callbacks are just another type of message. However, this is not the way RPC is modeled, and thus the server needs to establish a connection to a service port on the client. This creates problems with firewalls, and makes for unhappy security officers who would like to see as few open ports on client machines as possible.

Without RPC, NFS could possibly also handle the reply cache more efficiently and robustly. A better session protocol would be able to detect reliably whether a request is a retransmission; whether a client has rebooted and it is hence a good idea to discard all cached replies; and to identify clients by means other than their IP address and port number.
Proceedings of the
Linux Symposium

Volume Two

July 19th–22nd, 2006


Ottawa, Ontario
Canada
Conference Organizers
Andrew J. Hutton, Steamballoon, Inc.
C. Craig Ross, Linux Symposium

Review Committee
Jeff Garzik, Red Hat Software
Gerrit Huizenga, IBM
Dave Jones, Red Hat Software
Ben LaHaise, Intel Corporation
Matt Mackall, Selenic Consulting
Patrick Mochel, Intel Corporation
C. Craig Ross, Linux Symposium
Andrew Hutton, Steamballoon, Inc.

Proceedings Formatting Team


John W. Lockhart, Red Hat, Inc.
David M. Fellows, Fellows and Carr, Inc.
Kyle McMartin

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights
to all as a condition of submission.
