Distributed File Systems Design
Introduction
Presently, our most common exposure to distributed systems that exemplify some degree of transparency is through distributed file systems. We'd like remote files to look and feel just like local ones.

A file system is responsible for the organization, storage, retrieval, naming, sharing, and protection of files. File systems provide directory services, which convert a file name (possibly a hierarchical one) into an internal identifier (e.g. inode, FAT index). They contain a representation of the file data itself and methods for accessing it (read/write). The file system is responsible for controlling access to the data and for performing low-level operations such as buffering frequently used data and issuing disk I/O requests.

Our goals in designing a distributed file system are to present certain degrees of transparency to the user and the system:

access transparency: Clients are unaware that files are distributed and can access them in the same way as local files are accessed.
location transparency: A consistent name space exists encompassing local as well as remote files. The name of a file does not give its location.
concurrency transparency: All clients have the same view of the state of the file system. This means that if one process is modifying a file, any other processes on the same system or remote systems that are accessing the file will see the modifications in a coherent manner.
failure transparency: The client and client programs should operate correctly after a server failure.
heterogeneity: File service should be provided across different hardware and operating system platforms.
scalability: The file system should work well in small environments (1 machine, a dozen machines) and also scale gracefully to huge ones (hundreds through tens of thousands of systems).
replication transparency: To support scalability, we may wish to replicate files across multiple servers. Clients should be unaware of this.
migration transparency: Files should be able to move around without the client's knowledge.
support fine-grained distribution of data: To optimize performance, we may wish to locate individual objects near the processes that use them.
tolerance for network partitioning: The entire network or certain segments of it may be unavailable to a client during certain periods (e.g. disconnected operation of a laptop). The file system should be tolerant of this.

Rutgers University CS 417: Distributed Systems 2000-2002 Paul Krzyzanowski
Naming issues
In designing a distributed file service, we should consider whether all machines (and processes) should have the exact same view of the directory hierarchy. We might also wish to consider whether the name space on all machines should have a global root directory (a.k.a. super root) so that files can be accessed as, for example, //server/path. This is a model that was adopted by the Apollo Domain System, an early distributed file system, and more recently by the web community in the construction of a uniform resource locator (URL).

In considering our goals in name resolution, we must distinguish between location transparency and location independence. By location transparency we mean that the path name of a file gives no hint as to where the file is located. For instance, we may refer to a file as //server1/dir/file. The server (server1) can move anywhere without the client caring, so we have location transparency. However, if the file moves to server2, things will not work. If we have location independence, the files can be moved without their names changing. Hence, if machine or server names are embedded into path names, we do not achieve location independence.

It is desirable to have access transparency, so that applications and users can access remote files just as they access local files. To facilitate this, the remote file system name space should be syntactically consistent with the local name space. One way of accomplishing this is by redefining the way files are named and requiring an explicit syntax for identifying remote files. This can cause legacy applications to fail and can cause user discontent (users will have to learn a new way of naming their files). An alternate solution is to use a file system mounting mechanism to overlay portions of another file system over a node in a local directory structure.
Mounting is used in the local environment to construct a uniform name space from separate file systems (which reside on different disks or partitions) as well as to incorporate special-purpose file systems into the name space (e.g. /proc on many UNIX systems allows file system access to processes). A remote file system can be mounted at a particular point in the local directory tree. Attempts to access files and directories under that node will be directed to the driver for that file system.

To summarize, our naming options are:
- machine and path naming (machine:path, ./machine/path).
- mount remote file systems onto the local directory hierarchy (merging the two name spaces).
- provide a single name space which looks the same on all machines.
The first two of these options are relatively easy to implement.
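The mounting option can be sketched as a longest-prefix match of a path name against a mount table. This is only an illustration of the idea, not how any particular kernel does it; the table contents, the server names, and the resolve function are all hypothetical.

```python
# Hypothetical mount table: local mount point -> (server, exported path).
# A server of None means the subtree lives on a local disk.
MOUNTS = {
    "/": (None, "/"),
    "/home": ("fileserver1", "/export/home"),
    "/usr/local": ("fileserver2", "/export/local"),
}

def resolve(path):
    """Return (server, path on that server) using the longest matching mount point."""
    best = "/"
    for mount_point in MOUNTS:
        prefix = mount_point.rstrip("/") + "/"
        matches = path == mount_point or path.startswith(prefix)
        if matches and len(mount_point) > len(best):
            best = mount_point
    server, exported = MOUNTS[best]
    remainder = path[len(best):].lstrip("/")
    if not remainder:
        return server, exported
    return server, exported.rstrip("/") + "/" + remainder
```

With this table, /home/alice/notes.txt is directed to fileserver1 while /etc/passwd stays local, and the caller uses the same path syntax in both cases.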
Types of names
When we talk about file names, we refer to symbolic names (for example, server.c). These names are used by people (users or programmers) to refer to files. Another name is the identifier used by the system internally to refer to a file. We can think of this as a binary name (more precisely, as an address). On most UNIX file systems, this would be the device number and inode number. On MS-DOS systems, this would be the drive letter and FAT index. Directories provide a mapping from symbolic names to file addresses (binary names). Typically, one symbolic name maps to one file address. If multiple symbolic names map onto one binary name, these are called hard links. On inode-based file systems (e.g., most UNIX systems), hard links must exist within the same device since the address (inode) is unique only on that device. On MS-DOS systems, they are not supported because file attributes are stored with the name
of the file. Having two symbolic names refer to the same data would cause problems in synchronizing file attributes (how would you locate other files that point to this data?). A hack to allow multiple names to refer to the same file (whether it's on the same device or a different device) is to have the symbolic name refer to a single file address, but to give that file an attribute that tells the system that its contents are a symbolic file name that should be dereferenced. Essentially, this adds a level of indirection: access a file which contains another file name, which references the file attributes and data. These files are known as symbolic links. Finally, it is possible for one symbolic name to refer to multiple file addresses. This doesn't make much sense on a local system [1], but can be useful on a networked file system to provide fault tolerance or enable the system to use the file address which is most efficient.
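The difference between hard links and symbolic links can be sketched with a toy directory. The tables, addresses, and the open_by_name function below are hypothetical and exist only to show the extra level of indirection; the hop limit is an assumption added to stop on link cycles.

```python
# Hypothetical file table: file address (binary name) -> file record.
FILES = {
    1: {"is_symlink": False, "data": b"int main(void) { return 0; }"},
    2: {"is_symlink": True, "data": b"server.c"},  # contents name another file
}

# Directory: symbolic name -> file address.
# server.c and hard.c are hard links: two names for the same address.
DIRECTORY = {"server.c": 1, "hard.c": 1, "soft.c": 2}

def open_by_name(name, max_hops=8):
    """Follow symbolic links until a regular file is reached."""
    for _ in range(max_hops):
        record = FILES[DIRECTORY[name]]
        if not record["is_symlink"]:
            return record["data"]
        name = record["data"].decode()  # dereference: look up the stored name
    raise OSError("too many levels of symbolic links")
```

Here hard.c reaches the data in one directory lookup, while soft.c needs two: one to find the link file and one to find the name stored in it.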
[1] It really does make sense in a way. In the late 1980s, David Korn created a file system that allowed multiple directories to be mounted over the same directory node. The Plan 9 operating system later adopted this technique and called it union mounts. One thing it is useful for is getting rid of the PATH environment variable in searching for executables; the executables are always found in /bin. The name is resolved by searching through the file systems (directories) mounted on that node in a last-mounted, first-searched order.
[2] The Bullet server on the Amoeba operating system is an example of a system that uses immutable files.
transactions start at the same time, the system ensures that the end result is as if they were run in some sequential order. All changes have an all or nothing property.
In looking up the pathname of a file (e.g. via the namei function in the UNIX kernel), we may choose to evaluate a pathname a component at a time. For example, for a pathname aaa/bbb/ccc, we would perform a remote lookup of aaa, then another one of bbb, and finally one of ccc. Alternatively, we may pass the rest of the pathname to the remote machine as one lookup request once we find that a component is remote. The drawbacks of the latter scheme are that (a) the remote server may be asked to walk up the tree by processing .. (parent node) components and reveal more of its file system than it wants and (b) other components cannot be mounted underneath the remote tree on the local system. Because of this, component-at-a-time evaluation is generally favored, but it has performance problems (a lot more messages). To reduce the number of messages, we may choose to keep a local cache of component resolutions.
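Component-at-a-time evaluation with a local cache of resolutions can be sketched as follows. The remote directory tree, the handle names, and the LookupClient class are hypothetical; rpc_count simply stands in for network round trips.

```python
# Hypothetical directory tree held by a server:
# directory handle -> {component name: child handle}.
REMOTE_DIRS = {"root": {"aaa": "h1"}, "h1": {"bbb": "h2"}, "h2": {"ccc": "h3"}}

class LookupClient:
    def __init__(self):
        self.cache = {}     # (directory handle, name) -> child handle
        self.rpc_count = 0  # number of simulated remote lookups

    def remote_lookup(self, dir_handle, name):
        self.rpc_count += 1  # one message to the server per component
        return REMOTE_DIRS[dir_handle][name]

    def resolve(self, path, root="root"):
        handle = root
        for component in path.split("/"):
            key = (handle, component)
            if key not in self.cache:
                self.cache[key] = self.remote_lookup(handle, component)
            handle = self.cache[key]
        return handle
```

Resolving aaa/bbb/ccc costs three lookups the first time and none the second time. The sketch ignores how cached resolutions become stale, which a real client would have to handle.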
Should servers maintain state?
This issue is a topic of passionate debate. A stateless system is one in which the client sends a request to a server, the server carries it out, and returns the result. Between these requests, no client-specific information is stored on the server. A stateful system is one where information about client connections is maintained on the server.

In a stateless system:
- Each request must be complete: the file has to be fully identified and any offsets specified.
- Fault tolerance: if a server crashes and then recovers, no state was lost about client connections because there was no state to maintain.
- No remote open/close calls are needed (they only serve to establish state).
- No wasted server space per client.
- No limit on the number of open files on the server; they aren't open, since the server maintains no per-client state.
- No problems if the client crashes. The server does not have any state to clean up.

On a stateful system:
- Requests are shorter (less information to send).
- Better performance in processing the requests.
- Idempotency works; cache coherence is possible.
- File locking is possible; the server can keep state that a certain client is locking a file (or portion thereof).
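The contrast can be made concrete with two toy read interfaces. The file handle, the file contents, and both interfaces are hypothetical; the point is only where the offset lives and what a server crash destroys.

```python
FILES = {"fh-42": b"hello, distributed world"}  # hypothetical handle -> contents

def stateless_read(handle, offset, count):
    """Each request fully identifies the file and the offset."""
    return FILES[handle][offset:offset + count]

class StatefulServer:
    """Keeps an open-file table, so read requests can be shorter."""

    def __init__(self):
        self.open_files = {}  # descriptor -> [handle, current offset]
        self.next_fd = 0

    def open(self, handle):
        fd = self.next_fd
        self.next_fd += 1
        self.open_files[fd] = [handle, 0]
        return fd

    def read(self, fd, count):
        handle, offset = self.open_files[fd]
        data = FILES[handle][offset:offset + count]
        self.open_files[fd][1] = offset + len(data)  # server tracks the position
        return data

    def crash(self):
        self.open_files.clear()  # per-client state is lost on a restart
```

Repeating stateless_read with the same arguments always returns the same bytes, so a retry after a lost reply is harmless. After crash(), a descriptor handed out by the stateful server no longer means anything and the client must reopen the file.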
Caching
We can employ caching to improve system performance. There are four places in a distributed system where we can hold data:
1. on the server's disk
2. in a cache in the server's memory
3. in the client's memory
4. on the client's disk
The first two places are not an issue since any interface to the server can check the centralized cache. It is in the last two places that problems arise and we have to consider the issue of cache consistency. Several approaches may be taken:

write-through: What if another client reads its own cached copy? All accesses would require checking with the server first (adds network congestion) or require the server to maintain state on who has what files cached. Write-through also does not alleviate congestion on writes.
delayed writes: Data can be buffered locally (where consistency suffers) but files can be updated periodically. A single bulk write is far more efficient than lots of little writes every time any file contents are modified. Unfortunately the semantics become ambiguous.
write on close: This is admitting that the file system uses session semantics.
centralized control: Server keeps track of who has what open in which mode. We would have to support a stateful system and deal with signaling traffic.
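The trade-off between write-through and write on close can be sketched with a toy client cache. The Server and CachingClient classes and the policy strings are hypothetical names chosen for this illustration.

```python
class Server:
    def __init__(self):
        self.files = {}
        self.writes_received = 0  # stands in for network and server load

    def write(self, name, data):
        self.files[name] = data
        self.writes_received += 1

class CachingClient:
    def __init__(self, server, policy):
        self.server = server
        self.policy = policy  # "write-through" or "write-on-close"
        self.cache = {}
        self.dirty = set()

    def write(self, name, data):
        self.cache[name] = data
        if self.policy == "write-through":
            self.server.write(name, data)  # every write goes to the server
        else:
            self.dirty.add(name)           # only remembered locally for now

    def close(self, name):
        if name in self.dirty:
            self.server.write(name, self.cache[name])  # one bulk update
            self.dirty.discard(name)
```

Three writes under write-through cost three server messages. Under write on close they cost one, but other clients cannot see any of the changes until the file is closed, which is the session semantics mentioned above.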
devices? Since NFS had to support diskless workstations, where every file is remote, remote device files had to refer to the client's local devices. Otherwise there would be no way to access local devices in a diskless environment.
NFS protocols
The NFS client and server communicate over remote procedure calls (Sun's RPC) using two protocols: the mounting protocol and the directory and file access protocol. The mounting protocol is used to request access to an exported directory (and the files and directories within that file system under that directory). The directory and file access protocol is used for accessing the files and directories (e.g. read/write bytes, create files, etc.). The use of RPC's external data representation (XDR) allows NFS to communicate with heterogeneous machines.

The initial design of NFS ran only with remote procedure calls over UDP. This was done for two reasons. The first reason is that UDP is somewhat faster than TCP, although it does not provide error correction (the UDP header provides only a checksum of the data and headers). The second reason is that UDP does not require a connection to be present. This means that the server does not need to keep per-client connection state and there is no need to reestablish a connection if a server was rebooted.

The lack of UDP error correction is remedied by the fact that remote procedure calls have built-in retry logic. The client can specify the maximum number of retries (default is 5) and a timeout period. If a valid response is not received within the timeout period, the request is re-sent. To avoid server overload, the timeout period is then doubled. The retry continues until the limit has been reached. This same logic keeps NFS clients fault-tolerant in the presence of server failures: a client will keep retrying until the server responds.
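The retry logic described above can be sketched as a loop that doubles the timeout after each unanswered request. This is an illustration and not Sun's RPC code: call_with_retry and send_request are hypothetical names, the initial timeout of one second is an assumption, and the sketch assumes the retry limit counts re-sends after the first attempt.

```python
def call_with_retry(send_request, max_retries=5, timeout=1.0):
    """send_request(timeout) returns a reply, or None if the timeout expired.

    Returns the reply and the list of timeouts that were used.
    """
    attempts = []
    for _ in range(max_retries + 1):  # first attempt plus max_retries re-sends
        attempts.append(timeout)
        reply = send_request(timeout)
        if reply is not None:
            return reply, attempts
        timeout *= 2  # back off to avoid overloading a struggling server
    raise TimeoutError("no response from server")
```

A server that drops the first two requests and answers the third would be tried with timeouts of 1, 2, and 4 seconds.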
mounting protocol
The client sends the pathname to the server and requests permission to access the contents of that directory. If the name is valid and exported (listed in /etc/dfs/sharetab on System V release 4 versions of UNIX, and /etc/exports on many other versions), the server returns a file handle to the client. This file handle contains all the information needed to identify the file on the server: {file system type, disk ID, inode number, security info}.

Mounting an NFS file system is accomplished by parsing the path name, contacting the remote machine for a file handle, and creating an in-core vnode at the mount point. A vnode points to an inode for a local UNIX file or, in the case of NFS, an rnode. The rnode contains specific information about the state of the file from the point of view of the client.

Two forms of mounting are supported:

static: In this case, file systems are mounted with the mount command (generally during system boot).
automounting: One problem with static mounting is that if a client has a lot of remote resources mounted, boot time can be excessive, particularly if any of the remote systems are not responding and the client keeps retrying. Another problem is that each machine has to maintain its own name space. If an administrator wants all machines to have the same name space, this can be an administrative headache. To combat these problems the automounter was introduced.

The automounter allows mounts and unmounts to be performed in response to client requests. A set of remote directories is associated with a local directory. None are mounted initially. The first time any of these is referenced, the operating system sends a message to each of the servers. The first reply wins and that file system gets mounted (it is up to the administrator to ensure that all file systems are the same). To configure this, the automounter relies on mapping files that provide a mapping of client pathname to the server file system. These maps can be shared to facilitate providing a uniform naming space to a number of clients.
directory and file access protocol
Clients send RPC messages to the server to manipulate files and directories. A file is accessed by performing a lookup remote procedure call. This returns a file handle and attributes. It is not like an open in that no information is stored in any system tables on the server. After that, the handle may be passed as a parameter for other functions. For example, a read(handle, offset, count) function will read count bytes from location offset in the file referred to by handle. The entire directory and file access protocol is encapsulated in sixteen functions [3]. These are:

null       no-operation, but ensure that connectivity exists
lookup     look up the file name in a directory
create     create a file or a symbolic link
remove     remove a file from a directory
rename     rename a file or directory
read       read bytes from a file
write      write bytes to a file
link       create a link to a file
symlink    create a symbolic link to a file
readlink   read the data in a symbolic link (do not follow the link)
mkdir      create a directory
rmdir      remove a directory
readdir    read from a directory
getattr    get attributes about a file or directory (type, access and modify times, and access permissions)
setattr    set file attributes
statfs     get information about the remote file system

Accessing files
Files are accessed through conventional system calls (thus providing access transparency). If you recall conventional UNIX systems, a hierarchical pathname is dereferenced to the file location with a kernel function called namei. This function maintains a reference to a current directory, looks at one component and finds it in the directory, changes the reference to that directory, and continues until the entire path is resolved. At each point in traversing this pathname, it checks to see whether the component is a mount point, meaning that name resolution should continue on another file system. In the case of NFS, it continues with remote procedure calls to the server hosting that file system.

Upon realizing that the rest of the pathname is remote, namei will continue to parse one component of the pathname at a time to ensure that references to .. and to symbolic links become local if necessary. Each component is retrieved via a remote procedure call which performs an NFS lookup. This procedure returns a file handle. An in-core rnode is created and the VFS layer in the file system creates a vnode to point to it.

The application can now issue read and write system calls. The file descriptor in the user's process will reference the in-core vnode at the VFS layer, which in turn will reference the in-core rnode at the NFS level, which contains NFS-specific information such as the file handle. At the NFS level, NFS read, write, etc. operations may now be performed, passing the file handle and local state (such as file offset) as parameters. No information is maintained on the server between requests; it is a stateless system.

The RPC requests have the user ID and group ID number sent with them. This is a security hole that may be stopped by turning on RPC encryption.

[3] These functions are present in versions 2 and 3 of the NFS protocol. Version 3 added six more functions.
Performance
NFS performance was generally found to be slower than accessing local files because of the network overhead. To improve performance, reduce network congestion, and reduce server load, file data is cached at the client. Entire pathnames are also cached at the client to improve performance for directory lookups.

server caching: Server caching is automatic at the server in that the same buffer cache is used as for all other files on the server. The difference for NFS-related writes is that they are all write-through to avoid unexpected data loss if the server dies.
client caching: The goal of client caching is to reduce the number of remote operations. Three forms of information are cached at the client: file data, file attribute information, and pathname bindings. We cache the results of read, readlink, getattr, lookup, and readdir operations. The danger with caching is that inconsistencies may arise. NFS tries to avoid inconsistencies (and/or increase performance) with:

validation: If caching one or more blocks of a file, save a time stamp. When a file is opened or if the server is contacted for a new data block, compare the last modification time. If the remote modification time is more recent, invalidate the cache.

Validation is performed every three seconds on open files. Cached data blocks are assumed to be valid for three seconds. Cached directory blocks are assumed to be valid for thirty seconds. Whenever a page is modified, it is marked dirty and scheduled to be written (asynchronously). The page is flushed when the file is closed.
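The validation rule can be sketched as a cache that trusts an entry for three seconds and then compares modification times with the server. The ValidatingCache class is hypothetical, the server is modeled as a plain dictionary, and the current time is passed in explicitly so the behavior is easy to follow; only the three-second figure comes from the text.

```python
class ValidatingCache:
    DATA_TTL = 3.0  # seconds a cached data block is assumed valid

    def __init__(self, server):
        self.server = server  # hypothetical dict: name -> (data, modification time)
        self.entries = {}     # name -> [data, mtime, time of last validation]

    def read(self, name, now):
        entry = self.entries.get(name)
        if entry and now - entry[2] < self.DATA_TTL:
            return entry[0]  # trusted without contacting the server
        data, mtime = self.server[name]  # stands in for getattr, then read
        if entry and mtime <= entry[1]:
            entry[2] = now   # still current: restart the validity timer
            return entry[0]
        self.entries[name] = [data, mtime, now]  # newer on server: replace copy
        return data
```

If another client changes the file one second after it was cached, a read at that moment still returns the old data, and only a read after the three-second window sees the new contents. This window is the source of the consistency problems discussed later.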
Transfers of data are done in large chunks; the default is 8K bytes. As soon as a chunk is received, the client immediately requests the next 8K-byte chunk. This is known as read-ahead. The assumption is that most file accesses are sequential and we might as well fetch the next block of data while we're working on our current block, anticipating that we'll likely need it. This way, by the time we do, it will either be there or we don't have to wait too long for it since it's on its way.
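Read-ahead can be sketched as a client that fetches block n+1 whenever block n is requested. The ReadAheadClient class and fetch_block callback are hypothetical, and the prefetch here is synchronous for simplicity, whereas a real client would issue it in the background.

```python
BLOCK_SIZE = 8192  # the 8K-byte default transfer size from the text

class ReadAheadClient:
    def __init__(self, fetch_block):
        self.fetch_block = fetch_block  # fetch_block(n) -> bytes of block n
        self.cache = {}
        self.fetches = []  # order in which blocks were requested from the server

    def _fetch(self, n):
        if n not in self.cache:
            self.cache[n] = self.fetch_block(n)
            self.fetches.append(n)

    def read_block(self, n):
        self._fetch(n)
        self._fetch(n + 1)  # anticipate a sequential access pattern
        return self.cache[n]
```

Reading block 0 causes blocks 0 and 1 to be fetched, so a following read of block 1 is served from the cache and only triggers the fetch of block 2.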
Problems
The biggest problem with NFS is file consistency. The caching and validation policies do not guarantee session semantics.

NFS assumes that clocks between machines are synchronized and performs no clock synchronization between client and server. One place where this hurts is in distributed software development environments. A program such as make, which compares times of files (such as object and source) to determine whether to regenerate them, can either fail or give confusing results.

Because of its stateless design, open with append mode cannot be guaranteed to work. You can open a file, get the attributes (size), and then write at that offset, but you'll have no assurance that somebody else did not write to that location after you received the attributes. In that case your write will overwrite the other one since it will go to the old end-of-file byte offset.

Also because of its stateless nature, file locking cannot work. File locking implies that the server keeps track of which processes have locks on the file. Sun's solution to this was to provide a separate process (a lock manager) that does keep state.

One common programming practice under UNIX file systems for manipulating temporary data in files is to open a temporary file and then remove it from the directory. The name is gone, but the data persists because you still have the file open. Under NFS, the server maintains no state about remotely opened files and removing a file will cause the file to disappear. Since legacy applications depended on this, Sun's solution was to create a special hack for UNIX: if the same process that has a file open attempts to delete it, it is instead moved to a temporary name and deleted on close. It's not a perfect solution, but it works well.

Permission bits might change on the server and disallow future access to a file. Since NFS is stateless, it has to check access permissions each time it receives an NFS request. With local file systems, once access is granted initially, a process can continue accessing the file even if permissions change.

By default, no data is encrypted and UNIX-style authentication is used (user ID, group ID). NFS supports two additional forms of authentication: Diffie-Hellman and Kerberos. However, data is never encrypted and user-level software should be used to encrypt files if this is necessary.
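The append problem described above can be reproduced with a toy stateless server. AppendTarget is a hypothetical stand-in that exposes only a size query and a write at an explicit offset, which is all a stateless client has to work with.

```python
class AppendTarget:
    """A file on a server that keeps no record of who has it open."""

    def __init__(self):
        self.data = bytearray()

    def getattr_size(self):
        return len(self.data)

    def write(self, offset, payload):
        end = offset + len(payload)
        if len(self.data) < end:
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload

server = AppendTarget()
size_seen_by_a = server.getattr_size()  # client A learns the end of file: 0
size_seen_by_b = server.getattr_size()  # client B learns the same offset
server.write(size_seen_by_a, b"AAAA")   # A "appends"
server.write(size_seen_by_b, b"BBBB")   # B "appends" at the stale offset
```

Both clients believed they were appending, but the file ends up holding only B's four bytes because A's data was overwritten.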
More fixes
The original version of NFS was released in 1985, with version 2 released around 1988. In 1992, NFS was enhanced to version 3 (SunOS 5.5). Several changes were added to enhance the performance of the system:

1. NFS was enhanced to support TCP. UDP caused more problems over wide-area networks than it did over LANs because of errors. To combat that and to support larger data transfer sizes, NFS was modified to support TCP as well as UDP. To minimize connection setup, all traffic can be multiplexed over one TCP connection.
2. NFS always relied on the system buffer cache for caching file data. The buffer cache is often not very large and useful data was getting flushed because of the size of the cache. Sun introduced a caching file system, CacheFS, that provides more caching capability by using the disk. Memory is still used as before, but a local disk is used if more cache space is needed. Data can be cached in chunks as large as 64K bytes and entire directories can be cached.
3. NFS was modified to support asynchronous writes. Previously, if a client needed to send several write requests to a server, it would send them one after another, and the server would respond to a request only after the data was flushed to the disk. Now multiple writes can be collected and sent as an aggregate request to the server. The server does not have to ensure that the data is on stable storage (disk) until it receives a commit request from the client.
4. File attributes are returned with each remote procedure call now. The overhead is slight and saves clients from having to request file attributes separately (which was a common operation).
5. Version 3 allows 64-bit rather than the old 32-bit file offsets (supporting file sizes over 18 million terabytes).
6. An enhanced lock manager was added to provide monitored locks. A status monitor monitors hosts with locks and informs a lock manager of a system crash. If a server crashes, the status monitor reinstates locks on recovery. If a client crashes, all locks from that client are freed on the server.
7. A few more NFS functions were added:

access        check access permissions for a server
mknod         create a special device file on a server
readdirplus   extended read from a directory
fsinfo        get file system state information (static, as opposed to fsstat, which returns dynamic information)
commit        commit the cached data on the server to stable storage (e.g. disk)
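The asynchronous write and commit behavior can be sketched with a server that buffers writes in memory until told to commit. The V3Server class and its method names are hypothetical and do not reproduce the actual protocol messages. The last line only checks the arithmetic behind the 64-bit offset figure.

```python
class V3Server:
    def __init__(self):
        self.buffered = {}  # offset -> data held only in server memory
        self.stable = {}    # offset -> data on stable storage

    def write_async(self, offset, data):
        self.buffered[offset] = data  # reply at once; data is not yet durable
        return "ok"

    def commit(self):
        self.stable.update(self.buffered)  # flush everything to stable storage
        self.buffered.clear()
        return "committed"

    def crash(self):
        self.buffered.clear()  # uncommitted data is lost

# 2**64 bytes expressed in terabytes of 10**12 bytes: about 18 million.
MAX_FILE_TERABYTES = 2**64 // 10**12
```

A client that sends several write_async requests followed by one commit pays for a single disk flush. If the server crashes before the commit, the client still holds the data and has to send the writes again.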
the file is really a device and contains numbers to identify the kernel drivers. If one is accessing a remote file system and accesses a remote device file, should the access be to a local or to a remote device? AT&T's response was that it would be useful if access to a device on a remote file system would access the remote device. After all, you can always access your local devices through a local file system. Accessing remote devices allows for easy sharing of devices such as backup tape drives and printers.

While Sun may have thought that accessing remote devices would be convenient, it had more important things to think about, namely its hardware. Sun's initial claim to fame was the development of the diskless workstation. A diskless workstation loads the operating system from some server and, upon booting, mounts all its file systems as remote (NFS) file systems since it does not have a local disk. In this environment it is important that local hardware can be accessed (the keyboard, mouse, display, etc.). The only way this could be done is by making a rule that remote device files really refer to local devices. While this solves the diskless problem, it does not allow for sharing devices through the file system. It also has the problem that if a client's kernel device numbers for a particular device differ from the server's, the wrong device may be accessed.

RFS or NFS?
Which is better? The answer is that it depends. If support for heterogeneous operating systems is desired, NFS is the winner. On the other hand, if support for UNIX file system semantics is important (cache coherency, file locking that works) then RFS wins. If complete support of every feature in a UNIX System V file system is required, RFS provides it; NFS caters to the lowest common denominator of file system operations. If support of remote devices is required, RFS can provide it while NFS will attempt to access the corresponding local device.
RFS, because of its connection-oriented nature, is sensitive to server crashes. If an application has a remote file open and the server crashes, the mounted remote file system is unmounted and any open or closed files are immediately inaccessible. With NFS, the NFS client simply keeps issuing NFS RPC requests until the server comes up again. Looking at the marketplace, NFS has completely overshadowed RFS, primarily because of its easy portability, the popularity of Suns in the late 1980s over machines running UNIX System V, and the fault tolerance of NFS.
Implementation
The client's machine has one disk partition devoted to the AFS cache (for example, 100M bytes, or whatever the client can spare). The client software manages this cache in an LRU (least recently used) manner and the clients communicate with a set of trusted servers. Each server presents a location-transparent UNIX (hierarchical) file name space to its clients. On the server, each physical disk partition contains files and directories that can be grouped into one or more volumes. A volume is nothing more than an administrative unit of organization (e.g., a user's home directory, a local source tree). Each volume has a directory structure (a rooted hierarchy of files and directories) and is given a name and ID.

Servers are grouped into administrative entities called cells. A cell is a collection of servers, administrators, clients, and users. Each cell is autonomous but cells may cooperate and present users with one uniform name space. The goal is that every client will see the same name space (by convention, under a directory /afs). Listing the directory /afs shows the participating cells (e.g., /afs/mit.edu).

Each file and directory is identified by three 32-bit numbers:

volume ID: This identifies the volume to which the object belongs. The client caches the binding between volume ID and server, but the server is responsible for maintaining the bindings.
vnode ID: This is the handle (vnode number) that refers to the file on a particular server and disk partition (volume).
uniquifier: This is a unique number to ensure that the same vnode IDs are not reused.

Each server maintains a copy of a database that maps a volume number to its server. If the client request is incorrect (because a volume moved to a different server), the server forwards the request. This provides AFS with migration transparency: volumes may be moved between servers without disrupting access. Communication in AFS is with RPCs via UDP.
Access control lists are used for protection; UNIX file permissions are ignored. The granularity of access control is directory based; the access rights apply to all files in the directory. Users may be members of groups and access rights specified for a group. Kerberos is used for authentication.
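The three-number file identifier and the cached volume-to-server binding described in this section can be sketched as follows. The Fid tuple, the big-endian packing, the VolumeLocator class, and the server names are assumptions made for illustration. The volume_moved hook is hypothetical, and how a client actually learns that a cached binding is stale is left out.

```python
import struct
from collections import namedtuple

# Three 32-bit numbers identify a file or directory.
Fid = namedtuple("Fid", ["volume_id", "vnode_id", "uniquifier"])

def pack_fid(fid):
    """Encode the identifier as 12 bytes (the byte order is an assumption)."""
    return struct.pack(">III", fid.volume_id, fid.vnode_id, fid.uniquifier)

class VolumeLocator:
    def __init__(self, location_db):
        self.location_db = location_db  # authoritative map: volume ID -> server
        self.cached = {}                # client-side copy of bindings seen so far

    def server_for(self, fid):
        if fid.volume_id not in self.cached:
            self.cached[fid.volume_id] = self.location_db[fid.volume_id]
        return self.cached[fid.volume_id]

    def volume_moved(self, volume_id):
        self.cached.pop(volume_id, None)  # forget the stale binding
```

Because the file name never contains a server name, moving volume 7 from one server to another changes only the location database and the cached binding, not the identifier the client uses.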
Cache coherence
The server copies a file to the client and provides a callback promise: it will notify the client when any other process modifies the file. When a server gets an update from a client, it notifies all the clients by sending a callback (via RPC). Each client that receives the callback then invalidates the cached file. If a client that had a file cached was down, on restart it contacts the server with the timestamps of each cached file to decide whether to invalidate the file. Note that if a process has a file open, it can continue using it, even if it has been invalidated in the cache. Upon close, the contents will still be propagated to the server. There is no further mechanism for coherency. AFS abides by session semantics. Under AFS, read-only files may be replicated on multiple servers.
Whole file caching isn't feasible for very large files, so AFS caches files in 64K byte chunks (by default) and directories in their entirety. File modifications are propagated only on close. Directory modifications are propagated immediately. AFS does not support byte-range file locking. Advisory file locking (query to see whether a file has a lock on it) is supported.
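The callback mechanism can be sketched with a server that remembers which clients hold a cached copy and notifies them when a new version is stored. The AfsServer and AfsClient classes below are hypothetical and cache whole files rather than 64K-byte chunks; treating the writer as keeping its promise after a store is also an assumption of the sketch.

```python
class AfsServer:
    def __init__(self):
        self.files = {}
        self.callbacks = {}  # file name -> set of clients holding a promise

    def fetch(self, client, name):
        self.callbacks.setdefault(name, set()).add(client)  # callback promise
        return self.files[name]

    def store(self, writer, name, data):
        self.files[name] = data
        for client in self.callbacks.get(name, set()) - {writer}:
            client.break_callback(name)  # tell other holders their copy is stale
        self.callbacks[name] = {writer}

class AfsClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def open(self, name):
        if name not in self.cache:  # a cached copy is used without asking
            self.cache[name] = self.server.fetch(self, name)
        return self.cache[name]

    def close_after_write(self, name, data):
        self.cache[name] = data
        self.server.store(self, name, data)  # changes propagate on close

    def break_callback(self, name):
        self.cache.pop(name, None)
```

A client that opens a cached file does not contact the server at all, which is where the reduced server load comes from. The server, in exchange, has to keep state about who has what cached.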
AFS Summary
AFS demonstrates that whole file (or large chunk) caching offers dramatically reduced loads on servers, creating an environment that scales well. The AFS file system provides a uniform name space from all workstations, unlike NFS, where clients mount each NFS file system at a client-specific location (the name space is uniform only under the /afs directory, however). Establishing the same view of the file name space from each client is easier than with NFS. This enables users to move to different workstations and see the same view of the file system. Access permission is handled through control lists per directory, but there is no per-file access control. Workstation/user authentication is performed via the Kerberos authentication protocol using a trusted third party (more on this in the security section). A limited form of replication is supported. Replicating read-only (and read-mostly at your own risk) files can alleviate some performance bottlenecks for commonly accessed files (e.g. password files, system binaries).
Coda
Coda is a descendant of AFS, also created at CMU (c. 1990-1992). Its goals are:
- Provide better support for replication of file volumes than offered by AFS. AFS's limited form of replication (read-only volumes) will be a limiting factor in scaling the system. We would like to support widely shared read/write files, such as those found in bulletin-board systems.
- Provide constant data availability in disconnected environments through hoarding (user-directed caching). This requires logging updates on the client and reintegration when the client is reconnected to the network. Such a scheme will support the mobility of PCs.
- Improve fault tolerance. Failed servers and network problems shouldn't seriously inconvenience users.
To achieve these goals, AFS was modified in two substantial ways:
1. File volumes can be replicated to achieve higher throughput of file access operations and improve fault tolerance.
2. The caching mechanism was extended to enable disconnected clients to operate.
Volumes can be replicated to a group of servers. The set of servers that can host a particular volume is the volume storage group (VSG) for that volume. In identifying files and directories, a client no longer uses a volume ID as AFS did, but instead uses a replicated volume ID. The client performs a one-time lookup to map the replicated volume ID to a list of servers and local volume IDs. This list is cached for efficiency. Read operations can take place from any of these servers to distribute the load. A write operation has to be multicast to all available servers. Since some servers
Rutgers University CS 417: Distributed Systems 2000-2002 Paul Krzyzanowski
may be inaccessible at a particular point in time, a client may be able to access only a subset of the VSG. This subset is known as the Available Volume Storage Group, or AVSG. Since some volume servers may be inaccessible, special treatment is needed to ensure that clients do not read obsolete data. Each file copy has a version stamp. Before fetching a file, a client requests version stamps for that file from all available servers. If some servers are found to have old versions, the client initiates a resolution process which tries to automatically resolve differences (administrative intervention may be required if the process finds problems that it cannot fix). Resolution is only initiated by the client. The process is handled entirely by the servers.
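The interplay between the VSG, the AVSG, and version stamps can be sketched as follows. This is a simplified model, not Coda's implementation: all names are hypothetical, a single integer stands in for the version stamp, and "resolution" here just copies the newest copy onto stale replicas, whereas the real process runs on the servers and may need administrative help.

```python
class Replica:
    """One server in the volume storage group (VSG)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}   # path -> (version_stamp, data)

    def stamp(self, path):
        # Version stamp of this replica's copy, or 0 if it has none.
        return self.store.get(path, (0, b""))[0]


def avsg(vsg):
    """The available subset of the VSG."""
    return [r for r in vsg if r.up]


def write(vsg, path, data):
    # A write goes to every server that is currently reachable.
    targets = avsg(vsg)
    if not targets:
        raise ConnectionError("AVSG is empty")
    stamp = max(r.stamp(path) for r in targets) + 1
    for r in targets:
        r.store[path] = (stamp, data)
    return stamp


def read(vsg, path):
    # Compare version stamps across the AVSG before fetching.
    targets = avsg(vsg)
    if not targets:
        raise ConnectionError("AVSG is empty")
    latest = max(r.stamp(path) for r in targets)
    if latest == 0:
        raise FileNotFoundError(path)
    source = next(r for r in targets if r.stamp(path) == latest)
    stale = [r for r in targets if r.stamp(path) < latest]
    # Simplified resolution: bring stale replicas up to date.
    for r in stale:
        r.store[path] = source.store[path]
    return source.store[path][1], [r.name for r in stale]
```

`read` returns the data together with the names of the replicas that were found to be stale, which makes the effect of a server missing a write easy to observe.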
Disconnected operation
If a client's AVSG is empty, then the client is operating in a disconnected operation mode. If a file is not cached locally and is needed, nothing can be done: the system simply retries access and fails. For writes, however, the client does not report a failure of an update. Instead, the client logs the update locally in a Client Modification Log (CML). The user is oblivious to this. On reconnection, a process of reintegration with the server(s) commences to bring the server up to date. The CML is played back (the log playback is optimized so that only the latest changes are sent). The system tries to resolve conflicts automatically. This is not always possible (for example, someone may have modified the same parts of the file on a server while our client was disconnected). In cases where conflicts arise, user intervention is required to reconcile the differences. To further support disconnected operation, it is desirable to cache all the files that will be needed for work to proceed when disconnected and keep them up to date even if they are not being actively used. To do this, Coda supports a hoard database that contains a list of these important files. The hoard database is constructed both by monitoring a user's file activity and by allowing a user to explicitly specify files and directories that should be present on the client. The client frequently asks the server to send updates if necessary (that is, when it receives a callback).
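A minimal sketch of log optimization and reintegration follows. It is illustrative only: the function names, the log format of `(path, data)` pairs, and the whole-file version check used to detect conflicts are all assumptions, and the check is coarser than the "same parts of the file" comparison the text mentions.

```python
def optimize(cml):
    """Keep only the most recent logged store for each path, in log order."""
    latest = {}
    for index, (path, data) in enumerate(cml):
        latest[path] = (index, data)
    ordered = sorted(latest.items(), key=lambda item: item[1][0])
    return [(path, data) for path, (index, data) in ordered]


def reintegrate(cml, base_versions, server):
    """Replay the optimized log against the server.

    cml            list of (path, data) updates logged while disconnected
    base_versions  path -> server version the client had cached
    server         path -> (version, data), updated in place
    Returns the paths that need manual reconciliation.
    """
    conflicts = []
    for path, data in optimize(cml):
        server_version = server.get(path, (0, b""))[0]
        if server_version != base_versions.get(path, 0):
            # The server copy changed while we were disconnected.
            conflicts.append(path)
            continue
        server[path] = (server_version + 1, data)
    return conflicts
```

Updates that apply cleanly bump the server's version, while conflicting paths are left untouched on the server and reported back so a user can reconcile them.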
server grants and revokes tokens. It will grant any number of read tokens to clients but as soon as one client requests write access, the server will revoke all outstanding read and write tokens and issue a single write token to the requestor. This token scheme makes long-term caching possible (which it is not under NFS). Caching is in units of chunks that range from 8K to 256K bytes. Caching is both in client memory and on the disk. DFS also employs read-ahead (similar to NFS) to attempt to bring additional chunks of the file to the client before they are needed. DFS is integrated with DCE security services. File protection is via access control lists (ACLs) and all communication between client and server is via authenticated remote procedure calls.
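The grant-and-revoke rule for tokens can be sketched as a small state machine for a single file. The class and its `revoked` log are hypothetical, and the handling of a read request that arrives while a write token is outstanding (the write token is revoked first) is an assumption, since the text above does not spell that case out.

```python
class TokenServer:
    """Toy token manager for one file."""

    def __init__(self):
        self.readers = set()   # clients holding read tokens
        self.writer = None     # client holding the write token, if any
        self.revoked = []      # log of (client, kind) revocations

    def request_read(self, client):
        # Assumption: an outstanding write token is revoked before
        # a read token is handed to another client.
        if self.writer is not None and self.writer != client:
            self.revoked.append((self.writer, "write"))
            self.writer = None
        self.readers.add(client)

    def request_write(self, client):
        # Revoke every outstanding read token held by other clients...
        for reader in sorted(self.readers - {client}):
            self.revoked.append((reader, "read"))
        self.readers.clear()
        # ...and any write token held by someone else.
        if self.writer is not None and self.writer != client:
            self.revoked.append((self.writer, "write"))
        # Issue the single write token to the requestor.
        self.writer = client
```

Because a client keeps its token until the server revokes it, cached data can be trusted for as long as the token is held, which is what makes long-term caching safe.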
presented for all future accesses. Microsoft, Compaq, SCO, and a number of other companies are currently developing a public version of the SMB protocol, called CIFS (Common Internet File System).
level II oplock
Allows multiple clients to have the same file open as long as none are writing to the file. It tells the client that there are multiple concurrent clients, none of whom have modified the file (read access). Local caching of reads as well as read-ahead are allowed. All other operations must be sent directly to the server.
batch oplock
Allows the client to keep the file open on the server even if a local process that was using it has closed the file. A client requests a batch oplock if it expects that programs may behave in a way that generates a lot of traffic (accessing the same file over and over). This oplock tells the client that it is the only one with the file open. All operations may be done on cached data and data may be cached indefinitely.
no oplocks
Tells the client that other clients may be writing data to the file: all requests other than reads must be sent to the server. Read operations may work from a local cache only if the byte range was locked by the client.
The server has the right to asynchronously send a message to the client changing the oplock. For example, a client may be granted an exclusive oplock initially since nobody else was accessing the file. Later on, when another client opened the same file for read, the oplock was changed to a level II oplock. When another client opened the file for writing, both of the earlier clients were sent a message revoking their oplocks.
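The sequence in this example can be traced with a short sketch. It models only the transitions described here (exclusive, then level II, then no oplock); the class name, the `messages` log standing in for the server's asynchronous notifications, and the omission of batch oplocks and of closing files are simplifications made for illustration.

```python
class OplockFile:
    """Toy server-side oplock state for one file."""

    def __init__(self):
        self.oplocks = {}     # client -> "exclusive" | "level II" | "none"
        self.writers = set()  # clients that opened the file for writing
        self.messages = []    # (client, new_level) notifications sent

    def open(self, client, writing=False):
        if writing:
            self.writers.add(client)
        others = [c for c in self.oplocks if c != client]
        if not others:
            level = "exclusive"   # nobody else has the file open
        elif self.writers:
            level = "none"        # someone may be writing
        else:
            level = "level II"    # several readers, no writers
        # Tell every other client whose oplock level has changed.
        for other in others:
            if self.oplocks[other] != level:
                self.oplocks[other] = level
                self.messages.append((other, level))
        self.oplocks[client] = level
        return level
```

Opening the file from clients `a` and `b` for reading and then from `c` for writing reproduces the example: `a` is first moved from exclusive to level II, and then both `a` and `b` are told that their oplocks are revoked.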