16: Distributed file systems
Issues

• What is the basic abstraction?
  – remote file system?
    • open, close, read, write, …
  – remote disk?
    • read block, write block
• Naming
• Caching
  – caching exists for performance reasons
  – where are file blocks cached?
    • on the file server?
    • on the client machine?
    • both?
• Replication
  – replication can exist for performance and/or availability
  – can there be multiple copies of a file in the network?
  – if multiple copies, how are updates handled?
  – what if there’s a network partition and clients work on separate copies?
• Performance
  – what is the cost of a remote operation?
  – what is the cost of file sharing?
  – how does the system scale as the number of clients grows?
  – what are the performance limitations: network, CPU, disks, protocols, data copying?
  – (no, we didn’t really learn about these, but they’re obvious ☺)

Example: SUN Network File System (NFS)

• The Sun Network File System (NFS) has become a common standard for distributed UNIX file access
• NFS runs over LANs (even over WANs – slowly)
• Basic idea
  – allow a remote directory to be “mounted” (spliced) onto a local directory (see the sketch below)
  – gives access to that remote directory and all its descendants as if they were part of the local hierarchy
• Pretty much exactly like a “local mount” or “link” on UNIX
  – except for implementation and performance …
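As a rough illustration of the splice, here is a minimal Python sketch of client-side path translation through a mount table. It uses the /u4/levy → /students/foo mount from the example below; the helper name resolve() is made up, and real NFS resolves paths inside the kernel’s VFS layer, component by component, rather than by string rewriting.

# Toy mount table: local mount point -> (remote host, remote directory)
MOUNTS = {
    "/students/foo": ("Node1", "/u4/levy"),
}

def resolve(local_path):
    """Return (host, remote_path) if local_path falls under a mount point,
    or (None, local_path) if the local file system should handle it."""
    for mount_point, (host, remote_dir) in MOUNTS.items():
        if local_path == mount_point or local_path.startswith(mount_point + "/"):
            return host, remote_dir + local_path[len(mount_point):]
    return None, local_path

print(resolve("/students/foo/myfile"))   # ('Node1', '/u4/levy/myfile')
print(resolve("/etc/passwd"))            # (None, '/etc/passwd')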
• For instance:
  – I mount /u4/levy on Node1 onto /students/foo on Node2
  – users on Node2 can then access this directory as /students/foo
  – if I had a file /u4/levy/myfile, users on Node2 see it as /students/foo/myfile
• Just as, on a local system, I might link /cse/www/education/courses/451/08au/ as /u4/levy/451 to allow easy access to my web data from my home directory

NFS implementation

• NFS defines a set of RPC operations for remote file access (a client-side stub sketch follows this list):
  – searching a directory
  – reading directory entries
  – manipulating links and directories
  – reading/writing files
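A hedged sketch of what client-side stubs for these operations could look like. The procedure names follow the real NFS protocol (LOOKUP, READDIR, LINK, MKDIR, READ, WRITE), but the transport is only a placeholder and every signature here is illustrative, not the actual Sun RPC/XDR interface.

class NfsClientStub:
    """Client-side stubs for NFS-style RPCs. A real stub would marshal the
    arguments with XDR and send them to the server over Sun RPC; here the
    transport is just a placeholder."""

    def __init__(self, server):
        self.server = server

    def _call(self, procedure, **args):
        # Placeholder for marshalling + network send + reply unmarshalling.
        raise NotImplementedError("would send %s%r to %s" % (procedure, args, self.server))

    # searching a directory: (directory handle, name) -> file handle
    def lookup(self, dir_handle, name):
        return self._call("LOOKUP", dir_handle=dir_handle, name=name)

    # reading directory entries
    def readdir(self, dir_handle, cookie=0):
        return self._call("READDIR", dir_handle=dir_handle, cookie=cookie)

    # manipulating links and directories
    def link(self, file_handle, dir_handle, name):
        return self._call("LINK", file_handle=file_handle, dir_handle=dir_handle, name=name)

    def mkdir(self, dir_handle, name):
        return self._call("MKDIR", dir_handle=dir_handle, name=name)

    # reading/writing files: every call carries a file handle and an explicit
    # offset, so the server keeps no per-client open-file state
    def read(self, file_handle, offset, count):
        return self._call("READ", file_handle=file_handle, offset=offset, count=count)

    def write(self, file_handle, offset, data):
        return self._call("WRITE", file_handle=file_handle, offset=offset, data=data)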
• NFS defines new layers in the Unix file system
  – the virtual file system (VFS) provides a standard interface, using v-nodes as file handles; a v-node describes either a local or remote file
  – [layer diagram: System Call Interface → Virtual File System → UFS (local files) or NFS (remote files); the NFS layer sends RPCs to other (server) nodes and handles RPC requests from remote clients, and server responses; the buffer cache / i-node table sits below]

NFS caching / sharing

• On an open, the client asks the server whether its cached blocks are up to date
• Once a file is open, different clients can write it and get inconsistent data
• Modified data is flushed back to the server every 30 seconds
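A minimal sketch of those caching rules, assuming a client that keys its cache on file handle and uses the server’s last-modification time as the validity check on open. The 30-second constant comes from the slide; everything else (class and method names, the server interface) is an assumption for illustration and ignores the many details of real NFS attribute caching.

import time

FLUSH_INTERVAL = 30.0            # modified data goes back to the server every 30 s

class CachedFile:
    def __init__(self, server_mtime):
        self.blocks = {}         # block number -> bytes
        self.dirty = set()       # block numbers modified locally
        self.server_mtime = server_mtime

class NfsClientCache:
    """Client-side block cache: validate on open, write back periodically."""
    def __init__(self, server):
        self.server = server     # assumed interface: get_mtime, read_block, write_block
        self.files = {}          # file handle -> CachedFile
        self.last_flush = time.monotonic()

    def open(self, handle):
        # On open, ask the server whether our cached blocks are up to date;
        # if the file changed on the server, drop the stale blocks.
        mtime = self.server.get_mtime(handle)
        cached = self.files.get(handle)
        if cached is None or cached.server_mtime != mtime:
            self.files[handle] = CachedFile(server_mtime=mtime)
        return handle

    def read(self, handle, block_no):
        f = self.files[handle]
        if block_no not in f.blocks:
            f.blocks[block_no] = self.server.read_block(handle, block_no)
        return f.blocks[block_no]

    def write(self, handle, block_no, data):
        # Writes land in the local cache first; until the next flush, other
        # clients reading the same file can see stale data.
        f = self.files[handle]
        f.blocks[block_no] = data
        f.dirty.add(block_no)
        self._maybe_flush()

    def _maybe_flush(self):
        if time.monotonic() - self.last_flush < FLUSH_INTERVAL:
            return
        for handle, f in self.files.items():
            for block_no in sorted(f.dirty):
                self.server.write_block(handle, block_no, f.blocks[block_no])
            f.dirty.clear()
        self.last_flush = time.monotonic()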
Example: Berkeley Sprite File System

• Unix file system developed for diskless workstations with large memories at UCB (differs from NFS, AFS)
• Considers memory as a huge cache of disk blocks
  – memory is shared between the file system and VM
• Files are permanently stored on servers
  – servers have a large memory that acts as a cache as well
• Several workstations can cache blocks for read-only files

Example: Google File System (GFS)

[Figure contrasting GFS with conventional distributed file systems such as NFS]
Files in GFS

• Files are huge by traditional standards
• Most files are mutated by appending new data rather than overwriting existing data
• Once written, the files are only read, and often only sequentially
• Appending becomes the focus of performance optimization and atomicity guarantees (see the record-append sketch below)

GFS Setup

[Figure: clients and misc. servers communicate with the GFS master (which has replicas); chunks C0, C1, C2, C3, C5, … are replicated across the chunk servers]
Architecture

• A GFS cluster consists of a single master and multiple chunk servers, and is accessed by multiple clients
• Each of these is typically a commodity Linux machine running a user-level server process
• Files are divided into fixed-size chunks, identified by an immutable and globally unique 64-bit chunk handle
• For reliability, each chunk is replicated on multiple chunk servers
• The master maintains all file system metadata
• The master periodically communicates with each chunk server in HeartBeat messages to give it instructions and collect its state
• Neither the client nor the chunk server caches file data, eliminating cache coherence issues
• Clients do cache metadata, however
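A rough sketch of the master-side state implied by these bullets. The field names, the default replication factor of three, and the shape of the HeartBeat exchange are assumptions for illustration, not the real GFS data structures.

import secrets
import time
from dataclasses import dataclass, field

def new_chunk_handle():
    """Chunk handles are immutable, globally unique 64-bit identifiers."""
    return secrets.randbits(64)

@dataclass
class ChunkInfo:
    handle: int
    replicas: set = field(default_factory=set)   # chunk servers holding a copy

class Master:
    """Single master keeping all file system metadata in memory."""
    def __init__(self, replication_factor=3):
        self.replication_factor = replication_factor
        self.files = {}            # path -> list of ChunkInfo, one per chunk index
        self.last_heartbeat = {}   # chunk server address -> time of last HeartBeat

    def create(self, path):
        self.files[path] = []

    def add_chunk(self, path, candidate_servers):
        """Allocate a new chunk and record which servers should replicate it."""
        info = ChunkInfo(handle=new_chunk_handle(),
                         replicas=set(candidate_servers[:self.replication_factor]))
        self.files[path].append(info)
        return info

    def lookup(self, path, chunk_index):
        """What a client asks for: (chunk handle, replica locations)."""
        info = self.files[path][chunk_index]
        return info.handle, sorted(info.replicas)

    def heartbeat(self, server, held_handles):
        """Periodic HeartBeat: record that the server is alive and which chunk
        handles it actually holds; instructions back to the server (e.g.
        delete an orphaned chunk) would ride on the reply."""
        self.last_heartbeat[server] = time.time()
        for chunks in self.files.values():
            for info in chunks:
                if info.handle in held_handles:
                    info.replicas.add(server)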
Read Process

• A single master vastly simplifies the design
• Clients never read and write file data through the master; instead, a client asks the master which chunk servers it should contact
• Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file
• It sends the master a request containing the file name and chunk index; the master replies with the corresponding chunk handle and the locations of the replicas; the client caches this information using the file name and chunk index as the key
• The client then sends a request to one of the replicas, most likely the closest one; the request specifies the chunk handle and a byte range within that chunk (see the code sketch after this list)

Specifications

• Chunk size = 64 MB
• Chunks are stored as plain Unix files on the chunk servers
• Clients hold a persistent TCP connection to the chunk server over an extended period of time (reduces network overhead)
• Clients cache all the chunk location information to facilitate small random reads
• The master keeps the metadata in memory
• Disadvantage – small files become hotspots
• Solution – higher replication for such files
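The read path just described maps almost directly onto code. Below is a hedged sketch assuming a master object with the lookup() interface sketched earlier and chunk servers that answer (chunk handle, offset, length) requests; the replica-selection policy and all names are illustrative.

CHUNK_SIZE = 64 * 1024 * 1024    # 64 MB

class GfsClient:
    def __init__(self, master, chunkservers):
        self.master = master               # answers lookup(file_name, chunk_index)
        self.chunkservers = chunkservers   # address -> object with read(handle, offset, length)
        self.location_cache = {}           # (file_name, chunk_index) -> (handle, replicas)

    def read(self, file_name, offset, length):
        """Read `length` bytes starting at `offset`, one chunk at a time."""
        data = b""
        while length > 0:
            # 1. The fixed chunk size turns (file name, byte offset) into a chunk index.
            chunk_index = offset // CHUNK_SIZE
            within = offset % CHUNK_SIZE
            n = min(length, CHUNK_SIZE - within)

            # 2. Ask the master (or our cache) for the chunk handle and replicas,
            #    keyed by (file name, chunk index).
            key = (file_name, chunk_index)
            if key not in self.location_cache:
                self.location_cache[key] = self.master.lookup(file_name, chunk_index)
            handle, replicas = self.location_cache[key]

            # 3. Send (chunk handle, byte range) to one replica -- ideally the closest.
            data += self.chunkservers[self._pick_replica(replicas)].read(handle, within, n)

            offset += n
            length -= n
        return data

    def _pick_replica(self, replicas):
        # Placeholder policy; a real client prefers a nearby replica.
        return replicas[0]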