Distributed Systems
Principles and Paradigms
Maarten van Steen
VU Amsterdam, Dept. Computer Science
Room R4.20, [email protected]

Chapter 11: Distributed File Systems
Version: December 4, 2011
Contents
Chapter
01: Introduction
02: Architectures
03: Processes
04: Communication
05: Naming
06: Synchronization
07: Consistency & Replication
08: Fault Tolerance
09: Security
10: Distributed Object-Based Systems
11: Distributed File Systems
12: Distributed Web-Based Systems
13: Distributed Coordination-Based Systems
Distributed File Systems 11.1 Architecture
Distributed File Systems
General goal
Try to make a file system transparently available to remote clients.
[Figure: Two models for accessing remote files. In the remote access model, the file stays on the server and every client request operates on the remote file. In the upload/download model, (1) the file is moved to the client, (2) accesses are done on the client, and (3) when the client is done, the new file is returned to the server.]
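To make the contrast concrete, a minimal client-side sketch of a write under each model; the server object and its remote_write/fetch_file/store_file methods are hypothetical stand-ins, not a real protocol:

```python
# Hypothetical server interface, sketching the two access models.

def remote_access_write(server, path, offset, data):
    # Remote access model: the file stays on the server;
    # every operation becomes a request to the server.
    server.remote_write(path, offset, data)

def upload_download_write(server, path, offset, data):
    # Upload/download model: move the whole file to the client,
    # operate locally, ship the new version back when done.
    local = server.fetch_file(path)                              # 1. file moved to client
    local = local[:offset] + data + local[offset + len(data):]   # 2. access done locally
    server.store_file(path, local)                               # 3. file returned to server
```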
Example: NFS Architecture
NFS
NFS is implemented on top of the Virtual File System (VFS) abstraction, which is by now used in many different operating systems.
[Figure: Basic NFS architecture. On both client and server, a system call layer sits on top of a virtual file system (VFS) layer. On the client, the VFS dispatches either to the local file system interface or to the NFS client, which talks through an RPC client stub across the network to the server's RPC server stub, NFS server, and local file system interface.]
Example: NFS Architecture
Essence
VFS provides a standard file system interface, and allows hiding the difference between accessing a local or a remote file system.
Question
Is NFS actually a file system?
NFS File Operations
Oper. v3 v4 Description
Create Yes No Create a regular file
Create No Yes Create a nonregular file
Link Yes Yes Create a hard link to a file
Symlink Yes No Create a symbolic link to a file
Mkdir Yes No Create a subdirectory
Mknod Yes No Create a special file
Rename Yes Yes Change the name of a file
Remove Yes Yes Remove a file from a file system
Rmdir Yes No Remove an empty subdirectory
Open No Yes Open a file
Close No Yes Close a file
Lookup Yes Yes Look up a file by means of a name
Readdir Yes Yes Read the entries in a directory
Readlink Yes Yes Read the path name in a symbolic link
Getattr Yes Yes Get the attribute values for a file
Setattr Yes Yes Set one or more file-attribute values
Read Yes Yes Read the data contained in a file
Write Yes Yes Write data to a file
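As a hedged illustration of how these operations compose in v3 (note that Open and Close do not exist there), a sketch of pathname resolution followed by a read; the nfs object and its methods are hypothetical stand-ins for the RPC procedures, not a real NFS client:

```python
# Hypothetical stand-ins for NFS v3 RPC procedures; not a real library.

def read_by_path(nfs, root_fh, path, offset, count):
    # Names are resolved one component at a time with Lookup,
    # each call returning the file handle of the next component.
    fh = root_fh
    for component in path.strip("/").split("/"):
        fh = nfs.lookup(fh, component)
    # Getattr fetches the attributes, here used to clamp the read size.
    attrs = nfs.getattr(fh)
    count = min(count, attrs.size - offset)
    # Read operates directly on the handle; v3 has no Open/Close.
    return nfs.read(fh, offset, count)
```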
Cluster-Based File Systems
Observation
With very large data collections, following a simple client-server approach is not going to speed up file accesses ⇒ apply striping techniques by which files can be fetched in parallel (a sketch of a parallel striped read follows the figure).
[Figure: Distributing whole files (a through e) across three servers versus a file-striped system, in which each file's blocks are spread over the servers so that a single file can be fetched from all servers in parallel.]
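A minimal sketch of the parallel fetch that striping enables; the server objects, their read_block method, and the round-robin placement of block i on server i mod N are all assumptions:

```python
# Sketch of reading a striped file in parallel; not a real protocol.
from concurrent.futures import ThreadPoolExecutor

def striped_read(servers, filename, num_blocks):
    # Assumed placement: block i of the file lives on server i mod N,
    # so all servers can be read from concurrently.
    def fetch(i):
        return servers[i % len(servers)].read_block(filename, i)
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        blocks = list(pool.map(fetch, range(num_blocks)))
    return b"".join(blocks)
```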
Example: Google File System
[Figure: Organization of a Google cluster of servers. A GFS client sends (file name, chunk index) to the master and gets back a contact address; chunk data itself (chunk ID, byte range) is exchanged directly with the chunk servers, each of which stores its chunks in a local Linux file system. The master only sends instructions to, and collects state from, the chunk servers.]
The Google solution
Divide files into large 64 MB chunks, and distribute/replicate chunks across many servers:
The master maintains only a (file name, chunk server) table in main memory ⇒ minimal I/O
Files are replicated using a primary-backup scheme; the master is kept out of the loop
A sketch of the resulting client read path follows below.
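This is a hedged sketch based only on the description above; master.locate and the chunk server's read method are hypothetical, not Google's actual API:

```python
# GFS-style read path, sketched; all interfaces are assumptions.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

def gfs_read(master, filename, offset, length):
    data = b""
    while length > 0:
        chunk_index = offset // CHUNK_SIZE
        # 1. Ask the master only for the chunk's location: a cheap,
        #    in-memory table lookup on the master's side.
        chunk_server, chunk_id = master.locate(filename, chunk_index)
        # 2. Fetch the byte range directly from the chunk server;
        #    the master stays out of the data path.
        start = offset % CHUNK_SIZE
        n = min(length, CHUNK_SIZE - start)
        data += chunk_server.read(chunk_id, start, n)
        offset += n
        length -= n
    return data
```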
P2P-based File Systems
[Figure: Organization of the Ivy distributed file system. The file system layer (Ivy) runs on top of a block-oriented storage layer (DHash), which in turn runs on a DHT layer (Chord); one node acts as the root of the file system.]
Basic idea
Store data blocks in the underlying P2P system:
Every data block with content D is stored on a node with hash h(D) ⇒ allows for an integrity check.
Public-key blocks are signed with the associated private key and looked up with the public key.
A local log of file operations keeps track of ⟨blockID, h(D)⟩ pairs (a sketch follows below).
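A minimal sketch of content-hash blocks over a DHash-like put/get interface; the dht object and the choice of SHA-1 are assumptions:

```python
# Content-hash blocks, sketched; the dht interface is hypothetical.
import hashlib

def put_block(dht, data: bytes) -> bytes:
    # The block is stored under the hash of its own content ...
    key = hashlib.sha1(data).digest()
    dht.put(key, data)
    return key  # recorded as a <blockID, h(D)> pair in the local log

def get_block(dht, key: bytes) -> bytes:
    data = dht.get(key)
    # ... so any node can verify integrity by rehashing what it received.
    if hashlib.sha1(data).digest() != key:
        raise ValueError("block failed integrity check")
    return data
```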
Distributed File Systems 11.5 Synchronization
File sharing semantics
Problem
When dealing with distributed file systems, we need to take into account the ordering of concurrent read/write operations and the expected semantics (i.e., consistency).

[Figure: (a) On a single machine, when process B reads after process A has appended "c" to the original file "ab", it gets "abc". (b) In a distributed file system, client #1 reads "ab" from the file server and appends "c" to its local copy; a subsequent read by client #2 still returns the server's original "ab".]
File sharing semantics
Semantics
UNIX semantics: a read operation returns the effect of the last write operation ⇒ can be implemented only in remote access models in which there is a single copy of the file.
Transaction semantics: the file system supports transactions on a single file ⇒ the issue is how to allow concurrent access to a physically distributed file.
Session semantics: the effects of read and write operations are seen only by the client that has opened (a local copy of) the file ⇒ what happens when the file is closed (only one client's copy may actually win)? A sketch follows below.
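A minimal sketch of session semantics, assuming a server object with fetch/store methods; note how the last close silently wins:

```python
# Session semantics, sketched: all changes go to a local copy and
# become visible only on close, when the whole file is written back.

class Session:
    def __init__(self, server, path):
        self.server, self.path = server, path
        self.data = bytearray(server.fetch(path))  # open: download a copy

    def write(self, offset, chunk):
        # Updates are visible only inside this session until close.
        self.data[offset:offset + len(chunk)] = chunk

    def close(self):
        # Published atomically on close; with two concurrent sessions
        # on the same file, the one that closes last silently wins.
        self.server.store(self.path, bytes(self.data))
```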
Example: File sharing in Coda
Essence
Coda assumes transactional semantics, but without the full-fledged capabilities of real transactions. Note: transactional issues reappear in the form of the question whether "this ordering could have taken place."
[Figure: The transactional behavior in sharing files in Coda. One client opens file f for reading (session S_A) and receives a copy from the server. A second client then opens f for writing (session S_B) and closes. The server sends an invalidate to the first client, but session S_A runs to completion on its copy, as if the two sessions had been serialized.]
Distributed File Systems 11.6 Consistency and Replication
Consistency and replication
Observation
In modern distributed file systems, client-side caching is the preferred
technique for attaining performance; server-side replication is done for fault
tolerance.
Observation
Clients are allowed to keep (large parts of) a file, and are notified when control is withdrawn ⇒ servers are now generally stateful.
[Figure: File delegation. (1) The client asks for a file; (2) the server delegates the file, and the client works on a local copy; (3) when another client needs the file, the server recalls the delegation; (4) the client returns the (updated) file.]
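A hedged sketch of the stateful bookkeeping this requires on the server; the client object's recall() callback is a hypothetical stand-in for the actual recall mechanism:

```python
# Delegation bookkeeping, sketched; all interfaces are assumptions.

class DelegatingServer:
    def __init__(self, files):
        self.files = files   # path -> current contents
        self.holder = {}     # path -> client holding the delegation

    def open(self, client, path):
        other = self.holder.get(path)
        if other is not None and other is not client:
            # 3. Recall the delegation from the current holder, who
            # 4. returns its (possibly updated) copy of the file.
            self.files[path] = other.recall(path)
            del self.holder[path]
        # 1./2. Delegate the file to the requesting client.
        self.holder[path] = client
        return self.files[path]
```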
Example: Client-side caching in Coda
[Figure: Client-side caching in Coda. Client A opens file f for reading (session S_A) and receives f together with a callback promise. Client B then opens f for writing (session S_B) and closes; the server breaks A's callback with an invalidate. When A opens f again, the file is transferred anew; when B opens f for writing again, the server replies "OK (no file transfer)", since B's cached copy is still valid.]
Note
By making use of transactional semantics, it becomes possible to
further improve performance.
Example: Server-side replication in Coda
[Figure: Server-side replication in Coda. A file f is stored at servers S1, S2, and S3. A broken network partitions the servers: client A can reach only S1 and S2, client B only S3.]
Main issue
Ensure that concurrent updates are detected:
Each client has an Accessible Volume Storage Group (AVSG): the subset of the actual VSG that the client can reach.
Version vector: CVV_i(f)[j] = k means that server S_i knows that server S_j has seen version k of f.
Example: A updates f ⇒ S1 = S2 = [+1, +1, +0]; B updates f ⇒ S3 = [+0, +0, +1]. Neither resulting vector dominates the other, so the concurrent updates are detected as a conflict when the partition heals (see the sketch below).
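A minimal sketch of version-vector comparison and conflict detection; the list encoding and the initial vector [1, 1, 1] are assumptions consistent with the example:

```python
# CVV comparison, sketched; representation is an assumption.

def dominates(v, w):
    # v dominates w if it is at least as recent in every entry.
    return all(vi >= wi for vi, wi in zip(v, w))

def compare(v, w):
    if dominates(v, w):
        return "first is newer or equal"
    if dominates(w, v):
        return "second is newer"
    return "conflict: concurrent updates detected"

# A updates f at S1/S2 while B updates f at S3 during the partition:
cvv_s1 = [2, 2, 1]  # [1, 1, 1] after increments [+1, +1, +0]
cvv_s3 = [1, 1, 2]  # [1, 1, 1] after increments [+0, +0, +1]
print(compare(cvv_s1, cvv_s3))  # -> conflict: concurrent updates detected
```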
Distributed File Systems 11.7 Fault Tolerance
High availability in P2P systems
Problem
There are many fully decentralized file-sharing systems, but because churn is high (i.e., nodes come and go all the time), we may face an availability problem ⇒ replicate files all over the place (replication factor r_rep).
Alternative
Apply erasure coding:
Partition a file F into m fragments, and recode them into a collection of n > m fragments.
Property: any m fragments from the collection are sufficient to reconstruct F.
Replication factor: r_ec = n/m. A minimal sketch of the idea follows below.
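A minimal sketch using the simplest possible erasure code, a single XOR parity fragment (n = m + 1, so r_ec = (m + 1)/m, tolerating one lost fragment); real systems use stronger codes such as Reed-Solomon:

```python
# Simplest erasure code: m data fragments plus one XOR parity fragment.
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_fragments):
    # n = m + 1 fragments: the data fragments plus their XOR parity.
    return data_fragments + [reduce(xor_bytes, data_fragments)]

def recover(fragments):
    # fragments holds n entries, at most one of them None (lost);
    # XORing the m survivors reconstructs the missing one.
    survivors = [f for f in fragments if f is not None]
    missing = reduce(xor_bytes, survivors)
    return [missing if f is None else f for f in fragments]

# Example: m = 3 fragments, lose one, recover it from any m survivors.
frags = encode([b"dist", b"ribu", b"ted!"])
frags[1] = None
assert recover(frags)[1] == b"ribu"
```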
Replication vs. erasure coding
Comparison
With an average node availability a and a required file unavailability ε, we have for erasure coding:

$1 - \varepsilon = \sum_{i=m}^{r_{ec} m} \binom{r_{ec} m}{i} a^i (1-a)^{r_{ec} m - i}$

and for file replication:

$1 - \varepsilon = 1 - (1-a)^{r_{rep}}$

[Figure: comparison of the required replication factors r_rep and r_ec as a function of node availability a (0.2 to 1; vertical axis roughly 1.4 to 2.2); erasure coding consistently requires the smaller factor.]
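A numeric check of the two formulas above (the parameter values are only an illustration): with the same storage overhead of 3, erasure coding already beats plain replication at a = 0.5.

```python
# Evaluating the two availability formulas; math.comb is standard Python.
from math import comb

def avail_erasure(a, m, r_ec):
    # Probability that at least m of the n = r_ec * m fragments sit on
    # available nodes, i.e. that the file can be reconstructed.
    n = round(r_ec * m)
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(m, n + 1))

def avail_replication(a, r_rep):
    # Probability that at least one of the r_rep replicas is available.
    return 1 - (1 - a)**r_rep

a = 0.5  # each node is up half the time
print(avail_replication(a, r_rep=3))   # ~0.875
print(avail_erasure(a, m=5, r_ec=3))   # ~0.941, same storage overhead
```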