Chap 6 Distributed File System
Storage service
Disk service: giving a transparent view of distributed disks.
Block service: giving the same logical view of disk-accessing units.
True file service
File-accessing mechanism: deciding where remote files are managed and what unit of data is transferred (at the server or the client? file, block, or byte?)
File-sharing semantics: providing file-update semantics similar to, but weaker than, Unix
File-caching mechanism: improving performance/scalability
File-replication mechanism: improving performance/availability
Name service
Mapping between textual file names and references to files (i.e., file IDs)
Directory service
DFS Desirable Features
Transparency: should include structure, access, naming, and replication
transparency.
User mobility: should not force a user to work on a specific node.
Performance: should be comparable to that of a centralized file system.
Simplicity: should give the same semantics as a centralized file system.
Scalability: should cope with the growth of nodes.
Fault tolerance: should not stop on a failure and should maintain backup copies.
Synchronization: should complete concurrent access requests consistently.
Security: should protect files from network intruders.
Heterogeneity: should allow a variety of nodes to share files in different
storage media
File Models
Unstructured and Structured Files
Uninterpreted sequence of bytes: UNIX and MS-DOS
Non-indexed records: IBM mainframe
Indexed records such as B-tree: Research Storage System(RSS) and
Oracle
Mutable and Immutable Files
Mutable: a single stored sequence altered by each update (ex. Unix
and MSDOS)
Immutable: a history of immutable versions, a new one created on every update (ex. Cedar File System)
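The two models can be contrasted with a short Python sketch (an illustration only, not from the slides; the class and method names are invented): a mutable file alters its single stored sequence in place, while an immutable file keeps a history of versions.

```python
# Hypothetical sketch contrasting the mutable and immutable file models.

class MutableFile:
    """Unix/MS-DOS style: a single stored sequence altered by each update."""
    def __init__(self, data=b""):
        self.data = data

    def append(self, chunk):
        self.data += chunk              # the one and only copy changes in place


class ImmutableFile:
    """Cedar-style: every update creates a new immutable version."""
    def __init__(self, data=b""):
        self.versions = [data]          # version history; entries never change

    def append(self, chunk):
        self.versions.append(self.versions[-1] + chunk)

    def latest(self):
        return self.versions[-1]
```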
File-Accessing Models
Accessing Remote Files
Remote service model (file accessed at the server)
  Merits: simple implementation
  Demerits: communication overhead
Data caching model (a copy of the file cached at the client)
  Merits: reduced network traffic
  Demerits: cache consistency problem
Unit of Data Transfer
File-level transfer
  Merits: simple, less communication overhead, immune to server crashes
  Demerits: the client must have large storage space
Block-level transfer
  Merits: the client need not have large storage space
  Demerits: more network traffic/overhead
Byte-level transfer
  Merits: flexibility maximized
  Demerits: difficult cache management to handle variable-length data
Record-level transfer
  Merits: handles structured and indexed files
  Demerits: more network traffic; more overhead to reconstruct a file
File-Sharing Semantics
Defines when modifications to file data made by one user become observable by other users:
1. Unix semantics
2. Session Semantics
3. Immutable shared-files semantics
4. Transaction-like semantics
File-Sharing Semantics
Unix Semantics
Absolute Ordering
Figure: Unix semantics require an absolute ordering of operations. A read must reflect every append that preceded it in global time, regardless of which client issued it (the file grows a b → a b c → a b c d → a b c d e over t1..t6), even in the presence of network delays between Client A and Client B.
File-Sharing Semantics
Session Semantics
Figure: with session semantics, Clients A, B, and C each open the file (initially a b) and append to their own session copies; a client's updates reach the server only when it closes the file, so concurrent sessions overwrite one another and the server's final contents depend on the order of the close operations.
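A minimal Python sketch of session semantics follows (a toy illustration under stated assumptions: a single in-memory server object and invented class names). It shows why the last close wins.

```python
# Hypothetical sketch of session semantics: each open() takes a private
# snapshot, writes stay local, and the whole file is written back on close().

class Server:
    def __init__(self, data=b"ab"):
        self.data = data

class Session:
    def __init__(self, server):
        self.server = server
        self.copy = server.data        # snapshot taken at open time

    def append(self, chunk):
        self.copy += chunk             # visible only to this session

    def close(self):
        self.server.data = self.copy   # whole file replaces the server copy

server = Server()
a, b = Session(server), Session(server)   # two concurrent sessions
a.append(b"c"); b.append(b"x")
a.close()                                  # server now holds b"abc"
b.close()                                  # b's close overwrites: b"abx"
print(server.data)                         # b'abx' -- the last close wins
```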
File-Sharing Semantics
Transaction-Like Semantics (Concurrency Control)
Figure: optimistic (transaction-like) concurrency control. With backward validation, a committing transaction's reads (R1, R2, R4, ...) are compared with the writes of transactions that committed earlier; with forward validation, its writes (W3, W5, W6, ...) are compared with the reads of transactions that are still active. On a conflict, the transaction either aborts itself (and possibly restarts) or aborts the conflicting active transactions before committing.
Immutable Shared-Files Semantics
Figure: two clients each produce a tentative new file based on version 1.0; when both are stored, a version conflict arises. How the conflict is resolved depends on the file system: abort one version (abortion is simple, and the aborted client can later decide to overwrite by changing the corresponding directory entry), ignore the conflict, or merge the versions.
File-Caching Schemes
Cache Location
No caching (file kept only on the server's disk)
  Merits: no modifications needed
  Demerits: frequent disk access, busy network traffic
Cache in the server's main memory
  Merits: one-time disk access, easy implementation, Unix-like file-sharing semantics
  Demerits: busy network traffic
Cache on the client's disk
  Merits: one-time network access, no size restriction
  Demerits: cache consistency problem, weaker file-access semantics, frequent disk access, no support for diskless workstations
Cache in the client's main memory
  Merits: maximum performance, works for diskless workstations, scalability
  Demerits: size restriction, cache consistency problem, weaker file-access semantics
File-Caching Schemes
Modification Propagation
Write-through scheme: every write on the client's cached copy is immediately propagated to the server's disk file.
  Pros: Unix-like semantics and high reliability
  Cons: poor write performance
Delayed-write scheme: writes are applied to the cached copy and propagated later, on cache displacement, periodically, or on close.
  Pros: write accesses complete quickly; some writes are absorbed by subsequent writes; gathering writes together mitigates network overhead
  Cons: delaying write propagation results in fuzzier file-sharing semantics
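A rough Python sketch of the delayed-write (write-back) idea, assuming a toy server object and invented names, shows how repeated writes to the same block are absorbed before one propagation.

```python
# Hypothetical write-back cache: writes are applied locally and marked dirty;
# dirty blocks are propagated only on flush (e.g., periodic timer or close).

class ToyServer:
    def __init__(self):
        self.store = {}
    def put_block(self, block_no, data):
        self.store[block_no] = data

class WriteBackCache:
    def __init__(self, server):
        self.server = server
        self.blocks = {}     # block_no -> data
        self.dirty = set()   # block numbers modified since the last flush

    def write(self, block_no, data):
        self.blocks[block_no] = data   # completes quickly, no network traffic
        self.dirty.add(block_no)       # repeated writes are absorbed here

    def flush(self):
        for block_no in sorted(self.dirty):
            self.server.put_block(block_no, self.blocks[block_no])
        self.dirty.clear()

cache = WriteBackCache(ToyServer())
cache.write(0, b"v1")
cache.write(0, b"v2")   # overwrites v1 in the cache; only v2 reaches the server
cache.flush()
```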
File-Caching Schemes
Cache Validation Schemes – Client-Initiated Approach
The client checks with the server whether its cached copy is still valid:
  Checking before every access: Unix-like semantics, but too slow
  Checking periodically: better performance, but fuzzy file-sharing semantics
  Checking on file open: simple, and suitable for session semantics (with write-on-close)
Problem: high network traffic
Figure: Client 1 propagates its writes to the server's disk file (write-through, delayed write, or write-on-close), while Client 2 validates its cached copy either before every access or on open.
File-Caching Schemes
Cache Validation Schemes – Server-Initiated Approach
Figure: the server holds the disk file while Clients 1-4 hold cached copies in main memory; when Client 1 writes (write-through or delayed write), the server can deny or queue a new open request and notify (invalidate) the copies held by the other clients.
Keeping track of clients having a copy
Denying a new request, queuing it, and disabling caching
Notifying all clients of any update on the original file
Problems:
Violates the client-server model (the server initiates contact with clients)
Requires stateful servers
Check-on-open is still needed for the second opening of a file
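A small Python sketch of server-initiated invalidation, loosely in the spirit of this slide; the server and client objects and their methods are hypothetical.

```python
# Hypothetical stateful server that tracks which clients cache which file
# and notifies (invalidates) them when the file is updated.

class Client:
    def __init__(self, name):
        self.name = name
        self.valid_cache = set()          # files whose cached copy is valid

    def invalidate(self, filename):
        self.valid_cache.discard(filename)

class StatefulServer:
    def __init__(self):
        self.cachers = {}                 # filename -> set of caching clients

    def open(self, client, filename):
        self.cachers.setdefault(filename, set()).add(client)
        client.valid_cache.add(filename)

    def write(self, writer, filename):
        # Notify every other client holding a copy that it is now stale.
        for c in self.cachers.get(filename, set()):
            if c is not writer:
                c.invalidate(filename)

server = StatefulServer()
c1, c2 = Client("c1"), Client("c2")
server.open(c1, "f"); server.open(c2, "f")
server.write(c1, "f")
print("f" in c2.valid_cache)   # False -- c2 must refetch before using its copy
```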
DFS: Case Studies
NFS (Network File System)
Developed by Sun Microsystems (in 1985)
Most popular, open, and widely used.
NFS protocol standardized through IETF (RFC 1813)
AFS (Andrew File System)
Developed by Carnegie Mellon University as part of Andrew
distributed computing environments (in 1986)
A research project to create a campus-wide file system.
Public domain implementation is available on Linux
(LinuxAFS)
It was adopted as a basis for the DCE/DFS file system in the Open Software Foundation's (OSF, www.opengroup.org) DCE (Distributed Computing Environment).
Case Study: Sun NFS
Figure 8 shows the architecture of Sun NFS.
NFS architecture
Figure 8. NFS architecture (remote operations): on the client computer, application programs issue UNIX system calls to the UNIX kernel, whose virtual file system layer directs operations on local files to the local UNIX file system and operations on remote files to the NFS client module; the NFS client talks to the NFS server module in the server computer's kernel over the NFS protocol, and the server's virtual file system passes those operations on to its local UNIX file system (fh = file handle).
Case Study: Sun NFS
A simplified representation of the RPC
interface provided by NFS version 3
servers is shown in Figure 9.
Case Study: Sun NFS
• read(fh, offset, count) -> attr, data
• write(fh, offset, count, data) -> attr
• create(dirfh, name, attr) -> newfh, attr
• remove(dirfh, name) -> status
• getattr(fh) -> attr
• setattr(fh, attr) -> attr
• lookup(dirfh, name) -> fh, attr
• rename(dirfh, name, todirfh, toname)
• link(newdirfh, newname, dirfh, name)
• readdir(dirfh, cookie, count) -> entries
• symlink(newdirfh, newname, string) -> status
• readlink(fh) -> string
• mkdir(dirfh, name, attr) -> newfh, attr
• rmdir(dirfh, name) -> status
• statfs(fh) -> fsstats
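As a sketch only (not the real Sun RPC stubs), the following Python fragment shows how a client-side module might drive the lookup operation from this interface; FakeNFS and its contents are invented stand-ins.

```python
# Sketch of how a client module might drive the NFS-style operations above.
# FakeNFS is a stand-in for the RPC stubs; real NFS uses Sun RPC, not Python.

class FakeNFS:
    """Toy in-memory 'server': directories map names to file handles."""
    def __init__(self):
        self.dirs = {"root": {"usr": "usr_fh"}, "usr_fh": {"lib": "lib_fh"}}
        self.attrs = {"lib_fh": {"size": 0}}

    def lookup(self, dirfh, name):
        fh = self.dirs[dirfh][name]
        return fh, self.attrs.get(fh, {})

def resolve(nfs, root_fh, path):
    """Resolve /usr/lib with one lookup() RPC per path component.

    This step-by-step resolution is why lookup dominates NFS server load.
    """
    fh = root_fh
    for name in path.strip("/").split("/"):
        fh, attr = nfs.lookup(fh, name)
    return fh

print(resolve(FakeNFS(), "root", "/usr/lib"))   # -> 'lib_fh'
```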
Case Study: Sun NFS
NFS access control and authentication
The NFS server is a stateless server, so the user's identity and access rights must be checked by the server on each request.
In the local file system they are checked only on the
file’s access permission attribute.
Every client request is accompanied by the userID
and groupID
They are not shown in Figure 8 because they are inserted automatically by the RPC system.
Kerberos has been integrated with NFS to provide
a stronger and more comprehensive security
solution.
Case Study: Sun NFS
Mount service
Mount operation:
mount(remotehost, remotedirectory, localdirectory)
Server maintains a table of clients who have
mounted filesystems at that server.
Each client maintains a table of mounted file
systems holding:
< IP address, port number, file handle>
Remote file systems may be hard-mounted or
soft-mounted in a client computer.
Figure 10 illustrates a client with two remotely mounted file stores.
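A hedged Python sketch of a client-side mount table consistent with the tuple above; the hard/soft distinction is reduced to a flag, and all names (MountEntry, mount) are hypothetical.

```python
# Hypothetical client-side mount table: local mount point -> (server address,
# port, file handle of the remote directory), plus the hard/soft mount choice.

from dataclasses import dataclass

@dataclass
class MountEntry:
    server_ip: str
    port: int
    root_fh: bytes
    hard: bool = True   # hard mount: retry forever; soft mount: fail after N tries

mount_table = {}

def mount(remotehost, remotedirectory, localdirectory, *, hard=True):
    # In real NFS the mount service on the server returns the file handle
    # for remotedirectory; here we fake it for illustration.
    fh = f"{remotehost}:{remotedirectory}".encode()
    mount_table[localdirectory] = MountEntry(remotehost, 2049, fh, hard)

mount("server1", "/export/people", "/usr/students")
mount("server2", "/nfs/users", "/usr/staff", hard=False)
print(mount_table["/usr/students"].root_fh)
```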
Case Study: Sun NFS
Figure 10. Local and remote file systems accessible on an NFS client: the directory trees of Server 1, the client, and Server 2, with the client's /usr/students and /usr/staff remotely mounted from the two servers.
Note: the file system mounted at /usr/students in the client is actually the sub-tree located at /export/people in Server 1; the file system mounted at /usr/staff in the client is actually the sub-tree located at /nfs/users in Server 2.
Case Study: Sun NFS
Automounter: provides a simple form of replication for read-only filesystems.
E.g. if there are several servers with identical copies
of /usr/lib then each server will have a chance of
being mounted at some clients.
Case Study: Sun NFS
Server caching
Similar to UNIX file caching for local files:
pages (blocks) from disk are held in a main memory
buffer cache until the space is required for newer
pages. Read-ahead and delayed-write optimizations.
For local files, writes are deferred to next sync event
(30 second intervals).
This works well in the local context, where files are always accessed through the local cache, but in the remote case it does not offer the necessary synchronization guarantees to clients.
Case Study: Sun NFS
NFS v3 servers offer two strategies for updating the disk:
Write-through - altered pages are written to
disk as soon as they are received at the
server. When a write() RPC returns, the
NFS client knows that the page is on the
disk.
Delayed commit - pages are held only in the
cache until a commit() call is received for
the relevant file. This is the default mode
used by NFS v3 clients. A commit() is
issued by the client whenever a file is
closed.
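To make the two strategies concrete, here is a hedged Python sketch of a server-side page cache that can run in either write-through or delayed-commit mode; the class and methods are invented for illustration and are not actual NFS server code.

```python
# Hypothetical NFS-v3-style server page cache with two update strategies:
#   write-through: each write() RPC reaches the disk before it returns;
#   delayed commit: pages stay in memory until a commit() RPC for the file.

class ServerCache:
    def __init__(self, write_through=False):
        self.write_through = write_through
        self.memory = {}          # (fh, page_no) -> data
        self.disk = {}            # persistent store (stand-in for the disk)

    def write(self, fh, page_no, data):
        self.memory[(fh, page_no)] = data
        if self.write_through:
            self.disk[(fh, page_no)] = data   # on return, the page is on disk

    def commit(self, fh):
        # Flush all cached pages of this file; issued by the client on close().
        for (f, page_no), data in self.memory.items():
            if f == fh:
                self.disk[(f, page_no)] = data
```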
Case Study: Sun NFS
Client caching
Server caching does nothing to reduce RPC traffic between client and server; further optimization is essential to reduce server load in large networks.
The NFS client module caches the results of read, write, getattr, lookup and readdir operations.
Synchronization of file contents (one-copy semantics) is not guaranteed when two or more clients are sharing the same file.
Case Study: Sun NFS
Timestamp-based validity check
It reduces inconsistency, but doesn't
eliminate it.
It is used as the validity condition for cache entries at the client:
(T - Tc < t) ∨ (Tm_client = Tm_server)
where
t: freshness guarantee
Tc: time when the cache entry was last validated
Tm: time when the block was last updated at the server
T: current time
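A minimal Python sketch of this check (assumptions: the client stores Tc and Tm_client per cache entry and can ask the server for Tm_server via a getattr-style call; the helper names are hypothetical): the entry is valid if it was validated within the last t seconds, or if its recorded modification time still matches the server's.

```python
import time

# Hypothetical cache-entry validity check implementing
#   (T - Tc < t)  or  (Tm_client == Tm_server)

def entry_is_valid(entry, get_server_mtime, t=3.0):
    T = time.time()
    if T - entry["Tc"] < t:              # validated recently enough: trust it
        return True
    Tm_server = get_server_mtime(entry["fh"])   # one getattr-style RPC
    if entry["Tm_client"] == Tm_server:  # not modified at the server
        entry["Tc"] = T                  # revalidated: reset the clock
        return True
    return False                         # stale: the block must be refetched
```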
Case Study: Sun NFS
t is configurable (per file) but is typically set
to 3 seconds for files and 30 secs. for
directories.
It remains difficult to write distributed applications that share files with NFS.
Case Study: Sun NFS
Other NFS optimizations
Sun RPC runs over UDP by default (can use TCP
if required).
Uses UNIX BSD Fast File System with 8-kbyte
blocks.
read() and write() operations can be of any size (negotiated between client and server).
The guaranteed freshness interval t is set
adaptively for individual files to reduce getattr()
calls needed to update Tm.
File attribute information (including Tm) is
piggybacked in replies to all file requests.
Case Study: Sun NFS
NFS performance
Early measurements (1987) established that:
Write() operations are responsible for only 5% of
server calls in typical UNIX environments.
– hence write-through at server is acceptable.
Lookup() accounts for 50% of operations, due to the step-by-step pathname resolution necessitated by the naming and mounting semantics.
More recent measurements (1993) show high
performance.
see www.spec.org for more recent measurements.
Case Study: Sun NFS
NFS summary
NFS is an excellent example of a simple,
robust, high-performance distributed
service.
Achievement of the various transparencies is another goal of NFS:
Access transparency:
– The API is the UNIX system call interface for
both local and remote files.
Case Study: Sun NFS
Location transparency:
– Naming of filesystems is controlled by client
mount operations, but transparency can be
ensured by an appropriate system configuration.
Mobility transparency:
– Hardly achieved; relocation of files is not
possible, relocation of filesystems is possible,
but requires updates to client configurations.
Scalability transparency:
– File systems (file groups) may be subdivided
and allocated to separate servers.
Ultimately, the performance limit is determined
by the load on the server holding the most
heavily-used filesystem (file group).
Case Study: Sun NFS
Replication transparency:
– Limited to read-only file systems; for writable
files, the SUN Network Information Service (NIS)
runs over NFS and is used to replicate essential
system files.
Hardware and software operating system
heterogeneity:
– NFS has been implemented for almost every
known operating system and hardware platform
and is supported by a variety of filing systems.
Fault tolerance:
– Limited but effective; service is suspended if a
server fails. Recovery from failures is aided by
the simple stateless design.
Case Study: Sun NFS
Consistency:
– It provides a close approximation to one-copy
semantics and meets the needs of the vast
majority of applications.
– But the use of file sharing via NFS for
communication or close coordination between
processes on different computers cannot be
recommended.
Security:
– Recent developments include the option to use
a secure RPC implementation for authentication
and the privacy and security of the data
transmitted with read and write operations.
Case Study: Sun NFS
Efficiency:
–NFS protocols can be implemented for use in
situations that generate very heavy loads.
Case Study: The Andrew File System (AFS)
Like NFS, AFS provides transparent
access to remote shared files for UNIX
programs running on workstations.
AFS is implemented as two software components that exist as UNIX processes, called Vice and Venus.
(Figure 11)
Case Study: The Andrew File System (AFS)
Figure 11. Distribution of processes in AFS: each workstation runs user programs and a Venus process on top of its UNIX kernel, while each server runs a Vice process on top of its UNIX kernel.
Case Study: The Andrew File System (AFS)
Figure: the file name space seen by AFS clients is split into a local part and a shared part under the root (/); local directories such as /bin can be symbolic links into the shared space.
Case Study: The Andrew File System (AFS)
Figure: system call interception in AFS: within a workstation, UNIX file system calls from a user program go to the UNIX kernel; non-local file operations are passed to Venus, while local operations are handled by the UNIX file system on the local disk.
Case Study: The Andrew File System (AFS)
Implementation of file system calls in AFS (user process / UNIX kernel / Venus / Vice):

open(FileName, mode)
  UNIX kernel: if FileName refers to a file in shared file space, pass the request to Venus.
  Venus: check the list of files in the local cache. If the file is not present or there is no valid callback promise, send a request for the file to the Vice server that is the custodian of the volume containing the file.
  Vice: transfer a copy of the file and a callback promise to the workstation; log the callback promise.
  Venus: place the copy of the file in the local file system, enter its local name in the local cache list, and return the local name to UNIX.
  UNIX kernel: open the local file and return the file descriptor to the application.
read(FileDescriptor, Buffer, length)
  UNIX kernel: perform a normal UNIX read operation on the local copy.
write(FileDescriptor, Buffer, length)
  UNIX kernel: perform a normal UNIX write operation on the local copy.
close(FileDescriptor)
  UNIX kernel: close the local copy and notify Venus that the file has been closed.
  Venus: if the local copy has been changed, send a copy to the Vice server that is the custodian of the file.
  Vice: replace the file contents and send a callback to all other clients holding callback promises on the file.
Case Study: The Andrew File System (AFS)
The main components of the Vice service interface:
Fetch(fid) -> attr, data: returns the attributes (status) and, optionally, the contents of the file identified by fid, and records a callback promise on it.
Store(fid, attr, data): updates the attributes and (optionally) the contents of a specified file.
Create() -> fid: creates a new file and records a callback promise on it.
Remove(fid): deletes the specified file.
SetLock(fid, mode): sets a lock on the specified file or directory. The mode of the lock may be shared or exclusive. Locks that are not removed expire after 30 minutes.
ReleaseLock(fid): unlocks the specified file or directory.
RemoveCallback(fid): informs the server that a Venus process has flushed a file from its cache.
BreakCallback(fid): made by a Vice server to a Venus process; it cancels the callback promise on the relevant file.
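A simplified Python sketch of how Venus might use callback promises on open and close, loosely following the steps in the call-implementation table above; the classes and the fetch/store/break_callback method names are hypothetical stand-ins for the Vice interface.

```python
# Hypothetical Venus-side cache keyed by fid; a file is served from the cache
# on open only if a valid callback promise is held, otherwise it is re-fetched.

class Venus:
    def __init__(self, vice):
        self.vice = vice
        self.cache = {}       # fid -> local copy of the file contents
        self.callbacks = {}   # fid -> True if the callback promise is valid

    def open(self, fid):
        if fid not in self.cache or not self.callbacks.get(fid):
            attr, data = self.vice.fetch(fid)   # also records a callback promise
            self.cache[fid] = data
            self.callbacks[fid] = True
        return self.cache[fid]

    def close(self, fid, data, modified):
        if modified:
            self.cache[fid] = data
            self.vice.store(fid, None, data)    # server breaks others' callbacks

    def break_callback(self, fid):
        # Called by the Vice server when another client updates the file.
        self.callbacks[fid] = False
```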
INTRODUCTION
• File replication: high availability is a desirable feature of a good distributed file system, and file replication is the primary mechanism for improving file availability. Replication is a key strategy for improving reliability, fault tolerance, and availability: duplicating files on multiple machines improves both availability and performance.
• Replicated file: a replicated file is a file that has multiple copies, with each copy located on a separate file server. Each copy of the set of copies that comprises a replicated file is referred to as a replica of the replicated file.
Difference between Replication and Caching
Drawback
The main problem with file replication is consistency: when one replica changes, how do the other copies reflect that change?
Multi Copy Update Problem
Maintaining consistency among copies when a replicated file is updated is the major issue for a file system that supports replication of files. Some commonly used approaches to handle this issue are described below:
1. Read-Only Replication
2. Read-Any-Write-All Protocol
3. Available-Copies Protocol
4. Primary-Copy Protocol
5. Quorum-Based Protocol
Multi Copy Update Protocols
1. Read-Only Replication: this approach allows the replication of only immutable files; since immutable files are used only in read-only mode, their replicas never need to be updated. The approach is too restrictive in the sense that it allows the replication of immutable files only.
2. Read-Any-Write-All Protocol: this approach allows the replication of mutable files. A read operation on a replicated file is performed by reading any copy of the file, and a write operation by writing to all copies of the file. Some form of locking has to be used to carry out a write operation: before updating any copy, all copies are locked, then they are updated, and finally the locks are released to complete the write operation. This protocol can be used to implement Unix-like semantics.
The main problem with this approach is that a write operation cannot be performed if any of the servers holding a copy of the replicated file is down at the time of the write operation.
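A toy Python sketch of read-any-write-all under the stated assumptions (all replicas reachable, naive locking); the replica objects and lock handling are invented for illustration.

```python
# Hypothetical read-any-write-all: read from any one replica; a write must
# lock, update, and unlock every replica, so it fails if any server is down.

import random, threading

class Replica:
    def __init__(self):
        self.data = b""
        self.lock = threading.Lock()
        self.up = True

def read_any(replicas):
    return random.choice([r for r in replicas if r.up]).data

def write_all(replicas, data):
    if not all(r.up for r in replicas):
        raise RuntimeError("write impossible: some replica server is down")
    for r in replicas:          # lock all copies before updating any of them
        r.lock.acquire()
    try:
        for r in replicas:
            r.data = data
    finally:
        for r in replicas:
            r.lock.release()
```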
Contd…
3. Available-Copies Protocol: this approach allows write operations to be carried out even when some of the servers holding a copy of the replicated file are down. A read operation is performed by reading any available copy, and a write operation is performed by writing to all available copies. When a server recovers after a failure, it brings itself up to date by copying from other servers before accepting any user request.
4. Primary-Copy Protocol: another simple method to solve the multi-copy update problem is the primary-copy protocol. For each replicated file, one copy is designated as the primary copy and all the others are secondary copies. Read operations can be performed using any copy, primary or secondary, while write operations are directed to the primary copy. Each server holding a secondary copy updates its copy either by receiving notification of changes from the server holding the primary copy or by requesting the updated copy from it.
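A small Python sketch of the primary-copy idea, assuming push-style change notification from the primary to the secondaries; everything here is a hypothetical illustration.

```python
# Hypothetical primary-copy protocol: writes go to the primary, which then
# notifies the secondaries; reads may use any copy.

class Copy:
    def __init__(self):
        self.data = b""

class PrimaryCopyFile:
    def __init__(self, n_secondaries=2):
        self.primary = Copy()
        self.secondaries = [Copy() for _ in range(n_secondaries)]

    def write(self, data):
        self.primary.data = data
        for s in self.secondaries:      # push notification of the change
            s.data = data

    def read(self, i=0):
        return ([self.primary] + self.secondaries)[i].data   # any copy works
```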
Contd…
Drawbacks
The read-any-write-all and available-copies protocols cannot handle the network partition problem, in which the copies of a replicated file are partitioned into two or more active groups. Moreover, the primary-copy protocol is too restrictive in the sense that a write operation cannot be performed if the server holding the primary copy is down.
5. Quorum-Based Protocol: this protocol is capable of handling the network partition problem and can increase the availability of write operations at the expense of read operations.
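The slides stop here; as a hedged illustration of the standard quorum idea (not necessarily the exact variant intended by the author), the sketch below uses version-numbered replicas with a read quorum r and a write quorum w chosen so that r + w > N, which guarantees that every read quorum overlaps every write quorum.

```python
# Hypothetical quorum-based replication: each replica carries a version number;
# a read contacts r replicas and returns the newest value; a write takes the
# highest version among w replicas and installs version+1 on those w replicas.

import random

class Replica:
    def __init__(self):
        self.version = 0
        self.data = b""

def quorum_read(replicas, r):
    sample = random.sample(replicas, r)
    return max(sample, key=lambda x: x.version).data

def quorum_write(replicas, w, data):
    sample = random.sample(replicas, w)
    new_version = max(x.version for x in sample) + 1
    for x in sample:
        x.version, x.data = new_version, data

N = 5
replicas = [Replica() for _ in range(N)]
r, w = 2, 4                      # r + w > N, so read and write quorums overlap
quorum_write(replicas, w, b"v1")
print(quorum_read(replicas, r))  # b'v1' -- guaranteed because the quorums overlap
```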