Distributed File Systems in Unix
What is AFS?
AFS is a distributed file system, pioneered at Carnegie Mellon University and
supported and developed as a product by Transarc Corporation (now IBM Pittsburgh
Labs). It offers a client-server architecture for federated file sharing and replicated
read-only content distribution, providing location independence, scalability, security, and
transparent migration capabilities. AFS is available for a broad range of heterogeneous
systems, including UNIX and Linux.
AFS Design Principles
Lessons learned from AFS, worth considering for file systems and other large
distributed systems:
• Workstations have cycles to burn. Make clients do work whenever possible.
• Cache whenever possible (a client-caching sketch follows this list).
• Exploit file usage properties and understand them; one-third of Unix files are temporary.
• Minimize system-wide knowledge and change. Do not hardwire locations.
• Trust the fewest possible entities. Do not trust workstations.
• Batch operations into groups where possible.
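These principles suggest a client that caches whole files and lets the server call back when a cached copy goes stale. The sketch below is illustrative only; FileServer, CachingClient, fetch, and register_callback are hypothetical names, not AFS's real interfaces.

# Illustrative sketch only; names and interfaces are hypothetical, not AFS's.

class FileServer:
    """Trivial in-memory stand-in for a remote file server."""
    def __init__(self, files):
        self.files = dict(files)
        self.callbacks = {}                    # path -> invalidation callbacks

    def fetch(self, path):
        return self.files[path]

    def register_callback(self, path, fn):
        self.callbacks.setdefault(path, []).append(fn)

    def store(self, path, data):
        self.files[path] = data
        for fn in self.callbacks.pop(path, []):
            fn(path)                           # "break" callbacks on update

class CachingClient:
    """Whole-file client cache: do work locally, contact the server rarely."""
    def __init__(self, server):
        self.server = server
        self.cache = {}                        # path -> whole-file contents
        self.valid = {}                        # path -> cached copy still valid?

    def open(self, path):
        if self.valid.get(path):
            return self.cache[path]            # served entirely from the client
        data = self.server.fetch(path)         # miss or stale: fetch the whole file
        self.cache[path] = data
        self.valid[path] = True
        self.server.register_callback(path, self.invalidate)
        return data

    def invalidate(self, path):
        self.valid[path] = False               # callback break: cached copy is stale

A second client writing through store would break the callback, so the next open on the first client refetches instead of trusting its cache.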
Elephant: The File System that Never Forgets
Motivation: disks and storage are cheap, while information is valuable.
The straightforward idea is to store all (significant) versions of a file without the
need for user intervention.
"All user operations are reversible."
A simple but powerful goal for the system.
A new version of a file is created each time it is written, with similarities to a
log-structured file system.
File versions are referenced by time, and versioning extends to directories.
Per-file and per-file-group policies govern reclaiming file storage.
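The version-per-write idea can be sketched as below; VersionedFile and its methods are hypothetical names, used only to illustrate referencing versions by time.

# Illustrative sketch, not Elephant's actual implementation.
import bisect
import time

class VersionedFile:
    def __init__(self):
        self.versions = []                     # (timestamp, contents), time-ordered

    def write(self, data, now=None):
        # Every write appends a new immutable version (log-structured flavor).
        self.versions.append((now if now is not None else time.time(), data))

    def read(self, at_time=None):
        # No argument: the current version.  Otherwise: the newest version written
        # at or before at_time, which is what makes user operations reversible.
        if not self.versions:
            raise FileNotFoundError("no versions")
        if at_time is None:
            return self.versions[-1][1]
        times = [t for t, _ in self.versions]
        i = bisect.bisect_right(times, at_time)
        if i == 0:
            raise FileNotFoundError("file did not exist at that time")
        return self.versions[i - 1][1]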
What Files to Keep?
The basic idea is to keep landmark (distinguished) file versions and discard the others.
• Keep One: the current situation. Good for unimportant or easily recreated files.
• Keep All: the complete history is maintained.
• Keep Landmarks: how are landmarks determined?
– User-defined landmarks (similar to the check-in idea in RCS) are allowed.
– A heuristic tags other versions as landmarks.
Not all files should be treated the same; for example, object files and source files
have different characteristics.
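Below is a sketch of the three policies over a version history; the policy names follow the notes, while the function and the landmark flag are illustrative.

# Illustrative sketch of per-file retention policies; not Elephant's real code.

def versions_to_keep(versions, policy):
    """versions is a time-ordered list of (timestamp, is_landmark) pairs."""
    if policy == "keep_one":
        return versions[-1:]                   # current version only
    if policy == "keep_all":
        return list(versions)                  # complete history
    if policy == "keep_landmarks":
        kept = [v for v in versions if v[1]]   # user-defined or heuristic landmarks
        if versions and (not kept or kept[-1] != versions[-1]):
            kept.append(versions[-1])          # always keep the current version
        return kept
    raise ValueError("unknown policy: " + policy)

history = [(1, False), (2, True), (3, False), (4, True), (5, False)]
print(versions_to_keep(history, "keep_landmarks"))   # [(2, True), (4, True), (5, False)]

An object file might use keep_one while the corresponding source file uses keep_landmarks, reflecting their different characteristics.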
GFS: Architecture
A single master and multiple chunkservers, as shown in Fig. 1. Each is a commodity
Linux server.
Files are stored as fixed-size 64 MB chunks, kept as Linux files. Each chunk has a
64-bit chunk handle.
By default there are three replicas of each chunk.
The GFS master maintains the file system metadata.
Clients do not cache data, since it is typically not reused, but they do cache metadata.
Large chunk sizes help minimize client interaction with the master (a potential
bottleneck), allow a client to maintain a persistent TCP connection with a chunkserver,
and reduce the amount of metadata at the master.
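The read path implied above can be sketched as follows: translate a byte offset into a chunk index, ask the master for the 64-bit handle and replica locations, then read from a chunkserver. The Master, Chunkserver, and read names are illustrative, not the real GFS API.

# Illustrative sketch only; class and method names are hypothetical.

CHUNK_SIZE = 64 * 1024 * 1024                  # fixed-size 64 MB chunks

class Master:
    """Holds only metadata: (path, chunk index) -> (64-bit handle, replica list)."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table

    def lookup(self, path, chunk_index):
        return self.chunk_table[(path, chunk_index)]

class Chunkserver:
    """Stores chunks (ordinary Linux files in GFS) keyed by chunk handle."""
    def __init__(self, chunks):
        self.chunks = chunks                   # handle -> bytes

    def read_chunk(self, handle):
        return self.chunks[handle]

def read(master, chunkservers, path, offset, length):
    # Assumes the read does not span a chunk boundary.
    chunk_index = offset // CHUNK_SIZE
    handle, replicas = master.lookup(path, chunk_index)   # metadata; client may cache it
    data = chunkservers[replicas[0]].read_chunk(handle)   # data never flows through the master
    start = offset % CHUNK_SIZE
    return data[start:start + length]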
Shark
Motivation: deploying a widely distributed application typically means copying
replicated execution environments to many machines.
Replication has many drawbacks: bandwidth is needed, hard state is required at each
replica, and the replicated run-time environment is not the same as the development
environment.
Shark is designed to support widely distributed applications; it can export a file system.
Scalability comes through a location-aware cooperative cache: a peer-to-peer (p2p)
file system for read sharing. At its heart is a centralized file system like NFS.
Design
Key ideas:
• Once a client retrieves a file, it becomes a replica proxy that serves the file to
other clients.
• Files are stored and retrieved as chunks, and a client can retrieve chunks from
multiple locations.
• A token is assigned to the whole file and to each chunk.
• A Rabin fingerprint algorithm is used to preserve data commonality across chunks;
the idea is that different versions of a file have many chunks in common (a chunking
sketch follows this list).
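The chunking idea can be sketched with content-defined boundaries. A simple polynomial rolling hash stands in for a true Rabin fingerprint here, and the window, mask, and token choices are illustrative; the point is that chunk boundaries depend on content, so file versions that share bytes share chunks (and chunk tokens).

# Content-defined chunking sketch; a rolling hash stands in for a real Rabin
# fingerprint, and all constants are illustrative.
import hashlib

WINDOW = 48                    # bytes in the sliding window
BASE, MOD = 263, 1 << 32       # rolling-hash parameters
MASK = (1 << 13) - 1           # boundary when low 13 bits are 0: ~8 KB average chunks

def chunk(data):
    """Split bytes into chunks whose boundaries are chosen by content."""
    chunks, start, h = [], 0, 0
    pow_w = pow(BASE, WINDOW, MOD)             # weight of the byte leaving the window
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pow_w) % MOD
            if (h & MASK) == 0:                # window hash hits the chosen pattern
                chunks.append(data[start:i + 1])
                start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def tokens(data):
    """One illustrative token per chunk (a whole-file token could be derived similarly)."""
    return [hashlib.sha1(c).hexdigest() for c in chunk(data)]

Two versions of a file that differ only in one region produce mostly identical tokens, so a client fetches only the chunks it lacks, from whichever proxies hold them.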
File Consistency
Uses leases and whole-file caching, à la AFS.
The default lease is 5 minutes, with callbacks.
The entire file must be refetched if it is modified, but the client may not have to
retrieve all of its chunks and can fetch them from client proxies.
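A sketch of this lease behaviour, assuming a hypothetical server object with get_tokens and get_chunk calls (not Shark's real interface): a cached file is served while its lease is fresh; when the lease expires, the client revalidates and refetches only the chunks whose tokens changed, possibly from peer proxies.

# Illustrative sketch; the server interface (get_tokens, get_chunk) is hypothetical.
import time

LEASE_SECONDS = 5 * 60                         # default lease of five minutes

class CachedFile:
    def __init__(self, tokens, chunks, fetched_at):
        self.tokens = tokens                   # ordered chunk tokens for the file
        self.chunks = chunks                   # token -> bytes (from server or peer proxies)
        self.fetched_at = fetched_at

    def lease_valid(self, now=None):
        now = now if now is not None else time.time()
        return (now - self.fetched_at) < LEASE_SECONDS

def read_whole_file(cached, server, path):
    if cached.lease_valid():
        # Lease still fresh (and no callback has arrived): serve from the cache.
        return b"".join(cached.chunks[t] for t in cached.tokens)
    # Lease expired: revalidate.  The file may have changed, but chunks that kept
    # their tokens do not need to be transferred again.
    new_tokens = server.get_tokens(path)
    for t in new_tokens:
        if t not in cached.chunks:
            cached.chunks[t] = server.get_chunk(t)   # or fetch from a client proxy
    cached.tokens, cached.fetched_at = new_tokens, time.time()
    return b"".join(cached.chunks[t] for t in cached.tokens)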