GFS Paper Summary
2. Design Overview
2.1 Assumptions
In designing a file system for our needs, we have been guided by
assumptions that offer both challenges and opportunities. We alluded
to some key observations earlier and now lay out our assumptions in
more detail.
• The system is built from many inexpensive commodity components that
often fail. It must constantly monitor itself and detect, tolerate, and
recover promptly from component failures on a routine basis.
• The system stores a modest number of large files. We expect a
few million files, each typically 100 MB or larger in size. Multi-
GB files are the common case and should be managed efficiently.
Small files must be supported, but we need not optimize for them.
• The workloads primarily consist of two kinds of reads: large
streaming reads and small random reads.
• In large streaming reads, individual operations typically read
hundreds of KBs, more commonly 1 MB or more. Successive
operations from the same client often read through a
contiguous region of a file.
• A small random read typically reads a few KBs at some
arbitrary offset. Performance-conscious applications often
batch and sort their small reads to advance steadily through
the file rather than go back and forth.
• The workloads also have many large, sequential writes that
append data to files. Once written, files are seldom modified
again.
• The system must efficiently implement well-defined semantics for
multiple clients that concurrently append to the same file.
• High sustained bandwidth is more important than low latency.
Most of our target applications place a premium on processing
data in bulk at a high rate.
2.2 Interface
GFS provides a familiar file system interface, though it does not
implement a standard API such as POSIX. Files are organized
hierarchically in directories and identified by pathnames. GFS supports
the usual operations to create, delete, open, close, read, and write
files.
Moreover, GFS has snapshot and record append operations:
• Snapshot creates a copy of a file or directory tree at low cost.
• Record append allows multiple clients to append data to the same file
concurrently while guaranteeing the atomicity of each client's append.
2.3 Architecture
A GFS cluster consists of a single master and multiple chunkservers,
accessed by multiple clients.
• Files are divided into fixed-size chunks, each identified by an
immutable, globally unique 64-bit chunk handle assigned by the master at
creation time.
• Chunkservers store chunks as Linux files on local disks; for
reliability, each chunk is replicated on multiple chunkservers (three
replicas by default).
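To make the division concrete, here is a minimal Python sketch of how a
client-side library might translate an application's byte range into
per-chunk requests, assuming the 64 MB chunk size from 2.5; the real
client library and its master/chunkserver RPCs are not shown.

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, see 2.5

    def chunk_spans(offset, length, chunk_size=CHUNK_SIZE):
        """Translate a (file offset, length) read into (chunk index, offset
        within chunk, bytes) spans. For each chunk index the client asks the
        master for the chunk handle and replica locations (and caches the
        reply), then reads the bytes directly from one of the chunkservers."""
        spans = []
        end = offset + length
        while offset < end:
            index = offset // chunk_size
            within = offset % chunk_size
            n = min(chunk_size - within, end - offset)
            spans.append((index, within, n))
            offset += n
        return spans

    # A 100 MB read starting at byte 0 touches chunk 0 and part of chunk 1:
    print(chunk_spans(0, 100 * 1024 * 1024))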
2.5 Chunk Size
• Chunk size = 64 MB (much larger than typical file system block
sizes).
• Large chunks reduce interactions with the master, optimize
network usage, and minimize metadata storage.
• A large chunk size can cause hot spots when many clients access a
single small file.
• To mitigate hot spots, we use higher replication factors and
stagger application start times in batch systems.
These design decisions ensure efficient scalability, fault tolerance,
and high performance for distributed applications using GFS.
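A rough, illustrative calculation of the first benefit; the 1 MB figure
is chosen only for comparison and is not from the paper.

    # Master lookups needed to stream through a 1 TB file, assuming one
    # lookup per chunk and ignoring client-side caching of replica locations.
    TB = 1024 ** 4
    for chunk_size_mb in (1, 64):
        lookups = TB // (chunk_size_mb * 1024 ** 2)
        print(f"{chunk_size_mb:>2} MB chunks -> {lookups:,} master lookups")
    # 1 MB chunks need 1,048,576 lookups; 64 MB chunks need only 16,384.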
2.6 Metadata
The master stores three major types of metadata: the file and chunk
namespaces, the mapping from files to chunks, and the locations of
each chunk’s replicas. All metadata is kept in the master’s memory.
The first two types (namespaces and file-to-chunk mapping) are also
persisted by logging changes to an operation log, which is stored on the
master’s local disk and replicated on remote machines. Using a log
ensures simple, reliable updates and prevents inconsistencies in case of a
master crash. The master does not store chunk location information
persistently. Instead, it queries each chunkserver about its chunks at
startup and whenever a chunkserver joins the cluster.
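The paper does not give concrete data structures, but the three types of
metadata can be pictured roughly as follows; this is a toy Python sketch
and the field names are illustrative.

    from dataclasses import dataclass, field

    @dataclass
    class MasterMetadata:
        # 1. File and chunk namespaces (persisted via the operation log).
        namespace: set = field(default_factory=set)          # {"/foo", "/foo/bar", ...}
        # 2. File -> ordered list of chunk handles (persisted via the operation log).
        file_to_chunks: dict = field(default_factory=dict)   # path -> [handle, ...]
        # 3. Chunk handle -> replica locations (not persisted; rebuilt by polling
        #    chunkservers at startup and whenever one joins the cluster).
        chunk_locations: dict = field(default_factory=dict)  # handle -> [chunkserver, ...]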
2.6.1 In-Memory Data Structures
Since metadata is stored in memory, master operations are fast.
Additionally, the master can efficiently scan its entire state in the
background. This periodic scanning helps in chunk garbage collection,
re-replication in case of chunkserver failures, and chunk migration for
load balancing and efficient disk usage.
A concern with this memory-only approach is that the total system
capacity depends on the master’s memory. However, this is not a major
limitation because the master stores less than 64 bytes of metadata
per 64 MB chunk. Since most chunks are full, metadata requirements
remain low. Similarly, file namespace data is stored compactly using
prefix compression, keeping memory usage efficient. If necessary,
adding more memory is a small cost compared to the benefits of
reliability, performance, and simplicity.
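A quick check of that claim: tracking 1 PB of mostly full chunks costs
the master roughly 1 GB of memory.

    # 64 bytes of metadata per 64 MB chunk, applied to 1 PB of stored data.
    PB = 1024 ** 5
    chunks = PB // (64 * 1024 ** 2)        # 16,777,216 chunks
    print(chunks * 64 / 1024 ** 3)         # 1.0  (about 1 GB of master memory)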
Since metadata persistence is critical, GFS does not make changes visible
to clients until they are securely written to the log. The log is
replicated across multiple remote machines, and the master only
acknowledges client operations after successfully flushing the log to
both local and remote storage. To improve efficiency, multiple log
records are batched before writing to disk.
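A minimal sketch of that discipline (batch records, flush locally and
remotely, only then acknowledge); the record format and the remote
replication callable are stand-ins, not the real log protocol.

    import threading

    class OperationLog:
        def __init__(self, local_file, replicate_remote):
            self.local = local_file                   # opened in binary append mode
            self.replicate_remote = replicate_remote  # callable: send batch to remote logs
            self.pending = []
            self.lock = threading.Lock()

        def append(self, record: bytes):
            """Queue a metadata mutation; it is not yet visible to clients."""
            with self.lock:
                self.pending.append(record)

        def flush(self):
            """Write all pending records locally and remotely in one batch.
            Only after this returns may the operations be acknowledged and
            their metadata changes made visible to clients."""
            with self.lock:
                batch, self.pending = self.pending, []
            for record in batch:
                self.local.write(record + b"\n")
            self.local.flush()
            if batch:
                self.replicate_remote(batch)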
During recovery, the master replays the operation log to restore the
file system state. However, to avoid long startup times, the master
checkpoints its state when the log grows too large. This checkpoint is
stored in a compact B-tree format, which allows fast memory mapping
and namespace lookups without extra parsing.
Creating a checkpoint is structured so that ongoing operations are not
delayed. The master switches to a new log file and creates the
checkpoint in a separate thread, ensuring smooth operation. In large
clusters, a checkpoint can be created in about a minute. Once
completed, it is stored both locally and remotely.
For recovery, only the latest checkpoint and subsequent log files are
needed. Older checkpoints and logs can be deleted, although some are
kept as backups for disaster recovery. If a failure occurs during
checkpointing, the system detects and skips the incomplete checkpoint,
ensuring correctness.
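A toy sketch of the non-blocking checkpoint sequence described above
(switch logs, then write the checkpoint from a background thread); JSON
stands in for the paper's compact B-tree format, and the remote copy is
omitted.

    import json, threading

    class CheckpointingMaster:
        def __init__(self):
            self.state = {}          # stands in for namespaces and file->chunk maps
            self.log_seq = 0         # index of the log file receiving new mutations
            self.lock = threading.Lock()

        def checkpoint(self):
            with self.lock:
                self.log_seq += 1            # later mutations go to a new log file
                frozen = dict(self.state)    # state covered by the older log files
                upto = self.log_seq - 1
            threading.Thread(target=self._write, args=(frozen, upto)).start()

        def _write(self, frozen, upto):
            # Written in the background so ongoing operations are not delayed;
            # it would also be copied to remote machines before older checkpoints
            # and log files are garbage collected.
            with open(f"checkpoint.{upto}.json", "w") as f:
                json.dump(frozen, f)
            # Recovery: load the newest complete checkpoint, replay logs > upto.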
---------------------------------------------------------------------------------
3. System Interactions
The system is designed to minimize the master’s involvement in
operations. This section explains how the client, master, and
chunkservers interact to perform data mutations, atomic record
append, and snapshots.
The master grants a chunk lease to one of the replicas, called the
primary, and the primary picks a serial order for all mutations to the
chunk. The global mutation order is therefore defined by:
1. The order in which the master grants leases.
2. The serial numbers assigned by the primary within a lease period.
Lease Mechanism
• A lease starts with a 60-second timeout but can be extended
indefinitely if the chunk is actively modified.
• Lease extensions are piggybacked on regular HeartBeat
messages exchanged between the master and chunkservers.
• The master can revoke a lease early, for example, when renaming
a file.
• If the master loses communication with a primary, it can safely
grant a new lease to another replica after the old lease expires.
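The bookkeeping can be pictured with a small sketch; the timeouts and
methods are illustrative stand-ins for state kept by the master, not the
actual implementation.

    import time

    LEASE_SECONDS = 60

    class Lease:
        def __init__(self, chunk_handle, primary):
            self.chunk_handle = chunk_handle
            self.primary = primary                     # replica acting as primary
            self.expires = time.monotonic() + LEASE_SECONDS
            self.revoked = False

        def valid(self):
            return not self.revoked and time.monotonic() < self.expires

        def extend(self):
            # Requested and granted via the regular HeartBeat messages.
            if self.valid():
                self.expires = time.monotonic() + LEASE_SECONDS

        def revoke(self):
            # e.g. to disable mutations on a file that is being renamed.
            self.revoked = True

    def may_grant_new_primary(lease):
        """Even if the master cannot reach the primary, it grants a new lease
        for the chunk only after the old one has expired or been revoked."""
        return not lease.valid()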
Mutation Process
The following steps describe how a write operation is processed:
1. Client Request: The client asks the master for the current lease
holder and the locations of the other replicas. If no replica currently
holds a lease, the master grants one to a replica it chooses.
2. Master Response: The master sends the primary replica's
identity and the locations of secondary replicas. The client
caches this information for future writes.
3. Data Push: The client sends data to all replicas in any order.
Each chunkserver stores the data in an internal buffer until it is
used or removed. Decoupling data flow from control flow improves
performance.
4. Write Request to Primary: Once all replicas acknowledge receiving the
data, the client sends a write request to the primary that identifies
the data pushed earlier.
5. Mutation Order Enforcement: The primary assigns serial
numbers to all mutations (from multiple clients) and applies them in
order.
6. Propagation to Secondaries: The primary forwards the mutation
request to all secondary replicas, which apply the changes in the
same order.
7. Completion Acknowledgment: All secondaries report success to
the primary, which then reports back to the client.
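Steps 3 through 7 can be summarized in a short sketch; the replica
objects and their methods (buffer_data, apply, apply_in_order) are
illustrative stand-ins, not the real RPC interface.

    class RetryWrite(Exception):
        """The client should retry steps 3-7 (and eventually the whole write)."""

    def push_data(replicas, data):
        # Step 3: each replica buffers the data; in practice the bytes travel
        # along a chain of chunkservers rather than directly from the client.
        return [r.buffer_data(data) for r in replicas]

    def write(primary, secondaries, data):
        """Assumes steps 1-2 are done: the client already knows (and cached)
        the primary and secondary replica locations from the master."""
        push_data([primary] + secondaries, data)                    # step 3
        serial = primary.apply(data)                                # steps 4-5
        ok = [s.apply_in_order(data, serial) for s in secondaries]  # step 6
        if not all(ok):                                             # step 7
            raise RetryWrite("a secondary failed; region left inconsistent")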
Handling Errors
• If any replica encounters an error, the primary notifies the client.
• If the mutation fails at the primary, it is not assigned a serial
number and not forwarded, preventing inconsistencies.
• If it fails at some secondary replicas, the modified file region is
left inconsistent. The client retries the failed steps a few times
before falling back to retrying the write from the beginning.
Goals of Data Flow Design
• Maximize network bandwidth: Each chunkserver forwards data
sequentially in a linear chain, using its full outbound bandwidth
rather than splitting it among multiple recipients.
• Avoid network bottlenecks: Data is sent to the closest available
chunkserver to minimize inter-switch traffic and high-latency
links.
• Reduce latency: Pipelining over TCP ensures that a chunkserver
forwards data immediately upon receiving it.
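The paper estimates the ideal time to push B bytes through a chain of R
replicas as B/T + RL, where T is the per-link throughput and L is the
per-hop latency; with 100 Mbps links and about 1 ms per hop, 1 MB
reaches 3 replicas in roughly 80 ms. A small calculation:

    def chain_push_time(num_bytes, replicas, link_bytes_per_s=100e6 / 8, hop_latency_s=1e-3):
        """Ideal pipelined push time B/T + R*L. The 100 Mbps link and ~1 ms hop
        are assumed round numbers; the paper notes L is well under 1 ms."""
        return num_bytes / link_bytes_per_s + replicas * hop_latency_s

    print(chain_push_time(1e6, 3))   # ~0.083 s: about 80 ms for 1 MB to 3 replicas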
3.3 Atomic Record Appends
Record append lets many clients append to the same file concurrently;
the client supplies only the data, and GFS appends it atomically at
least once at an offset of GFS's choosing, which is returned to the
client. This:
• Eliminates the need for complex synchronization mechanisms like
distributed locks.
• Is useful for multi-producer/single-consumer queues and files that
merge results from many clients.
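A toy sketch of the primary's side of a record append, assuming the
chunk is modeled as a byte buffer with a fill level; the real
chunkserver state and the forwarding to secondaries are omitted.

    CHUNK_SIZE = 64 * 1024 * 1024

    def new_chunk():
        return {"used": 0, "data": bytearray()}

    def primary_record_append(chunk, record: bytes):
        """If the record would overflow the current chunk, pad the chunk and
        ask the client to retry on the next chunk; otherwise append at an
        offset the primary chooses, tell the secondaries to write at the same
        offset, and return that offset to the client."""
        if chunk["used"] + len(record) > CHUNK_SIZE:
            chunk["used"] = CHUNK_SIZE     # pad; client retries on the next chunk
            return None
        offset = chunk["used"]
        chunk["data"] += record
        chunk["used"] += len(record)
        return offset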
3.4 Snapshot
The snapshot operation creates a copy of a file or directory tree almost
instantaneously, while minimizing interruptions to ongoing mutations.
1. Lease Revocation: The master first revokes any outstanding leases on
the chunks of the files about to be snapshotted, so that subsequent
writes must contact the master, which can then create a new copy of the
chunk first.
2. Operation Logging: Once the leases are revoked or expired, the master
logs the snapshot operation to disk.
3. Metadata Duplication: The master duplicates the in-memory metadata;
the new snapshot points to the same chunks as the original file.
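Metadata duplication plus copy-on-write can be sketched with plain
dictionaries; the reference-count map and the chunk-copy callback are
simplified stand-ins for what the master and chunkservers actually do.

    def snapshot(file_to_chunks, refcount, src, dst):
        """Duplicate the metadata: the snapshot shares the source's chunks and
        each shared chunk's reference count goes up."""
        file_to_chunks[dst] = list(file_to_chunks[src])
        for handle in file_to_chunks[dst]:
            refcount[handle] = refcount.get(handle, 1) + 1

    def chunk_for_write(file_to_chunks, refcount, path, index, copy_chunk):
        """Copy-on-write: the first write to a shared chunk makes the
        chunkservers copy it locally and switches the file to a new handle."""
        handle = file_to_chunks[path][index]
        if refcount.get(handle, 1) > 1:
            new_handle = copy_chunk(handle)    # local copy on each chunkserver
            refcount[handle] -= 1
            refcount[new_handle] = 1
            file_to_chunks[path][index] = new_handle
            handle = new_handle
        return handle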
---------------------------------------------------------------------------------
4. Master Operation
The master executes all namespace operations. Additionally, it manages
chunk replicas throughout the system by making placement decisions,
creating new chunks (and replicas), and coordinating system-wide
activities to maintain replication, balance load, and reclaim unused
storage.
4.2 Replica Placement
A GFS cluster is highly distributed, with chunkservers spread across
multiple racks. Since inter-rack bandwidth may be limited, replica
placement must ensure scalability, reliability, and availability.
The placement policy:
• Maximizes data reliability by spreading replicas across machines and
racks.
• Ensures availability even if an entire rack fails.
• Exploits the aggregate read bandwidth of multiple racks, accepting
that write traffic must flow across racks as a deliberate trade-off.
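A toy version of a rack-aware chooser; the real policy also weighs disk
utilization, recent creation counts, and replication priorities, none of
which appear in this sketch.

    import random

    def place_replicas(chunkservers, n=3):
        """chunkservers: list of (server_id, rack_id). Prefer distinct racks
        first, then fill up with distinct servers from any rack."""
        chosen, used_servers, used_racks = [], set(), set()
        for prefer_new_rack in (True, False):
            for server, rack in random.sample(chunkservers, len(chunkservers)):
                if len(chosen) == n:
                    return chosen
                if server in used_servers:
                    continue
                if prefer_new_rack and rack in used_racks:
                    continue
                chosen.append((server, rack))
                used_servers.add(server)
                used_racks.add(rack)
        return chosen

    # Example: three replicas spread over servers on at least two racks.
    servers = [("cs1", "rackA"), ("cs2", "rackA"), ("cs3", "rackB"), ("cs4", "rackC")]
    print(place_replicas(servers))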
4.4 Garbage Collection
After a file is deleted, GFS does not reclaim the physical storage
immediately; it does so lazily during regular garbage collection at both
the file and chunk levels.
4.4.1 Mechanism
• Deleted files are renamed with timestamps and removed after a
configurable period (e.g., three days).
• Hidden files can be read and restored until final deletion.
• The master removes orphaned chunks by comparing namespace
metadata with chunkserver reports.
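The two checks can be sketched as follows, assuming a file-to-chunks map
like the one sketched in 2.6 and a hidden-name convention that encodes
the deletion time; the actual naming scheme and scan are internal to GFS.

    import time

    HIDDEN_TTL = 3 * 24 * 3600        # e.g. three days; configurable

    def lazy_delete(file_to_chunks, path, now=None):
        """Deletion just renames the file to a hidden name with a timestamp;
        the regular namespace scan removes it after HIDDEN_TTL."""
        now = int(now if now is not None else time.time())
        hidden = f"{path}.deleted.{now}"
        file_to_chunks[hidden] = file_to_chunks.pop(path)
        return hidden

    def expired(hidden_path, now=None):
        """True once a hidden (deleted) file is old enough to remove for real."""
        now = now if now is not None else time.time()
        deleted_at = int(hidden_path.rsplit(".", 1)[1])
        return now - deleted_at > HIDDEN_TTL

    def orphaned_chunks(file_to_chunks, reported_handles):
        """During the HeartBeat exchange the chunkserver reports the handles it
        holds; the master replies with those no longer referenced by any file,
        and the chunkserver is then free to delete them."""
        live = {h for handles in file_to_chunks.values() for h in handles}
        return [h for h in reported_handles if h not in live]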
4.4.2 Discussion
Garbage collection is preferred over immediate deletion as it:
• Simplifies failure recovery by ensuring all references are tracked.
• Integrates into background tasks, reducing overhead.
• Provides a safety net against accidental deletions.
A drawback is the delayed storage reclamation, which can be mitigated
by expediting the process when necessary.
Future directions mentioned in the paper include:
• Exploring alternatives like parity and erasure codes for storage
efficiency.