Paper Gfs Summary

The document outlines the design and architecture of a file system called GFS, which is built on commodity components and focuses on handling large files efficiently while ensuring fault tolerance and high performance. It describes the system's interface, metadata management, and operations like data mutations, atomic record appends, and snapshots, emphasizing a single master architecture for simplicity and scalability. Key features include chunk replication, lease management for mutation order, and lazy garbage collection for storage management.


2. DESIGN OVERVIEW
2.1 Assumptions
In designing a file system for our needs, we have been guided by
assumptions that offer both challenges and opportunities. We alluded
to some key observations earlier and now lay out our assumptions in
more detail.
• The system is built from many inexpensive commodity
components that often fail. It must constantly monitor itself,
detect, tolerate, and recover promptly from component failures on a
routine basis.
• The system stores a modest number of large files. We expect a
few million files, each typically 100 MB or larger in size. Multi-
GB files are the common case and should be managed efficiently.
Small files must be supported, but we need not optimize for them.
• The workloads primarily consist of two kinds of reads: large
streaming reads and small random reads.
• In large streaming reads, individual operations typically read
hundreds of KBs, more commonly 1 MB or more. Successive
operations from the same client often read through a
contiguous region of a file.
• A small random read typically reads a few KBs at some
arbitrary offset. Performance-conscious applications often
batch and sort their small reads to advance steadily through
the file rather than go back and forth.
• The workloads also have many large, sequential writes that
append data to files. Once written, files are seldom modified
again.
• The system must efficiently implement well-defined semantics for
multiple clients that concurrently append to the same file.
• High sustained bandwidth is more important than low latency.
Most of our target applications place a premium on processing
data in bulk at a high rate.

2.2 Interface
GFS provides a familiar file system interface, though it does not
implement a standard API such as POSIX. Files are organized
hierarchically in directories and identified by pathnames.
In addition, GFS has snapshot and record append operations:
• Snapshot creates a copy of a file or directory tree at low cost.
• Record append allows multiple clients to append data to the same
file concurrently, with atomicity guaranteed for each append.
These features are valuable for multi-way merge results and
producer-consumer queues in large distributed applications.

2.3 Architecture
A GFS cluster consists of a single master and multiple chunkservers,
accessed by multiple clients.
• Files are divided into fixed-size chunks.
• Each chunk is identified by a globally unique 64-bit chunk handle.
• Chunkservers store chunks as ordinary Linux files on local disks.
• Chunks are replicated for reliability (default: three replicas).
• The master maintains all file system metadata.
• Clients interact with the master for metadata but communicate
directly with chunkservers for reading and writing data.
• Neither clients nor chunkservers cache file data; this simplifies the
design and avoids cache coherence issues.

2.4 Single Master
Having a single master simplifies the design and enables sophisticated
chunk placement decisions using global knowledge. To prevent the
master from becoming a bottleneck:
• Clients never read or write file data through the master.
• Clients cache metadata and interact directly with chunkservers.
• The master periodically communicates with each chunkserver via
HeartBeat messages.

2.5 Chunk Size
• Chunk size = 64 MB (much larger than typical file system block
sizes).
• Large chunks reduce interactions with the master, optimize
network usage, and minimize metadata storage.
• A large chunk size can cause hot spots when many clients access a
single small file.
• To mitigate hot spots, we use higher replication factors and
stagger application start times in batch systems.
These design decisions ensure efficient scalability, fault tolerance,
and high performance for distributed applications using GFS.
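The effect of the fixed chunk size is easy to see in the client's offset
translation: a byte offset maps to a chunk index locally, before the master
is even contacted. A minimal Python sketch (the function name is ours, not
the paper's):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # the fixed 64 MB GFS chunk size

def chunk_index(byte_offset: int) -> int:
    """Translate a byte offset within a file into a chunk index.
    The client computes this locally, then asks the master for the
    chunk handle and replica locations of that index."""
    return byte_offset // CHUNK_SIZE

# A read at offset 200 MB falls in the fourth chunk (index 3).
print(chunk_index(200 * 1024 * 1024))  # -> 3
```

Because the chunk size is large, a client streaming through a multi-GB file
crosses a chunk boundary (and thus needs a new master lookup) only once
every 64 MB.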

2.6 Metadata
The master stores three major types of metadata: the file and chunk
namespaces, the mapping from files to chunks, and the locations of
each chunk’s replicas. All metadata is kept in the master’s memory.
The first two types (namespaces and file-to-chunk mapping) are also
persisted by logging changes to an operation log, which is stored on the
master’s local disk and replicated on remote machines. Using a log
ensures simple, reliable updates and prevents inconsistencies in case of a
master crash. The master does not store chunk location information
persistently. Instead, it queries each chunkserver about its chunks at
startup and whenever a chunkserver joins the cluster.

2.6.1 In-Memory Data Structures
Since metadata is stored in memory, master operations are fast.
Additionally, the master can efficiently scan its entire state in the
background. This periodic scanning helps in chunk garbage collection,
re-replication in case of chunkserver failures, and chunk migration for
load balancing and efficient disk usage.
A concern with this memory-only approach is that the total system
capacity depends on the master’s memory. However, this is not a major
limitation because the master stores less than 64 bytes of metadata
per 64 MB chunk. Since most chunks are full, metadata requirements
remain low. Similarly, file namespace data is stored compactly using
prefix compression, keeping memory usage efficient. If necessary,
adding more memory is a small cost compared to the benefits of
reliability, performance, and simplicity.
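The memory argument above can be checked with back-of-the-envelope
arithmetic; a sketch in Python, using the paper's figure of under 64 bytes
of metadata per chunk:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB per chunk
BYTES_PER_CHUNK_META = 64       # under 64 bytes of master metadata per chunk

def master_memory_bound(total_file_bytes: int) -> int:
    """Upper bound on the master's chunk-metadata memory, assuming
    mostly full chunks (as the paper argues is the common case)."""
    chunks = -(-total_file_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * BYTES_PER_CHUNK_META

# A full petabyte of file data needs under 1 GB of chunk metadata.
print(master_memory_bound(10**15))
```

This is why the in-memory design is not a practical capacity limit: adding
RAM to one machine is cheap relative to the simplicity it buys.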

2.6.2 Chunk Locations


The master does not maintain a persistent record of chunk locations.
Instead, it polls chunkservers at startup to get this information and
keeps itself updated through regular HeartBeat messages.
Initially, GFS tried storing chunk location data persistently at the master,
but this made synchronization between the master and chunkservers
difficult. Instead, querying chunkservers dynamically at startup simplifies
the process and avoids inconsistency issues caused by server failures,
renaming, and restarts.
A key reason for this approach is that the chunkserver itself is the
most accurate source of truth regarding which chunks it holds. Errors
such as disk failures or server renaming can cause chunks to
disappear, so there is no point in maintaining a potentially outdated
master record.

2.6.3 Operation Log


The operation log is crucial in GFS, as it serves as the only persistent
record of metadata and acts as a logical timeline defining the order of
operations.

Since metadata persistence is critical, GFS does not make changes visible
to clients until they are securely written to the log. The log is
replicated across multiple remote machines, and the master only
acknowledges client operations after successfully flushing the log to
both local and remote storage. To improve efficiency, multiple log
records are batched before writing to disk.
During recovery, the master replays the operation log to restore the
file system state. However, to avoid long startup times, the master
checkpoints its state when the log grows too large. This checkpoint is
stored in a compact B-tree format, which allows fast memory mapping
and namespace lookups without extra parsing.
Creating a checkpoint is structured so that ongoing operations are not
delayed. The master switches to a new log file and creates the
checkpoint in a separate thread, ensuring smooth operation. In large
clusters, a checkpoint can be created in about a minute. Once
completed, it is stored both locally and remotely.
For recovery, only the latest checkpoint and subsequent log files are
needed. Older checkpoints and logs can be deleted, although some are
kept as backups for disaster recovery. If a failure occurs during
checkpointing, the system detects and skips the incomplete checkpoint,
ensuring correctness.
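The recovery rule — load the latest complete checkpoint, then replay only
the log records written after it — can be sketched as follows. The record
format here is invented for illustration; the real log and B-tree
checkpoint formats are internal to GFS:

```python
def recover(checkpoint_state: dict, log_after_checkpoint: list) -> dict:
    """Rebuild master state from the latest checkpoint plus the
    operation-log records that follow it, applied in log order."""
    state = dict(checkpoint_state)          # start from the checkpoint
    for op, path, value in log_after_checkpoint:
        if op == "create":
            state[path] = value
        elif op == "delete":
            state.pop(path, None)
    return state

state = recover({"/a": "chunks-of-a"},
                [("create", "/b", "chunks-of-b"), ("delete", "/a", None)])
print(state)  # -> {'/b': 'chunks-of-b'}
```

The shorter the post-checkpoint log, the faster this replay, which is why
the master checkpoints whenever the log grows too large.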

---------------------------------------------------------------------------------

3. System Interactions
The system is designed to minimize the master’s involvement in
operations. This section explains how the client, master, and
chunkservers interact to perform data mutations, atomic record
append, and snapshots.

3.1 Leases and Mutation Order


A mutation is any operation that modifies a chunk’s contents or
metadata, such as a write or append operation. Since each mutation
must be applied to all replicas of a chunk, we use leases to ensure a
consistent mutation order.
The master grants a lease to one replica, called the primary, which
decides the serial order of mutations. All secondary replicas follow this
order. The global mutation order is determined by:

1. The order in which the master grants leases.
2. The serial numbers assigned by the primary within a lease
period.

Lease Mechanism
• A lease starts with a 60-second timeout but can be extended
indefinitely if the chunk is actively modified.
• Lease extensions are piggybacked on regular HeartBeat
messages exchanged between the master and chunkservers.
• The master can revoke a lease early, for example, when renaming
a file.
• If the master loses communication with a primary, it can safely
grant a new lease to another replica after the old lease expires.
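The lease timing rules can be modeled in a few lines. This is a toy sketch
under the assumptions stated above (60-second timeout, extension via
HeartBeat, regrant only after expiry); the class and method names are ours:

```python
LEASE_TIMEOUT = 60.0  # seconds: the initial lease duration

class Lease:
    """Toy model of GFS lease timing (times are seconds as floats)."""
    def __init__(self, granted_at: float):
        self.expires = granted_at + LEASE_TIMEOUT

    def extend(self, now: float):
        # Extensions are piggybacked on HeartBeat messages while the
        # chunk is being actively mutated.
        self.expires = now + LEASE_TIMEOUT

    def can_regrant(self, now: float) -> bool:
        # After losing contact with the primary, the master may grant
        # a new lease only once the old one has expired.
        return now > self.expires

lease = Lease(granted_at=0.0)
print(lease.can_regrant(30.0))   # -> False
lease.extend(50.0)               # heartbeat at t=50 extends expiry to t=110
print(lease.can_regrant(120.0))  # -> True
```

Waiting out the old lease is what makes regranting safe: two primaries can
never be ordering mutations for the same chunk at the same time.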

Mutation Process
The following steps describe how a write operation is processed:
1. Client Request: The client asks the master for the current lease
holder and the locations of the other replicas. If no lease exists, the
master grants one to a replica of its choosing.
2. Master Response: The master sends the primary replica's
identity and the locations of secondary replicas. The client
caches this information for future writes.
3. Data Push: The client sends data to all replicas in any order.
Each chunkserver stores the data in an internal buffer until it is
used or removed. Decoupling data flow from control flow improves
performance.
4. Write Request to Primary: Once all replicas acknowledge data
reception, the client sends a write request to the primary,
identifying the data chunk.
5. Mutation Order Enforcement: The primary assigns serial
numbers to all mutations (from multiple clients) and applies them in
order.
6. Propagation to Secondaries: The primary forwards the mutation
request to all secondary replicas, which apply the changes in the
same order.
7. Completion Acknowledgment: All secondaries report success to
the primary, which then reports back to the client.
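The seven steps above can be condensed into a toy sketch. The classes and
buffering are stand-ins for illustration (real chunkservers hold pushed
data in an internal LRU buffer); only the ordering logic mirrors the paper:

```python
class Chunkserver:
    """Toy replica: buffers pushed data, applies mutations in serial order."""
    def __init__(self):
        self.pending = None
        self.applied = []

    def push(self, data: bytes):
        self.pending = data           # step 3: data held until used

    def apply(self, serial: int):
        self.applied.append((serial, self.pending))

def write(data: bytes, primary: Chunkserver, secondaries: list, serial: int):
    for s in [primary] + secondaries:   # step 3: push data, any order
        s.push(data)
    primary.apply(serial)               # steps 4-5: primary fixes the order
    for s in secondaries:               # step 6: secondaries follow it
        s.apply(serial)
    return "ok"                         # step 7: ack back to the client

p, s1, s2 = Chunkserver(), Chunkserver(), Chunkserver()
write(b"record", p, [s1, s2], serial=1)
print(p.applied == s1.applied == s2.applied)  # -> True
```

The key invariant is that every replica applies mutations in the serial
order chosen by the primary, so all replicas end up identical.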

Handling Errors
• If any replica encounters an error, the primary notifies the client.
• If the mutation fails at the primary, it is not assigned a serial
number and not forwarded, preventing inconsistencies.
• If it fails at some secondary replicas, the file region becomes
inconsistent. The client retries the operation a few times before
restarting the process.

Handling Large Writes


If a write spans multiple chunks, the GFS client splits it into
multiple write operations, each following the process above. However,
concurrent writes from multiple clients may lead to a consistent
but undefined state, meaning the shared file region could contain
fragments from different clients. Despite this, all replicas remain
identical due to consistent mutation ordering.

3.2 Data Flow


To maximize network efficiency, data flow is decoupled from control
flow.
• Control Flow: Moves from the client → primary → secondary
replicas.
• Data Flow: Uses a pipelined chain of chunkservers to
efficiently transfer data.

Goals of Data Flow Design
• Maximize network bandwidth: Each chunkserver forwards data
sequentially in a linear chain, using its full outbound bandwidth
rather than splitting it among multiple recipients.
• Avoid network bottlenecks: Data is sent to the closest available
chunkserver to minimize inter-switch traffic and high-latency
links.
• Reduce latency: Pipelining over TCP ensures that a chunkserver
forwards data immediately upon receiving it.

How Data is Transferred


1. Linear Chain Transmission: The client sends data to the closest
chunkserver (S1).
2. Sequential Forwarding: S1 forwards to the next closest
chunkserver (S2), which then sends it to S3, and so on.
3. Pipelining for Speed: Each chunkserver starts forwarding
immediately instead of waiting for the entire data block.

Estimated Transfer Time


• If B is the number of bytes to transfer, R the number of replicas, T
the network throughput, and L the per-link latency, the ideal transfer
time is B/T + R·L.
• On a 100 Mbps network, transferring 1 MB to multiple replicas
ideally takes about 80 ms (latency aside).
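Plugging in the numbers confirms the 80 ms figure; a sketch (the function
name is ours):

```python
def ideal_transfer_time(b_bytes: float, r_replicas: int,
                        t_bytes_per_sec: float, l_sec: float) -> float:
    """Ideal elapsed time to push B bytes through a pipelined chain of
    R replicas: B/T for the stream itself plus R*L of link latency."""
    return b_bytes / t_bytes_per_sec + r_replicas * l_sec

# 1 MB over 100 Mbps (12.5 MB/s) with negligible latency: ~80 ms.
print(ideal_transfer_time(1_000_000, 3, 12_500_000, 0.0) * 1000)  # -> 80.0
```

Note that the B/T term does not grow with R: pipelining means adding
replicas costs only one extra link latency each, not another full copy.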

3.3 Atomic Record Appends


Record append is an atomic operation in GFS that appends data to a file
without the client specifying an offset.

Difference from Traditional Write


• Traditional Write: Client specifies an offset; concurrent writes can
create mixed data fragments.
• Record Append: Client specifies only the data; GFS
automatically chooses the offset and ensures atomicity.
This approach is similar to O_APPEND mode in Unix but avoids race
conditions when multiple clients write concurrently.

Why Use Record Append?


• Used heavily in distributed applications where multiple clients
append to a shared file.

• Eliminates the need for complex synchronization mechanisms
like distributed locks.
• Useful for multi-producer/single-consumer queues and merged
result logs.

How Record Append Works


1. Data Transfer: The client pushes data to all replicas of the last
chunk of the file.
2. Primary Checks Size:
• If the record fits in the current chunk, the primary appends it at an
offset of its choosing and tells the secondaries to write at the same
offset.
• If appending would push the chunk past 64 MB, the primary pads the
chunk to its end and asks the client to retry on the next chunk.
• To keep worst-case fragmentation low, record append is limited to ¼
of the chunk size.
3. Failure Handling:
• If a failure occurs, the client retries the append.
• This can result in duplicate records across replicas, but GFS
ensures at least one successful write.
• Successful record appends produce a consistent (defined)
region, while failed or partial writes leave an inconsistent
(undefined) region.
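The primary's size check in step 2 can be sketched as follows (function
name and return convention are ours, chosen for illustration):

```python
CHUNK_SIZE = 64 * 1024 * 1024
MAX_RECORD = CHUNK_SIZE // 4   # record appends are capped at 1/4 chunk size

def try_append(bytes_used: int, record_len: int):
    """Primary's decision for a record append: return the chosen offset,
    or None to mean 'chunk was padded, retry on the next chunk'."""
    if record_len > MAX_RECORD:
        raise ValueError("record exceeds the record-append size limit")
    if bytes_used + record_len <= CHUNK_SIZE:
        return bytes_used          # append at the current end of the chunk
    return None                    # pad to chunk end; client retries

print(try_append(0, 1024))                 # -> 0
print(try_append(CHUNK_SIZE - 100, 1024))  # -> None
```

Capping records at ¼ of the chunk size bounds how much space the padding
in the overflow case can waste.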

3.4 Snapshot
The snapshot operation creates an instant copy of a file or directory
without interrupting ongoing writes.

Why Use Snapshots?


• Allows users to create branch copies of large datasets.
• Used for checkpointing before experiments, enabling easy
rollback.

How Snapshot Works


1. Revoking Leases: The master revokes existing leases on the
file’s chunks to prevent ongoing writes.
2. Logging the Snapshot: The master records the operation and
duplicates metadata in memory.

3. Metadata Duplication: The new snapshot points to the same
chunks as the original file.

Handling Writes After Snapshot


• If a client attempts to write to a chunk after a snapshot, the
master creates a new chunk (C’).
• This new chunk is placed on the same chunkservers as the
original to avoid network transfer (local disk copies are faster).
• After creation, the master grants a lease on C’, and the client
continues writing normally, unaware of the snapshot.
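This copy-on-write behavior can be sketched with chunk reference counts
(the paper's snapshot implementation bumps reference counts on the
snapshotted chunks; the dict representation and handle naming here are
invented for illustration):

```python
def write_after_snapshot(chunk_data: dict, refcount: dict, handle: str) -> str:
    """If a snapshot still references chunk C (refcount > 1), duplicate
    it to C' before writing; otherwise write in place."""
    if refcount.get(handle, 1) > 1:
        new_handle = handle + "'"                    # the paper's C -> C'
        chunk_data[new_handle] = chunk_data[handle]  # local disk copy, no
        refcount[handle] -= 1                        # network transfer
        refcount[new_handle] = 1
        return new_handle   # master then grants a lease on the new chunk
    return handle

data, refs = {"C": b"bytes"}, {"C": 2}   # one snapshot plus the live file
print(write_after_snapshot(data, refs, "C"))  # -> C'
```

The copy is deferred until a write actually happens, which is what makes
taking the snapshot itself nearly instantaneous.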

---------------------------------------------------------------------------------

4. Master Operation
The master executes all namespace operations. Additionally, it manages
chunk replicas throughout the system by making placement decisions,
creating new chunks (and replicas), and coordinating system-wide
activities to maintain replication, balance load, and reclaim unused
storage.

4.1 Namespace Management and Locking


Some master operations take a long time; a snapshot, for example, must
revoke chunkserver leases on all the chunks it covers. To avoid delaying
other requests meanwhile, the master allows multiple operations to be
active at once and uses locks over regions of the namespace to ensure
proper serialization.
GFS does not use traditional per-directory data structures; instead, it
maps full pathnames to metadata in a lookup table with prefix
compression for efficient storage. Each namespace node (file or directory)
has a read-write lock. Operations acquire a set of locks before execution:
read-locks for directory names and a read or write lock for the full
pathname.
This locking mechanism prevents conflicts, such as creating a file while its
parent directory is being snapshotted. Since directories are logical
constructs without inode-like structures, only read-locks are needed for
protection against deletion. This approach allows concurrent mutations in
the same directory while ensuring serialization for conflicting operations.
Locks are allocated dynamically and ordered consistently to prevent
deadlocks.
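The lock set for an operation can be sketched directly from the rule
above: read-locks on every ancestor directory name, plus a read or write
lock on the full pathname (function name is ours):

```python
def locks_for(pathname: str, write: bool) -> list:
    """Lock set for an operation on `pathname`: read-locks on all
    ancestor directories plus a read or write lock on the full path.
    In the real system, locks are acquired in a consistent total
    order to prevent deadlock."""
    parts = pathname.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf_kind = "write" if write else "read"
    return [(p, "read") for p in ancestors] + [(pathname, leaf_kind)]

# Creating /home/user/foo write-locks only the full path and read-locks
# /home and /home/user, so other files in the same directory can be
# created concurrently.
print(locks_for("/home/user/foo", write=True))
```

A snapshot of /home, by contrast, would take a write lock covering
/home/user, which conflicts with the read-lock above; that conflict is
exactly how file creation during a snapshot is prevented.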

4.2 Replica Placement
A GFS cluster is highly distributed, with chunkservers spread across
multiple racks. Since inter-rack bandwidth may be limited, replica
placement must ensure scalability, reliability, and availability.
The placement policy:
• Maximizes data reliability by spreading replicas across machines and
racks.
• Ensures availability even if an entire rack fails.
• Balances read traffic across multiple racks while accepting trade-offs
for write operations.

4.3 Creation, Re-replication, Rebalancing


Replica creation occurs for three reasons:
1. Chunk Creation - The master places new replicas based on disk
utilization, recent chunk creations, and rack distribution.
2. Re-replication - If the number of replicas drops below the required
level due to failures or corruption, the master prioritizes chunks
needing replication.
3. Rebalancing - The master periodically redistributes replicas to
balance disk space and load while gradually filling new chunkservers.
Re-replication clones data from existing replicas while throttling
bandwidth to prevent network congestion.

4.4 Garbage Collection


GFS does not immediately reclaim storage after file deletion; instead, it
employs lazy garbage collection:

4.4.1 Mechanism
• Deleted files are renamed with timestamps and removed after a
configurable period (e.g., three days).
• Hidden files can be read and restored until final deletion.
• The master removes orphaned chunks by comparing namespace
metadata with chunkserver reports.
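The rename-with-timestamp step can be sketched as follows. The hidden-name
scheme here is invented for illustration; the paper only says the file is
renamed to a hidden name that includes the deletion timestamp:

```python
def soft_delete(namespace: dict, path: str, now: int) -> str:
    """Lazy deletion: rename the file to a hidden name carrying the
    deletion timestamp. Actual reclamation happens days later during
    the master's regular background scan of the namespace."""
    hidden = f"{path}.deleted.{now}"
    namespace[hidden] = namespace.pop(path)
    return hidden

ns = {"/logs/a": "chunk-list"}
print(soft_delete(ns, "/logs/a", now=1700000000))
# -> /logs/a.deleted.1700000000 ; still readable until final removal
```

Until the grace period expires, the file can be read under its hidden name
or restored simply by renaming it back.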

4.4.2 Discussion
Garbage collection is preferred over immediate deletion as it:
• Simplifies failure recovery by ensuring all references are tracked.
• Integrates into background tasks, reducing overhead.
• Provides a safety net against accidental deletions.
A drawback is the delayed storage reclamation, which can be mitigated
by expediting the process when necessary.

4.5 Stale Replica Detection


Chunk replicas become stale when a chunkserver misses updates. The
master tracks chunk version numbers to identify stale replicas.
When granting a new lease, the master increments the chunk version and
updates valid replicas before allowing writes. Any chunkserver reporting
an outdated version is flagged, and stale replicas are ignored in client
responses and eventually deleted.
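The version check reduces to a simple filter on the versions that
chunkservers report (function name is ours):

```python
def live_replicas(master_version: int, reported: dict) -> list:
    """Keep only replicas whose chunk version matches the master's
    current version; stale ones are excluded from client replies and
    removed later by garbage collection."""
    return [server for server, v in reported.items() if v == master_version]

# cs2 was down when the lease for version 7 was granted, so it is stale.
print(live_replicas(7, {"cs1": 7, "cs2": 6, "cs3": 7}))  # -> ['cs1', 'cs3']
```

Because the version is bumped on every new lease grant, any replica that
missed a round of mutations is guaranteed to report an older number.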
---------------------------------------------------------------------------------

5. Fault Tolerance And Diagnosis


5.1 High Availability
5.1.1 Fast Recovery
• Both master and chunkserver can restart within seconds.
• Servers do not differentiate between normal and abnormal
termination.
• Clients and other servers handle interruptions by timing out and
reconnecting.

5.1.2 Chunk Replication


• Each chunk is stored on multiple chunkservers across different racks.
• Default replication level is three, but users can adjust as needed.
• Replication helps maintain availability despite server failures.

• Exploring alternatives like parity and erasure codes for storage
efficiency.

5.1.3 Master Replication


• Master state is replicated for reliability.
• Operation logs and checkpoints are stored across multiple machines.
• State mutations are committed only after log records are safely
stored.
• A primary master handles updates, while "shadow" masters provide
read-only access.
• Shadow masters slightly lag behind the primary but ensure
availability.

5.2 Data Integrity


• Chunkservers use checksumming to detect data corruption.
• Each chunk is divided into 64 KB blocks, with each having a 32-bit
checksum.
• Checksums are stored separately from user data and verified before
reads.
• If corruption is detected, the chunkserver reports it, and a new
replica is created.
• Overhead is minimized by aligning reads with checksum block
boundaries.
• Background scans detect corruption in rarely accessed chunks.
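The per-block checksum scheme can be sketched in Python. CRC32 is used
here as a stand-in 32-bit checksum; the paper does not name the specific
checksum function:

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity: 64 KB blocks within a chunk

def checksum_blocks(chunk: bytes) -> list:
    """One 32-bit checksum per 64 KB block of the chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK])
            for i in range(0, len(chunk), BLOCK)]

def verify_read(chunk: bytes, stored: list) -> bool:
    """Before returning data, the chunkserver re-checksums the blocks
    covered by the read and compares against the stored values."""
    return checksum_blocks(chunk) == stored

data = b"x" * (2 * BLOCK)
sums = checksum_blocks(data)
print(verify_read(data, sums))              # -> True
print(verify_read(data[:-1] + b"y", sums))  # -> False
```

Verifying per 64 KB block rather than per 64 MB chunk is what lets small
aligned reads pay only a small checksum cost.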

5.3 Diagnostic Tools


• Extensive logging aids debugging, problem isolation, and
performance analysis.
• Logs record significant events (e.g., chunkserver status changes) and
RPC interactions.
• Logs are written asynchronously to minimize performance impact.
• Logs can be deleted without affecting system correctness.
• Recent events are stored in memory for real-time monitoring.
