GFS Summary

GFS is a scalable and fault-tolerant file system designed to meet Google's storage needs, characterized by large file sizes, high throughput, and a master-slave architecture. It employs relaxed consistency, automatic recovery mechanisms, and chunk replication to ensure data integrity and availability. GFS effectively supports large-scale applications by managing petabytes of data across numerous machines.


1. Introduction
GFS was created to meet Google’s unique storage needs, where conventional file systems proved
inefficient. The key characteristics that shaped its design include:
• Component Failures as the Norm: Hardware failures are frequent and must be managed
transparently.
• Large File Sizes: Most files are multi-gigabyte in size.
• Workload Characteristics: Workloads consist of large streaming reads, frequent appends,
and minimal random writes.
• High Throughput: The system prioritizes throughput over low latency.

2. Design Overview
GFS follows a master-slave architecture, where:
• Master Server: Maintains metadata and manages file system operations.
• Chunkservers: Store actual file data in fixed-size chunks (typically 64 MB) and replicate
them.
• Clients: Interact with the master for metadata and communicate directly with chunkservers
for data operations.
Key design decisions include:
• File Mutability: Files are mostly appended, not modified in place.
• Relaxed Consistency Model: The system ensures high availability but does not strictly enforce consistency across replicas.
• Automatic Recovery Mechanisms: Self-healing replication and rebalancing of chunks
across chunkservers.

3. Architecture
3.1 Master Server
• Stores namespace, file-to-chunk mapping, and chunk metadata.
• Keeps all metadata in memory for fast access.
• Logs changes persistently and periodically checkpoints the state.
• Assigns and reassigns chunks to chunkservers dynamically.
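
To make these in-memory structures concrete, here is a minimal sketch in Python; the class and field names (MasterMetadata, ChunkInfo, op_log) are illustrative assumptions, not names from the paper:

import threading

class ChunkInfo:
    """Per-chunk metadata tracked by the master (fields are illustrative)."""
    def __init__(self, handle, version):
        self.handle = handle        # unique 64-bit chunk handle
        self.version = version      # version number used to detect stale replicas
        self.locations = set()      # chunkservers currently holding a replica

class MasterMetadata:
    """All metadata lives in memory; every mutation is also appended to an operation log."""
    def __init__(self):
        self.lock = threading.Lock()
        self.namespace = set()      # full path names of files and directories
        self.file_chunks = {}       # file path -> ordered list of chunk handles
        self.chunks = {}            # chunk handle -> ChunkInfo
        self.next_handle = 1

    def create_file(self, path, op_log):
        with self.lock:
            self.namespace.add(path)
            self.file_chunks[path] = []
            op_log.append(("create", path))        # logged persistently before replying

    def add_chunk(self, path, op_log):
        with self.lock:
            handle = self.next_handle
            self.next_handle += 1
            self.chunks[handle] = ChunkInfo(handle, version=1)
            self.file_chunks[path].append(handle)
            op_log.append(("add_chunk", path, handle))
            return handle

# Example: creating a file and allocating its first chunk.
log = []
meta = MasterMetadata()
meta.create_file("/logs/app.log", log)
handle = meta.add_chunk("/logs/app.log", log)
assert meta.file_chunks["/logs/app.log"] == [handle]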

3.2 Chunkservers
• Store file chunks, each identified by a unique 64-bit chunk handle.
• Chunks are replicated (default: 3 replicas) for fault tolerance.
• Periodically communicate with the master to report chunk health.
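
As a rough illustration of the periodic report a chunkserver might send, here is a small sketch; the directory layout, file-naming scheme, and report fields are assumptions made for the example:

import os
import time

CHUNK_DIR = "/data/gfs_chunks"    # hypothetical directory holding local chunk files

def build_heartbeat_report():
    """Scan local chunk files and report their handles and sizes to the master."""
    report = {"timestamp": time.time(), "chunks": []}
    for name in os.listdir(CHUNK_DIR):
        handle = int(name, 16)                      # assume the file name is the hex chunk handle
        size = os.path.getsize(os.path.join(CHUNK_DIR, name))
        report["chunks"].append({"handle": handle, "size": size})
    return report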

3.3 Clients
• Query the master for chunk locations and cache this information.
• Interact directly with chunkservers for reading/writing data.
• Minimize interaction with the master to reduce bottlenecks.

4. Data Consistency Model
4.1 Consistency Guarantees
GFS provides relaxed consistency, meaning:
• Writes are applied atomically at the chunk level but are not always immediately visible or identical across replicas.
• The system provides eventual consistency: given sufficient time, replicas converge to a consistent state.

4.2 Types of Writes
• Record Append: Clients append data to a file, and GFS guarantees the data is written at
least once.
• Write: Overwrites data at a client-specified offset within a chunk; failures or concurrent writes can leave regions inconsistent across replicas.
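
Because record append is guaranteed at least once rather than exactly once, readers typically tolerate duplicates themselves. A minimal sketch of reader-side deduplication, assuming each record carries an application-assigned unique ID (the record layout is not part of GFS):

def read_unique_records(records):
    """Filter out duplicate records produced by at-least-once appends.
    Each record is assumed to be a (record_id, payload) pair."""
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue            # duplicate left behind by a retried append
        seen.add(record_id)
        yield payload

# Example: the second copy of record 42 is skipped.
records = [(41, b"a"), (42, b"b"), (42, b"b"), (43, b"c")]
assert list(read_unique_records(records)) == [b"a", b"b", b"c"]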

5. System Interactions
5.1 File Reads
1. Client requests chunk location from master.
2. Master returns chunkserver locations.
3. Client reads directly from the chunkserver.
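
A simplified sketch of this read path, assuming 64 MB chunks and hypothetical master and chunkserver RPC stubs (lookup, read_chunk); reads that span a chunk boundary are omitted for brevity:

CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB chunks

class Client:
    """Hypothetical client stub; `master` and `servers` are assumed RPC interfaces."""
    def __init__(self, master, servers):
        self.master = master
        self.servers = servers          # chunkserver address -> chunkserver stub
        self.cache = {}                 # (path, chunk index) -> (chunk handle, replica addresses)

    def read(self, path, offset, length):
        chunk_index = offset // CHUNK_SIZE
        key = (path, chunk_index)
        if key not in self.cache:                      # contact the master only on a cache miss
            self.cache[key] = self.master.lookup(path, chunk_index)
        handle, replicas = self.cache[key]
        server = self.servers[replicas[0]]             # any replica will do, often the closest
        return server.read_chunk(handle, offset % CHUNK_SIZE, length)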

5.2 File Writes
1. Master grants a chunk lease to one replica, designating it the primary chunkserver.
2. Client pushes the data to all replicas.
3. Primary assigns the mutation a serial order, applies it locally, and forwards that order to the secondaries.
4. Primary notifies the client once all replicas have applied the write.
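
To make the ordering role of the primary concrete, here is a minimal sketch of the steps above; the Replica and Primary classes are illustrative stand-ins, not the actual GFS implementation:

class Replica:
    """In-memory stand-in for one chunk replica."""
    def __init__(self):
        self.data = {}                   # offset -> payload bytes

    def apply(self, offset, payload):
        self.data[offset] = payload

class Primary(Replica):
    """The lease-holding primary assigns each mutation a serial number, applies it
    locally, and forwards the same order to the secondary replicas."""
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.serial = 0

    def write(self, offset, payload):
        self.serial += 1
        self.apply(offset, payload)
        for secondary in self.secondaries:       # every replica applies mutations in this order
            secondary.apply(offset, payload)
        return self.serial                       # success is acknowledged to the client

# Example: both replicas end up with identical contents.
secondaries = [Replica(), Replica()]
primary = Primary(secondaries)
primary.write(0, b"hello")
assert secondaries[0].data == primary.data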

5.3 Record Append
1. Master assigns a primary chunkserver for the chunk.
2. Primary chooses the offset at which the record is written and applies the append.
3. If the append fails on any replica, the client retries, so replicas where earlier attempts succeeded end up with duplicate records.
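
A sketch of how the primary might choose the append offset; the padding rule (start a new chunk when the record would cross the 64 MB boundary) follows the paper, while the function itself is an assumption for illustration:

CHUNK_SIZE = 64 * 1024 * 1024     # 64 MB

def choose_append_offset(chunk_length, record_length):
    """Return (offset, needs_new_chunk). If the record would cross the chunk
    boundary, the rest of the current chunk is padded and a fresh chunk is used."""
    if chunk_length + record_length <= CHUNK_SIZE:
        return chunk_length, False
    return 0, True

# A record that still fits is appended in place; one that does not starts a new chunk.
assert choose_append_offset(10_000, 4_096) == (10_000, False)
assert choose_append_offset(CHUNK_SIZE - 100, 4_096) == (0, True)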

6. Fault Tolerance
6.1 Chunk Replication
• Chunks are replicated across multiple chunkservers.
• Master ensures replication levels are maintained.
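
A minimal sketch of how the master might detect chunks that need re-replication; the data layout (a dict from chunk handle to replica locations) is an assumption for the example:

REPLICATION_TARGET = 3        # default number of replicas per chunk

def chunks_needing_repair(chunks, live_servers):
    """Return (chunk handle, missing replica count) for every chunk whose number of
    live replicas has dropped below the target, so re-replication can be scheduled."""
    repairs = []
    for handle, locations in chunks.items():
        live = [server for server in locations if server in live_servers]
        if len(live) < REPLICATION_TARGET:
            repairs.append((handle, REPLICATION_TARGET - len(live)))
    return repairs

# Example: chunk 7 lost a replica when server "cs2" failed.
chunks = {7: {"cs1", "cs2", "cs3"}, 8: {"cs1", "cs3", "cs4"}}
assert chunks_needing_repair(chunks, live_servers={"cs1", "cs3", "cs4"}) == [(7, 1)]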

6.2 Master Recovery
• Master state is checkpointed frequently and can be reconstructed from logs.
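
A sketch of recovery as checkpoint-plus-log-replay; the checkpoint and log record formats here are assumptions chosen for the example:

def recover_master_state(checkpoint, operation_log):
    """Rebuild the file-to-chunk mapping from the latest checkpoint plus the
    operations logged after that checkpoint was taken."""
    state = {path: list(chunks) for path, chunks in checkpoint.items()}
    for op, path, *args in operation_log:        # replay later mutations in order
        if op == "create":
            state[path] = []
        elif op == "add_chunk":
            state[path].append(args[0])
        elif op == "delete":
            state.pop(path, None)
    return state

# Example: a file created after the checkpoint is recovered from the log alone.
checkpoint = {"/logs/a": [1, 2]}
log = [("create", "/logs/b"), ("add_chunk", "/logs/b", 3)]
assert recover_master_state(checkpoint, log) == {"/logs/a": [1, 2], "/logs/b": [3]}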

6.3 Data Integrity
• Each chunkserver maintains checksums for the blocks of every chunk it stores, separate from the data itself.
• Checksums are verified on reads, so corruption is detected before data is returned.
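
A sketch of per-block checksumming and read-time verification; the 64 KB block size with 32-bit checksums follows the paper, while the use of CRC32 and the function names are assumptions:

import zlib

BLOCK_SIZE = 64 * 1024        # checksums are kept per 64 KB block of each chunk

def checksum_blocks(chunk_bytes):
    """Compute one 32-bit checksum per block, stored apart from the data itself."""
    return [zlib.crc32(chunk_bytes[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_bytes), BLOCK_SIZE)]

def verify_range(chunk_bytes, checksums, offset, length):
    """Verify only the blocks covered by a read before the data is returned."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_bytes[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError("checksum mismatch in block %d" % b)

# Example: corruption inside the read range is detected.
data = bytes(200 * 1024)
sums = checksum_blocks(data)
verify_range(data, sums, offset=70 * 1024, length=10)          # clean read passes
corrupted = data[:80 * 1024] + b"\x01" + data[80 * 1024 + 1:]
try:
    verify_range(corrupted, sums, offset=70 * 1024, length=20 * 1024)
    assert False
except IOError:
    pass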

7. Performance Optimizations
7.1 Caching
• Clients cache metadata to reduce master load.

7.2 Load Balancing
• Master distributes chunks based on storage and workload patterns.

7.3 Rebalancing
• Underutilized chunkservers are assigned additional chunks.

7.4 Garbage Collection
• Deleted files are marked for deletion and purged later.
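
A minimal sketch of this lazy deletion scheme, assuming files are renamed to a hidden, timestamped name and purged by a background scan; the three-day grace period follows the paper's default, the rest is illustrative:

import time

GRACE_PERIOD = 3 * 24 * 3600       # three days before physical removal

def delete_file(namespace, path):
    """Deletion only renames the file to a hidden name that records the deletion time."""
    hidden = ".deleted.%d.%s" % (int(time.time()), path)
    namespace[hidden] = namespace.pop(path)

def garbage_collect(namespace, now=None):
    """Background scan: physically remove hidden files older than the grace period."""
    now = time.time() if now is None else now
    for name in list(namespace):
        if name.startswith(".deleted."):
            deleted_at = int(name.split(".")[2])
            if now - deleted_at > GRACE_PERIOD:
                del namespace[name]

# Example: the file survives immediate collection but is purged after the grace period.
namespace = {"/logs/old": [1, 2]}
delete_file(namespace, "/logs/old")
garbage_collect(namespace)                               # too soon, nothing removed
assert len(namespace) == 1
garbage_collect(namespace, now=time.time() + 4 * 24 * 3600)
assert namespace == {}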

8. Real-World Deployment
• GFS powers Google’s large-scale applications, including indexing and data processing tasks.
• Handles petabytes of data across thousands of machines.

9. Conclusion
GFS is a highly scalable and fault-tolerant file system tailored to Google’s needs. Its design principles, including relaxed consistency, replication, and self-healing mechanisms, make it well-suited for large-scale distributed data processing.
