
MIT 6.824: Lecture 3 - GFS


02 May 2020 · 10 min read

The Google File System paper is relevant to this course because GFS is an example of
distributed storage, which is a key abstraction in building distributed systems. Many
distributed systems are either distributed storage systems or systems built on top of
distributed storage.

Building distributed storage is a hard problem for a couple of reasons:

These systems are built to achieve high performance when the volume of data is too
large for a single machine, which leads us to sharding (splitting) the data across
multiple servers.
Because multiple servers are involved, there will likely be more faults in the system.
To improve fault tolerance, the data is usually replicated across multiple machines.
Replication of data leads to potential inconsistencies in the data. A client could read
from a stale replica, for example.
The protocols for stronger consistency often lead to lower performance, which is the
opposite of what we want.

This cycle, which leads back to performance, highlights the challenges in building
distributed systems. The GFS paper touches on these topics and discusses the trade-
offs that were made to yield good performance in a production-ready distributed
storage system.

Table of Contents

Paper Summary
Design Overview
Single Master
Chunk Size
Metadata
Consistency Model
Record Appends
System Interactions
Writes
Atomic Record Appends
Fault Tolerance
Data Integrity
Conclusion
Further Reading

Paper Summary #
GFS was built at a time when Google needed a storage system to meet its data-
processing demands, with the goal of achieving good performance while being:

Global: Not tailored for just one application, but available to many Google
applications.
Fault Tolerant: Designed to account for component failures by default.
Scalable: It could expand to meet increasing storage needs by adding extra servers.

Also note that it was tailored for a workload that largely consisted of sequential access
to huge files (read or append). It was not built to optimize for low-latency requests;
rather, it was meant for batch workloads which often read a file sequentially, as in
MapReduce jobs.

Design Overview #

A GFS cluster is made up of a single master and multiple chunkservers, and is
accessed by multiple clients, as shown in Figure 1 of the paper.

Breakdown of the architecture:

A client and a chunkserver may reside on the same machine, provided it has enough
resources.

A file stored in GFS is broken into chunks. A chunk is identified by an immutable and
globally unique chunk handle.

A chunk can be replicated across multiple chunkservers. It is configured to be
replicated across three chunkservers by default.

The master keeps track of all the filesystem metadata. It knows how a file is split into
chunks, and keeps track of what chunkservers hold each chunk.

It is also responsible for garbage collection of orphaned chunks (when a file is
deleted), and the migration of chunks between chunkservers for rebalancing load.

The master communicates with each chunkserver via Heartbeat operations to pass
instructions to it and collect its state.

There is a GFS Client library that is linked into each application using GFS. The library
handles communication with the master and chunkservers to read and write data on
behalf of the application.

The chunkservers do not perform any form of caching themselves, but instead rely on the
Linux buffer cache, which keeps frequently accessed data in memory.

An interesting design choice made in this system is the decoupling of the data flow
from the control flow. The GFS client only communicates with the master for metadata
operations, but all data-bearing communications (reads and writes) go directly to the
chunkservers. I'll explain how that works next.

Single Master #

As noted earlier, there is a single master in a GFS cluster, and clients interact with
it only to retrieve metadata. This section highlights the role of the master in decoupling
the data flow from the control flow.

To read the data for a file:

The client first communicates with the master, sending it a request containing the file
name and a chunk index. The client derives the chunk index from the byte offset the
application wants to read from, using the fixed chunk size.

The master replies with the corresponding chunk handle and the location of its
replicas.

The client caches this information using the file name and chunk index as the key.

The client then sends a request to one of the replicas specified in the master's reply,
usually the one closest to it, specifying the chunk handle and the byte range for the
requested data.

By caching the information from the master, further reads to the same chunk do not
require any more client-master interactions until the cached information expires or
the file is reopened.
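To make this flow concrete, here is a minimal Go sketch of the client-side read path. Everything in it (the `ChunkLocation` struct, `askMaster`, `readFromChunkserver`, and the cache layout) is a hypothetical illustration of the steps above, not the actual GFS client library API.

```go
package gfs

// A sketch of the GFS client read path described above.
// All names here are hypothetical illustrations, not the real client library.

const chunkSize = 64 << 20 // 64 MB chunks

// ChunkLocation is what the master returns for a (file name, chunk index) pair.
type ChunkLocation struct {
	Handle   uint64   // immutable, globally unique chunk handle
	Replicas []string // addresses of the chunkservers holding replicas
}

// cacheKey identifies a cached chunk location by file name and chunk index.
type cacheKey struct {
	File  string
	Index int64
}

type Client struct {
	cache map[cacheKey]ChunkLocation
}

// Read returns up to length bytes of the file starting at offset.
func (c *Client) Read(file string, offset int64, length int) ([]byte, error) {
	// Translate the byte offset into a chunk index using the fixed chunk size.
	key := cacheKey{File: file, Index: offset / chunkSize}

	// Only contact the master if we have no cached location for this chunk.
	loc, ok := c.cache[key]
	if !ok {
		var err error
		if loc, err = askMaster(key.File, key.Index); err != nil {
			return nil, err
		}
		if c.cache == nil {
			c.cache = make(map[cacheKey]ChunkLocation)
		}
		c.cache[key] = loc
	}

	// Read the byte range directly from one replica (e.g. the closest one);
	// the data itself never flows through the master.
	return readFromChunkserver(loc.Replicas[0], loc.Handle, offset%chunkSize, length)
}

// Placeholders standing in for the real RPCs to the master and a chunkserver.
func askMaster(file string, index int64) (ChunkLocation, error) { return ChunkLocation{}, nil }

func readFromChunkserver(addr string, handle uint64, off int64, n int) ([]byte, error) {
	return make([]byte, n), nil
}
```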

Chunk Size #

In typical Linux filesystems, a file is split into blocks, and those blocks usually range
from 0.5 to 64 kilobytes in size, with the default on most file systems being 4 kilobytes.

The block size is the unit of work for the file system, which means reads and writes are
done in multiples of that block size.

In GFS, chunks are analogous to blocks, except that chunks are of a much larger size
(64 MB). Having a large chunk size offers several advantages in this system:

The client will not need to interact with the master as much, since reads and writes
on the same chunk will require only one initial request to the master to get the chunk
location information, and more data will fit on a single chunk.

With large chunk sizes, a client is more likely to perform many operations on a given
chunk, and so we can reduce network overhead by keeping a persistent TCP
connection to the chunkserver over an extended period of time.

It can also reduce the size of the metadata stored on the master, since the master keeps
track of fewer chunks (see the worked example below).
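As a rough illustration of the metadata savings (the numbers are my own, not from the paper): a 1 GB file occupies 1 GB / 64 MB = 16 chunks, so the master only needs 16 chunk entries for it. With a typical 4 KB filesystem block size, the same file would span 262,144 blocks.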

Google uses lazy space allocation to avoid wasting space due to internal
fragmentation. Internal fragmentation means having unused portions of the 64 MB
chunk. For example, if we allocate a 64 MB chunk and only fill up 10 MB, that's a lot of
unused space.

According to this Stack Overflow answer,

Lazy space allocation means that the physical allocation of space is delayed as
long as possible, until data at the size of the chunk size is accumulated.

From the rest of that answer, I think what this means is that the decision to allocate a
new chunk is based solely on the data available, as opposed to using another
partitioning scheme to allocate data to chunks.

This does not mean the chunks will always be filled up. A chunk which contains the file
region for the end of a file will typically only be partially filled up.

Metadata #

The master stores three types of metadata in memory:

File and chunk namespaces (i.e. the directory hierarchy).

The mapping from files to chunks.

The locations of each chunk's replicas.

The first two types listed are also persisted on the master's local disk. The third is not
persisted; instead, the master asks each chunkserver about its chunks at master
startup and when a chunkserver joins the cluster.

By having the chunkservers as the ultimate source of truth for each chunk's location,
GFS eliminates some of the challenges of keeping the master and chunkservers in sync.

The master keeps an operation log, where it stores the namespace and file-to-chunk
mappings on local disk. It replicates this operation log on several machines, and GFS
does not make changes to the metadata visible to clients until they have been
persisted on all replicas.

After startup, the master can restore its file system state by replaying the operation
log. It keeps this log small by periodically checkpointing it, which minimizes startup time.
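For illustration, here is a rough Go sketch of how the master's in-memory metadata could be organised around the three types listed above. The struct and field names are my own, not from the paper.

```go
package gfs

// A sketch of the master's in-memory metadata, following the three
// metadata types described above. Names are illustrative only.

type ChunkHandle uint64

type ChunkInfo struct {
	// Chunk version number, used to detect stale replicas.
	Version uint64
	// Chunkserver addresses holding this chunk; not persisted, but rebuilt
	// from chunkserver reports at startup and when servers join the cluster.
	Replicas []string
}

type Master struct {
	// 1. File and chunk namespaces (persisted via the operation log).
	namespace map[string]bool

	// 2. Mapping from each file to its ordered list of chunk handles
	//    (also persisted via the operation log).
	fileChunks map[string][]ChunkHandle

	// 3. Locations of each chunk's replicas (kept only in memory; the
	//    chunkservers remain the source of truth for this).
	chunkInfo map[ChunkHandle]*ChunkInfo
}
```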

Consistency Model #

The consistency guarantee for GFS is relaxed. It does not guarantee that all the
replicas of a chunk are byte-wise identical. What it does guarantee is that every piece
of data stored will be written at least once on each replica. This means that a replica
may contain duplicates, and it is up to the application to deal with such anomalies.

From Table 1 in the paper:

A file region is consistent if all clients see the same data for it, regardless
of which replica they read from.
After a file data mutation, a region is defined if it is consistent and all clients
see the effect of the mutation in its entirety.

The data mutations here may be writes or record appends. A write occurs when data is
written at a file offset specified by the application.

Record Appends #

Record appends cause data to be written atomically at least once even in the presence
of concurrent mutations, but at an offset chosen by GFS.

If a record append succeeds on some replicas and fails on others, those successful
appends are not rolled back. This means that if the client retries the operation, the
successful replicas may have duplicates for that record.

When the client retries the record append at a new file offset, the region chosen for
the initial failed append operation may be left blank on the replicas where the append
failed (assuming the region is not modified before the retry). This blank region is known
as padding, and the existence of padding and duplicates is what makes the replicas
inconsistent.

Applications that use GFS are left with the responsibility of dealing with these
inconsistent file regions. These applications can include a unique ID with each record
to filter out duplicates, and use checksums to detect and discard extra padding (see the
sketch below).
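As an illustration of that application-side handling, here is a sketch (my own, not from the paper) of a reader that drops records with bad checksums and filters duplicates by a unique record ID.

```go
package gfs

import "hash/crc32"

// Record is a hypothetical application-level record stored in a GFS file.
type Record struct {
	ID       uint64 // unique ID written by the application
	Checksum uint32 // checksum over Data, written by the application
	Data     []byte
}

// FilterRecords discards padding/corruption (bad checksum) and duplicates
// (already-seen IDs), which is the kind of post-processing the text says
// applications built on GFS must do themselves.
func FilterRecords(records []Record) []Record {
	seen := make(map[uint64]bool)
	var out []Record
	for _, r := range records {
		if crc32.ChecksumIEEE(r.Data) != r.Checksum {
			continue // padding or corrupted region: discard
		}
		if seen[r.ID] {
			continue // duplicate from a retried record append: discard
		}
		seen[r.ID] = true
		out = append(out, r)
	}
	return out
}
```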

There is also the possibility of a client reading from a stale replica. Each chunk replica
has a version number, which the master increases whenever it grants a new lease on the
chunk. If the chunkserver hosting a chunk replica is down during a mutation, the replica
will miss the mutation and be left with an older version number, making it stale. Stale
replicas are not given to clients when they ask the master for the location of a chunk,
and they are not involved in mutations either.

Despite this, because a client caches the location of a chunk, it may read from a stale
replica before the information is refreshed. The impact of this is low because most
operations to a chunk are append-only. This means that a stale replica usually returns
a premature end of chunk, rather than outdated data for a value.

System Interactions #

This section describes in more detail how the client, master and chunkservers interact
to implement data mutations and atomic record appends.

Writes #

When the master receives a modification operation for a particular chunk, the following
happen:

a) The master finds the chunkservers which hold that chunk and grants a chunk lease
to one of them.

The server with the lease is called the primary, while the others are secondaries.
The primary determines the order in which mutations are applied to all the replicas.

b) After the lease expires (typically after 60 seconds), the master is free to grant
primary status to a different server for that chunk.

The master may sometimes try to revoke a lease before it expires (e.g. to prevent the
mutation of a file when the file is being renamed).
The primary may also request an indefinite extension of the lease as long as the
chunk is still being modified.

c) The master may lose communication with a primary while the mutation is still
happening. If this happens, the master can safely grant a new lease to another replica
once the old lease has expired.

Let's look at Figure 2 in the paper, which illustrates the control flow of a write operation.

The numbered steps below correspond to each number in the diagram.

1. The client asks the master which chunkserver holds the current lease for the chunk
and where the other replicas are.

2. The master grants a new lease to one of the replicas if none exists, increments the
chunk version number, and tells all the replicas to record the new version number. It
then replies to the client with the identity of the primary and the locations of the
secondaries. After this, the client no longer has to talk to the master.

3. The client pushes the data to all the chunkservers, not necessarily to the primary
first. The servers initially store this data in an internal LRU buffer cache until the
data is used.

4. Once the client receives acknowledgement that the data has been pushed
successfully, it sends the write request to the primary chunkserver. The primary
decides what serial order to apply the mutations in and applies them to its own chunk.

5. After applying the mutations, the primary forwards the write request and the serial
order to all the secondaries for them to apply in the same order.

6. All the secondaries reply to the primary once they have completed the operation.

7. The primary replies to the client, indicating whether the operation was a success or
an error. Note:
   - If the write succeeds at the primary but fails at any of the secondaries, we'll
     have an inconsistent state and an error is returned to the client.
   - The client can retry steps 3 through 7.
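Here is a rough sketch of the primary's side of steps 4 to 6, just to illustrate the serial-ordering idea. The types, method names, and error handling are my own assumptions, not the actual GFS implementation.

```go
package gfs

// A sketch of the primary chunkserver handling steps 4-6 of the write flow
// above. Everything here is illustrative, not the real GFS implementation.

// WriteRequest identifies data that was already pushed to every replica in
// step 3 and now needs to be applied to the chunk.
type WriteRequest struct {
	Handle uint64 // chunk handle
	DataID string // identifies the buffered data from step 3
	Offset int64
}

// Secondary is the primary's view of a secondary replica.
type Secondary interface {
	Apply(req WriteRequest, serial uint64) error
}

type Primary struct {
	nextSerial  uint64
	secondaries []Secondary
}

func (p *Primary) HandleWrite(req WriteRequest) error {
	// Step 4: pick a serial number for this mutation and apply it locally.
	p.nextSerial++
	serial := p.nextSerial
	if err := p.applyLocally(req, serial); err != nil {
		return err
	}

	// Step 5: forward the request and its serial number to every secondary,
	// which must apply mutations in the same serial order.
	for _, s := range p.secondaries {
		if err := s.Apply(req, serial); err != nil {
			// Step 7: a failure at any secondary leaves the replicas
			// inconsistent; the error is reported back to the client,
			// which may retry steps 3 through 7.
			return err
		}
	}

	// Step 6/7: all secondaries acknowledged; report success to the client.
	return nil
}

// applyLocally stands in for writing the buffered data into the chunk.
func (p *Primary) applyLocally(req WriteRequest, serial uint64) error { return nil }
```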

Atomic Record Appends #

The system interactions for record appends are largely the same as discussed for
writes, with the following exceptions:

In step 4, the primary first checks to see if appending the record to the current chunk
would exceed the maximum size of 64 MB. If so, the primary pads the chunk, notifies
the secondaries to do the same, and then tells the client to retry the request on the
next chunk (sketched below).

If the record append fails on any of the replicas, the client must retry the operation.
As discussed in the Consistency section, this means that replicas of the same chunk
may contain duplicates.

A record append is successful only when the data has been written at the same
offset on all the replicas of a chunk.
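The padding check described in the first point could look something like the following sketch. The function and error names are hypothetical illustrations, not the real GFS code.

```go
package gfs

import "errors"

const maxChunkSize = 64 << 20 // 64 MB

// ErrRetryNextChunk is a hypothetical signal telling the client to retry
// the append on the next chunk, as described above.
var ErrRetryNextChunk = errors.New("chunk padded; retry append on next chunk")

// AppendRecord sketches the primary's extra check for record appends: if
// the record would not fit in the current chunk, pad the chunk (and have
// the secondaries do the same) and ask the client to retry on a new chunk.
func AppendRecord(currentSize int64, record []byte) (offset int64, err error) {
	if currentSize+int64(len(record)) > maxChunkSize {
		// Pad the remainder of the chunk on the primary and secondaries
		// (omitted here), then tell the client to retry on the next chunk.
		return 0, ErrRetryNextChunk
	}
	// Otherwise the primary chooses the offset (the current end of the
	// chunk) and the normal write flow continues from step 4.
	return currentSize, nil
}
```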

Fault Tolerance #

Fault Tolerance is achieved in GFS by implementing:

Fast Recovery: The master and the chunkservers are designed to restore their state
and start in a matter of seconds.

Chunk Replication: Each chunk is replicated on multiple chunkservers on different
racks. This ensures that some replicas are still available even if a rack is destroyed.
The master is able to clone existing replicas as needed when chunkservers go offline
or a replica is detected as stale or corrupted.

Master Replication: The master is also replicated for reliability. A state mutation is
considered committed only when the operation log has been flushed to disk on all
master replicas. When the master fails, it can restart almost immediately.

In addition, there are shadow masters which provide read-only access to the
filesystem when the primary master is down. There may be a lag in replicating data
from the primary master to its shadows, but these shadow masters help to improve
availability.

Data Integrity #

Checksumming is used by each chunkserver to detect the corruption of stored data.

From the course website [1]:

A checksum algorithm takes a block of bytes as input and returns a single
number that's a function of all the input bytes. For example, a simple checksum
might be the sum of all the bytes in the input (mod some big number). GFS
stores the checksum of each chunk as well as the chunk.

When a chunkserver writes a chunk on its disk, it first computes the checksum
of the new chunk, and saves the checksum on disk as well as the chunk. When a
chunkserver reads a chunk from disk, it also reads the previously-saved
checksum, re-computes a checksum from the chunk read from disk, and checks
that the two checksums match.

If the data was corrupted by the disk, the checksums won't match, and the
chunkserver will know to return an error. Separately, some GFS applications
stored their own checksums, over application-defined records, inside GFS files,
to distinguish between correct records and padding. CRC32 is an example of a
checksum algorithm.

[1] GFS FAQ - Lecture Notes from MIT 6.824
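As a small concrete illustration of the check described in the quote, here is a sketch of a chunkserver-style verification using CRC32 (the quote names CRC32 as an example algorithm). The function itself is my own illustration, not the actual GFS code.

```go
package gfs

import (
	"fmt"
	"hash/crc32"
)

// verifyBlock re-computes the checksum of data read from disk and compares
// it against the checksum that was saved alongside the data, as the quoted
// notes describe. On a mismatch it returns an error so the chunkserver can
// report corruption instead of serving bad data.
func verifyBlock(data []byte, savedChecksum uint32) error {
	if crc32.ChecksumIEEE(data) != savedChecksum {
		return fmt.Errorf("checksum mismatch: data corrupted on disk")
	}
	return nil
}
```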

Conclusion #
This week's material introduced some interesting ideas in the design of a distributed
storage system. These include:

The decoupling of data flow from control flow.
Using large chunk sizes to reduce overhead.
The sequencing of writes through a primary replica.

However, having a single master eventually became less than ideal for Google's use
case. As the number of stored files grew, it became harder to fit all the metadata for
those files in the master's memory. In addition, the number of clients also increased,
leading to too much CPU load on the master.

Another challenge with GFS at Google was that the weak consistency model meant
applications had to be designed to cope with its anomalies. These limitations eventually
led to the creation of Colossus as a successor to GFS.

Further Reading #
Lecture 3: GFS - MIT 6.824 Lecture Notes.
The Google File System by Firas Abuzaid (Stanford Lecture Notes)
GFS: The Google File System by Brad Karp (UCL Lecture Notes)
GFS: Evolution on Fast-forward - 2009 Interview with a Google engineer about the
origin and evolution of the Google File System.

mit-6.824 distributed-systems learning-diary
