The Google File System
Introduction
Google: a web search engine.
Its applications process enormous amounts of data.
They need a file system built for that workload.
Solution: the Google File System (GFS).
Motivational Facts
More than 15,000 commodity-class PCs.
Multiple clusters distributed worldwide.
Thousands of queries served per second.
One query reads hundreds of MB of data.
One query consumes tens of billions of CPU cycles.
Google stores dozens of copies of the entire Web!
Topics
Design Motivations
Architecture
Read/Write/Record Append
Fault-Tolerance
Performance Results
Design Motivations
1. Fault-tolerance and auto-recovery need to be built into the system.
2. Standard I/O assumptions (e.g. block size) have to be re-examined.
3. Record appends are the prevalent form of writing.
4. Google applications and GFS should be co-designed.
GFS Architecture
What is a chunk?
Analogous to a disk block, except much larger.
Size: 64 MB!
Stored on a chunkserver as a file.
A chunk handle (~ chunk file name) is used to reference the chunk.
Each chunk is replicated across multiple chunkservers.
Note: There are hundreds of chunkservers in a GFS cluster, distributed over multiple racks.
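(Not from the slides: a minimal sketch of how a byte offset within a file maps onto 64 MB chunks; the names are illustrative.)

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def locate(offset: int) -> tuple[int, int]:
    """Translate a byte offset within a file into (chunk index, offset within that chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# Example: byte 200,000,000 of a file falls in chunk index 2.
print(locate(200_000_000))  # -> (2, 65782272)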
GFS Architecture
What is a master?
A single process running on a separate machine.
Stores all metadata:
File namespace
File to chunk mappings
Chunk location information
Access control information
Chunk version numbers
Etc.
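(A rough sketch, not shown in the slides, of the kind of in-memory tables the master could keep for this metadata; all names are illustrative.)

from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int                                          # ~ chunk file name
    version: int = 1                                     # chunk version number
    locations: list[str] = field(default_factory=list)   # chunkservers holding replicas

@dataclass
class MasterMetadata:
    # file namespace: path -> ordered list of chunk handles
    files: dict[str, list[int]] = field(default_factory=dict)
    # chunk handle -> version, replica locations, ...
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)

    def lookup(self, path: str, chunk_index: int) -> ChunkInfo:
        """Resolve (file name, chunk index) -> the chunk's handle and replica locations."""
        return self.chunks[self.files[path][chunk_index]]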
GFS Architecture
Master <-> Chunkserver Communication:
The master and each chunkserver communicate regularly so the master can track chunkserver state:
Is chunkserver down?
Are there disk failures on chunkserver?
Are any replicas corrupted?
Which chunk replicas does chunkserver store?
Master sends instructions to chunkserver:
Delete existing chunk.
Create new chunk.
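(An illustrative sketch of what this regular exchange could carry; the exact message format is not given in the slides.)

from dataclasses import dataclass, field

@dataclass
class Heartbeat:
    """Illustrative contents of a chunkserver's regular report to the master."""
    chunkserver: str
    chunks_held: list[int] = field(default_factory=list)  # handles of replicas stored here
    corrupted: list[int] = field(default_factory=list)    # replicas that failed a checksum check
    disks_ok: bool = True

def react(hb: Heartbeat) -> list[str]:
    """Sketch of instructions the master might send back (delete / re-create chunks)."""
    instructions = [f"delete chunk {h}" for h in hb.corrupted]
    if not hb.disks_ok:
        instructions.append("re-replicate chunks stored on the failing disk")
    return instructions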
GFS Architecture
Serving Requests:
Client retrieves the metadata for an operation from the master.
Read/write data flows directly between the client and chunkservers.
The single master is not a bottleneck, because its involvement in read/write operations is minimized.
Overview
Design Motivations
Architecture
Master
Chunkservers
Clients
Read/Write/Record Append
Fault-Tolerance
Performance Results
And now for the Meat…
Read Algorithm
[Diagram: Application → GFS client → Master. (1) Application sends (file name, byte range) to the GFS client; (2) client sends (file name, chunk index) to the master; (3) master replies with (chunk handle, replica locations).]
Read Algorithm
[Diagram: GFS client → chunkservers. (4) Client sends (chunk handle, byte range) to a chunkserver; (6) chunkserver returns the requested data from the chunk file.]
Read Algorithm
1. Application originates the read request.
2. GFS client translates the request from (file name, byte range) -> (file name, chunk index), and sends it to the master.
3. Master responds with the chunk handle and replica locations (i.e. the chunkservers where the replicas are stored).
4. Client picks a location and sends the (chunk handle, byte range) request to that chunkserver.
5. Chunkserver sends the requested data to the client.
6. Client forwards the data to the application.
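(A compact sketch of this client-side read path; the master and chunkservers are modeled as plain dictionaries standing in for RPCs, so the names and signatures are illustrative, not the real GFS API.)

import random

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

def gfs_read(filename, offset, length, master, chunkservers):
    """Client-side read path (steps 2-6 above).

    `master` maps (filename, chunk_index) -> (chunk_handle, replica_locations), and
    `chunkservers` maps a location -> {chunk_handle: chunk bytes}; both are plain
    dicts standing in for the RPCs a real client would make.
    """
    # Step 2: translate (file name, byte range) -> (file name, chunk index).
    chunk_index = offset // CHUNK_SIZE
    # Step 3: the master returns the chunk handle and replica locations.
    handle, locations = master[(filename, chunk_index)]
    # Step 4: pick one replica location.
    server = random.choice(locations)
    # Steps 5-6: read the byte range out of that chunkserver's chunk file.
    chunk = chunkservers[server][handle]
    start = offset % CHUNK_SIZE
    return chunk[start:start + length]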
Read Algorithm (Example)
[Diagram: The Indexer application asks to read 2048 bytes of file crawl_99; the GFS client translates this to (crawl_99, chunk index 3). The master's table for crawl_99 lists chunk Ch_1001 on chunkservers {3,8,12}, Ch_1002 on {1,8,14}, and Ch_1003 on {4,7,9}, so it replies with (ch_1003, chunkservers {4,7,9}). The client sends the (ch_1003, byte range) request to one of those chunkservers, which returns the 2048 bytes of data.]
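(Running the read sketch above with this example's values, using made-up stand-in data:)

master = {("crawl_99", 3): (1003, ["cs4", "cs7", "cs9"])}
chunkservers = {s: {1003: bytes(4096)} for s in ["cs4", "cs7", "cs9"]}
data = gfs_read("crawl_99", 3 * CHUNK_SIZE, 2048, master, chunkservers)
assert len(data) == 2048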
Write Algorithm
[Diagram: Application → GFS client → Master. (1) Application sends (file name, data) to the GFS client; (2) client sends (file name, chunk index) to the master; (3) master replies with the chunk handle and the primary and secondary replica locations.]
Write Algorithm
[Diagram: (4) The client pushes the data to the primary and secondary chunkservers, where it is held in each chunkserver's internal buffer.]
Write Algorithm
[Diagram: (5) Client sends the write command to the primary. (6) Primary writes the buffered data (D1 | D2 | D3 | D4) to its chunk in the serial order it chooses. (7) Primary forwards the write command and serial order to the secondaries, which apply the same writes to their chunks. (8) Secondaries respond to the primary; the buffers are emptied. (9) Primary responds to the client.]
Write Algorithm
1. Application originates the write request.
2. GFS client translates the request from (file name, data) -> (file name, chunk index), and sends it to the master.
3. Master responds with the chunk handle and the (primary + secondary) replica locations.
4. Client pushes the write data to all replica locations. The data is stored in the chunkservers' internal buffers.
5. Client sends the write command to the primary.
Write Algorithm
6. Primary determines a serial order for the data instances stored in its buffer and writes the instances in that order to the chunk.
7. Primary sends the serial order to the secondaries and tells them to perform the write.
8. Secondaries respond to the primary.
9. Primary responds back to the client.
Note: If the write fails at one of the chunkservers, the client is informed and retries the write.
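(A minimal sketch of the buffer-then-commit flow in steps 4-9; chunkserver replicas are modeled as plain in-process objects, so nothing here is the real GFS interface.)

class Replica:
    """Minimal stand-in for a chunkserver replica holding one chunk."""
    def __init__(self):
        self.buffer = []   # data pushed by clients but not yet committed
        self.chunk = b""   # committed chunk contents

    def apply(self, order):
        """Append buffered data to the chunk in the given serial order."""
        for i in order:
            self.chunk += self.buffer[i]
        self.buffer.clear()

def gfs_write(data, primary, secondaries):
    """Sketch of steps 4-9: push data to all replicas, then commit via the primary."""
    replicas = [primary] + secondaries
    # Step 4: push the data into every replica's buffer.
    for replica in replicas:
        replica.buffer.append(data)
    # Steps 5-6: client asks the primary to commit; the primary chooses a serial
    # order for everything in its buffer and writes it to its chunk.
    order = list(range(len(primary.buffer)))
    primary.apply(order)
    # Step 7: the secondaries are told to apply the same serial order.
    for replica in secondaries:
        replica.apply(order)
    # Steps 8-9: secondaries ack the primary, and the primary acks the client.
    return all(r.chunk == primary.chunk for r in replicas)

# Example: all three replicas end up with identical chunk contents.
p, s1, s2 = Replica(), Replica(), Replica()
assert gfs_write(b"record-1", p, [s1, s2])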
Observations
Clients can read in parallel.
Clients can write in parallel.
Clients can append records in parallel.
Overview
Design Motivations
Architecture
Algorithms:
Read
Write
Record Append
Fault-Tolerance
Performance Results
Fault Tolerance
Fast Recovery: the master and chunkservers are designed to restart and restore their state in a few seconds.
Chunk Replication: across multiple machines, across multiple racks.
Master Mechanisms:
Log of all changes made to metadata.
Periodic checkpoints of the log.
Log and checkpoints replicated on multiple machines.
Master state is replicated on multiple machines.
“Shadow” masters for reading data if the “real” master is down.
Data integrity:
Each chunk has an associated checksum.
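(The slides only say each chunk has a checksum; a minimal illustration of checksum-based corruption detection, with an assumed 64 KB verification granularity and illustrative names.)

import zlib

BLOCK = 64 * 1024  # verify data in 64 KB pieces (assumed granularity for this sketch)

def checksums(chunk: bytes) -> list[int]:
    """Compute a CRC32 per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, stored: list[int]) -> bool:
    """Re-check the stored checksums before returning data; a mismatch means the
    replica is corrupted and should be re-replicated from a good copy."""
    return checksums(chunk) == stored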
Performance (Test Cluster)
Performance measured on cluster with:
1 master
16 chunkservers
16 clients
Server machines connected to a central switch by 100 Mbps Ethernet; same for the client machines.
The two switches are connected by a 1 Gbps link.
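(Back-of-the-envelope, from the numbers above: a 100 Mbps link caps each machine at roughly 12.5 MB/s, and the single 1 Gbps inter-switch link caps aggregate client-chunkserver traffic at roughly 125 MB/s, so the network itself sets the ceiling for these measurements.)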
Conclusion
Design Motivations
Architecture
Algorithms:
Fault-Tolerance
Performance Results