Distributed File System: Google File System
DFS
Split the data into chunks and use multiple disks and CPUs.
With, say, 1000 CPUs the same scan takes about 4000 s … roughly an hour
Challenges
Huge data, so sharding is needed
Sharding over so many machines means a few will always be down -> hence replicas are needed
Replicas -> have to make sure they stay consistent
Consistency compromises performance
The Google File System
Sanjay Ghemawat,
Howard Gobioff,
Shun-Tak Leung
(Google)
GFS Motivation
Assumptions – Environment
Commodity Hardware
– inexpensive
Component Failure
– the norm rather than the exception: application bugs, OS bugs, failures of disks/memory/connectors/networking/power supplies
TBs of Space
– must support TBs of space
Assumptions – Applications
Multi-GB files
• Common
Workloads
• Large streaming reads
• Small random reads
• Large, sequential writes that append data to a file
• Multiple clients concurrently appending to one file
Architecture
Single master
Multiple chunkservers
– Grouped into Racks
– Connected through switches
Multiple clients
Master/chunkserver coordination
– HeartBeat messages
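Below is a minimal Go sketch of these roles; the type and field names (Master, HeartBeat, lastSeen) are illustrative, not GFS's actual interfaces.

package main

import "time"

// HeartBeat is the periodic message a chunkserver sends to the master: it
// proves liveness and reports which chunk handles the server currently holds.
type HeartBeat struct {
    ServerAddr string
    ChunkIDs   []uint64
    Sent       time.Time
}

// Master never sees file data; it only tracks metadata such as which
// chunkservers (grouped into racks) are alive.
type Master struct {
    lastSeen map[string]time.Time // chunkserver address -> last heartbeat time
    rackOf   map[string]string    // chunkserver address -> rack
}

// ReceiveHeartBeat records liveness; a chunkserver that misses several
// heartbeat intervals is treated as down and its chunks get re-replicated.
func (m *Master) ReceiveHeartBeat(hb HeartBeat) {
    if m.lastSeen == nil {
        m.lastSeen = map[string]time.Time{}
    }
    m.lastSeen[hb.ServerAddr] = hb.Sent
}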
Master
Metadata
– Three types
• File and chunk namespaces (handles) – logged
• Mapping from files to chunks – logged
• Locations of chunks' replicas – not logged: if the master dies, it restarts and asks the chunkservers what they hold, because chunkserver disks may also go bad and lose chunk information. Per chunk the master also tracks the version number (logged), the primary replica, and the lease expiration time.
– Replicated on multiple remote machines
– Kept in memory
Operations
– Replica placement
– New chunk and replica creation
– Load balancing
– Unused storage reclamation
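A rough Go sketch of this metadata, under the split described above: namespaces, the file-to-chunk mapping and chunk versions are logged, while replica locations are rebuilt from chunkserver reports. All type names are illustrative.

package main

import "time"

type ChunkHandle uint64

// Per-chunk state. Replica locations are NOT logged; they are re-learned
// from chunkserver reports after a master restart.
type ChunkInfo struct {
    Version     uint64    // logged: used to detect stale replicas
    Replicas    []string  // volatile: chunkserver addresses holding a copy
    Primary     string    // replica currently holding the lease
    LeaseExpiry time.Time // lease expiration time
}

// MasterMetadata groups the metadata the master keeps in memory.
type MasterMetadata struct {
    // 1. File and chunk namespaces, and 2. the mapping from each file to
    //    its ordered list of chunks (both logged).
    Files map[string][]ChunkHandle
    // 3. Per-chunk replica locations, version, primary and lease expiry.
    Chunks map[ChunkHandle]*ChunkInfo
}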
Flow
Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
It then sends the master a request containing the file name and chunk index.
The master replies with the corresponding chunk handle and the locations of the replicas. The client caches this information using the file name and chunk index as the key.
The client then sends a request to one of the replicas, specifying the chunk handle and a byte range within that chunk.
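A small Go sketch of this client-side translation, assuming the standard 64 MB chunk size; the example offset and the helper name chunkIndex are made up for illustration.

package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB

// chunkIndex converts an application byte offset into a chunk index and an
// offset inside that chunk.
func chunkIndex(offset int64) (index, within int64) {
    return offset / chunkSize, offset % chunkSize
}

func main() {
    idx, off := chunkIndex(200_000_000) // a read at byte 200,000,000
    fmt.Printf("chunk index %d, offset within chunk %d\n", idx, off)
    // The client sends (file name, idx) to the master, caches the returned
    // chunk handle and replica locations, then asks one replica for the
    // byte range starting at off within that chunk.
}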
Operations
Replica placement:
Chunk replicas are spread across racks so that chunks survive even if a rack is damaged or offline.
Place new chunks on chunkservers with below-average disk space utilization, because chunks are typically created when a write is about to follow.
Which chunks to re-replicate first depends on how many replicas are left, whether the file is still live, whether a client is blocked on the chunk, etc.
The master selects a new chunkserver (balancing load, on a different rack) and asks it to copy the chunk from a valid replica.
The master also rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk space and load balancing.
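A simplified Go sketch of these placement heuristics (rack diversity plus below-average disk utilization); the pickReplicaServers helper and its selection rule are assumptions for illustration, not the master's real algorithm.

package main

type Server struct {
    Addr     string
    Rack     string
    DiskUtil float64 // fraction of disk space in use
}

// pickReplicaServers returns up to n servers for a new chunk, preferring
// below-average disk utilization and never reusing a rack.
func pickReplicaServers(servers []Server, n int) []Server {
    if len(servers) == 0 {
        return nil
    }
    avg := 0.0
    for _, s := range servers {
        avg += s.DiskUtil
    }
    avg /= float64(len(servers))

    usedRacks := map[string]bool{}
    var picked []Server
    for _, s := range servers {
        if len(picked) == n {
            break
        }
        if s.DiskUtil <= avg && !usedRacks[s.Rack] {
            picked = append(picked, s)
            usedRacks[s.Rack] = true
        }
    }
    return picked
}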
Read
For each chunk, the master responds with the chunk handle and the chunk's replica servers.
Implementation – Consistency Model
Write
What can go wrong
– Large writes, or data that straddles a chunk boundary, are split into multiple operations, which is what lets concurrent mutations interleave with them.
If a mutation is not interleaved with another concurrent mutation, the resulting data is defined.
If a mutation is interleaved with a concurrent mutation, the data can be undefined: mutation 1 was given start offset x1 and mutation 2 was given offset x2, so mutation 1 may not have had room to write out everything it wanted. In that case the region holds undefined data, or mingled fragments.
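A Go sketch of the split that makes this interleaving possible: the client library breaks a byte range at chunk boundaries, and each piece becomes an independent mutation; splitWrite is an illustrative name, not a GFS API.

package main

const chunkSize = 64 << 20 // 64 MB

type piece struct {
    ChunkIndex int64 // which chunk of the file
    Offset     int64 // offset within that chunk
    Length     int64
}

// splitWrite breaks a write of `length` bytes at file offset `offset` into
// per-chunk pieces; each piece is sent as its own mutation, so mutations
// from other clients can land between them.
func splitWrite(offset, length int64) []piece {
    var pieces []piece
    for length > 0 {
        n := chunkSize - offset%chunkSize // room left in the current chunk
        if n > length {
            n = length
        }
        pieces = append(pieces, piece{offset / chunkSize, offset % chunkSize, n})
        offset += n
        length -= n
    }
    return pieces
}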
Can have duplicates (content repeated)
Can have padding (blank regions)
Can have data in only two of three replicas if the client dies before retrying
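The paper notes that readers cope with this using checksums and unique record identifiers; the Go sketch below assumes a hypothetical record framing (ID, checksum, data) just to show the idea.

package main

import "hash/crc32"

// record is a hypothetical framing: the writer embeds a unique ID and a
// checksum in every record it appends.
type record struct {
    ID   uint64
    Sum  uint32
    Data []byte
}

// filterRecords drops records with bad checksums (padding or corruption read
// back as data) and duplicates (the same ID appended twice after a retry).
func filterRecords(recs []record) []record {
    seen := map[uint64]bool{}
    var out []record
    for _, r := range recs {
        if crc32.ChecksumIEEE(r.Data) != r.Sum {
            continue // padding or corrupted region
        }
        if seen[r.ID] {
            continue // duplicate left by a client retry
        }
        seen[r.ID] = true
        out = append(out, r)
    }
    return out
}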
Implementation – Leases and Mutation Order
Implementation – Writes
Mutation order is chosen by the primary; all replicas apply mutations in that order, so replicas stay identical.
A file region may still end up containing mingled fragments from different clients (consistent but undefined).
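A Go sketch of the lease idea: the primary stamps every mutation with a serial number and all secondaries apply mutations in that exact order; Primary and Order are illustrative names.

package main

import "sync"

type Mutation struct {
    Serial int64  // order chosen by the primary
    Offset int64  // where in the chunk to apply the data
    Data   []byte
}

// Primary is the replica currently holding the chunk's lease.
type Primary struct {
    mu     sync.Mutex
    serial int64
}

// Order stamps a mutation with the next serial number. The primary applies
// it locally and forwards the same serial number to every secondary, so all
// replicas apply mutations in an identical order and stay byte-identical
// even when many clients write concurrently.
func (p *Primary) Order(offset int64, data []byte) Mutation {
    p.mu.Lock()
    defer p.mu.Unlock()
    p.serial++
    return Mutation{Serial: p.serial, Offset: offset, Data: data}
}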
Implementation – Atomic Appends
Similar to writes
– Mutation order is determined by the primary
– All secondaries use the same mutation order
When data does not fit
– The primary pads the chunk to its end, tells the secondaries to do the same, and asks the client to retry the append on the next chunk; record appends are limited to a quarter of the chunk size to keep such fragmentation small.
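A Go sketch of the primary-side decision, assuming the 64 MB chunk size and the quarter-chunk append limit from the paper; appendOffset and the error value are hypothetical names.

package main

import "errors"

const (
    chunkSize = 64 << 20      // 64 MB
    maxAppend = chunkSize / 4 // appends limited to 1/4 of the chunk size
)

var errRetryNextChunk = errors.New("chunk padded; retry append on the next chunk")

// appendOffset is the primary's decision for a record append: if the record
// does not fit in the chunk's remaining space, the chunk is padded to its end
// (on the primary and all secondaries) and the client retries on a new chunk.
func appendOffset(chunkUsed, recLen int64) (int64, error) {
    if recLen > maxAppend {
        return 0, errors.New("record larger than the maximum append size")
    }
    if chunkUsed+recLen > chunkSize {
        // padding of [chunkUsed, chunkSize) is omitted here
        return 0, errRetryNextChunk
    }
    return chunkUsed, nil // append at the current end of the chunk
}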
Other Issues – Data Flow
Pipelined fashion
Data transfer is pipelined over TCP connections
Each machine forwards the data to the "closest" machine
Benefits
– Avoid bottlenecks and minimize latency
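A tiny Go sketch of the pipelining idea, where a chunkserver writes data locally while forwarding it to the next hop; io.TeeReader stands in for the real streaming code, and the choice of the "closest" machine is omitted.

package main

import "io"

// pipeline models one hop of the data flow: a chunkserver writes the bytes
// to its local disk while simultaneously streaming them to the next-closest
// replica, so the data never waits for a full copy before moving on.
func pipeline(data io.Reader, localDisk, nextHop io.Writer) error {
    // io.TeeReader writes everything read from `data` to `localDisk`;
    // io.Copy forwards the same stream to `nextHop` as it flows.
    _, err := io.Copy(nextHop, io.TeeReader(data, localDisk))
    return err
}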
Other Issues – Garbage Collection
Deleted files
– The deletion operation is logged
– The file is renamed to a hidden name; it may be removed later or recovered
Stale replicas
– Detected via chunk version numbering
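A Go sketch of stale-replica detection via version numbers: the master bumps the version when it grants a new lease, so any replica reporting an older version is garbage; staleReplicas is an illustrative helper.

package main

type replicaReport struct {
    Server  string
    Version uint64 // chunk version this server reports holding
}

// staleReplicas compares reported versions against the master's current
// version for the chunk (bumped each time a new lease is granted). Replicas
// that lag, e.g. because they were down during a mutation, are treated as
// garbage and reclaimed by the regular garbage-collection scan.
func staleReplicas(masterVersion uint64, reports []replicaReport) []string {
    var stale []string
    for _, r := range reports {
        if r.Version < masterVersion {
            stale = append(stale, r.Server)
        }
    }
    return stale
}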
Implementation – Operation Log
Other Issues – Replica Operations
Creation
– Disk space utilization
– Number of recent creations on each chunkserver
– Spread across many racks
Re-replication
– Prioritized: How far it is from its replication goal…
– The highest priority chunk is cloned first by copying the chunk data
directly from an existing replica
Rebalancing
– Periodically
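A Go sketch of this prioritization; the exact weights are made up, but the ordering criteria (distance from the replication goal, live files, chunks blocking a client) follow the list above.

package main

import "sort"

type chunkState struct {
    Handle         uint64
    Goal           int  // desired number of replicas
    Live           int  // replicas currently alive
    FileLive       bool // false if the owning file has been deleted
    BlockingClient bool // a client is waiting on this chunk
}

// priority: chunks farther from their replication goal come first, chunks
// blocking a client are boosted, chunks of deleted files can wait.
func priority(c chunkState) int {
    p := (c.Goal - c.Live) * 10
    if c.BlockingClient {
        p += 5
    }
    if !c.FileLive {
        p -= 5
    }
    return p
}

// cloneOrder sorts chunks so the highest-priority chunk is cloned first,
// by copying the chunk data directly from an existing valid replica.
func cloneOrder(cs []chunkState) {
    sort.Slice(cs, func(i, j int) bool { return priority(cs[i]) > priority(cs[j]) })
}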
Other Issues – Fault Tolerance and Diagnosis
Fast Recovery
– Operation log
– Checkpointing
Chunk replication
– Each chunk is replicated on multiple chunkservers on different racks
Master replication
– Operation log and check points are replicated on multiple machines
Data integrity
– Checksumming to detect corruption of stored data
– Each chunkserver independently verifies the integrity
Diagnostic logs
– Chunkservers going up and down
– RPC requests and replies
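A Go sketch of chunkserver-side checksumming, using the paper's 64 KB block granularity; CRC32 here stands in for whatever checksum GFS actually uses.

package main

import "hash/crc32"

const blockSize = 64 << 10 // 64 KB blocks within a chunk

// checksums computes one 32-bit checksum per 64 KB block of a chunk; the
// chunkserver stores these alongside the chunk.
func checksums(chunk []byte) []uint32 {
    var sums []uint32
    for off := 0; off < len(chunk); off += blockSize {
        end := off + blockSize
        if end > len(chunk) {
            end = len(chunk)
        }
        sums = append(sums, crc32.ChecksumIEEE(chunk[off:end]))
    }
    return sums
}

// verifyBlock is run before returning a block to a reader; on a mismatch the
// chunkserver reports the corruption to the master, which re-replicates the
// chunk from a healthy copy.
func verifyBlock(block []byte, expected uint32) bool {
    return crc32.ChecksumIEEE(block) == expected
}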
Current status
Implications for Applications
Measurements
Read rates are much higher than write rates.
Both clusters show heavy read activity.
Cluster A supports up to 750 MB/s of reads, cluster B up to 1300 MB/s.
The master was not a bottleneck.

Cluster                      A           B
Read rate (last minute)      583 MB/s    380 MB/s
Read rate (last hour)        562 MB/s    384 MB/s
Read rate (since restart)    589 MB/s    49 MB/s
Write rate (last minute)     1 MB/s      101 MB/s
Write rate (last hour)       2 MB/s      117 MB/s
Write rate (since restart)   25 MB/s     13 MB/s
Master ops (last minute)     325 Ops/s   533 Ops/s
Master ops (last hour)       381 Ops/s   518 Ops/s
Master ops (since restart)   202 Ops/s   347 Ops/s
Implementation – Snapshot*
Goals
– To quickly create branch copies of huge data sets
– To easily checkpoint the current state
Copy-on-write technique
– Metadata for the source file or directory tree is duplicated
– Reference counts for the chunks are incremented
– Chunks are copied later, at the first write
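A Go sketch of copy-on-write snapshotting as described above: snapshot copies only metadata and bumps reference counts, and a shared chunk is copied on the first subsequent write; the fs type and its methods are illustrative.

package main

type ChunkHandle uint64

type fs struct {
    files  map[string][]ChunkHandle // file name -> ordered chunk list (metadata)
    refcnt map[ChunkHandle]int      // how many files point at each chunk
    next   ChunkHandle
}

func newFS() *fs {
    return &fs{files: map[string][]ChunkHandle{}, refcnt: map[ChunkHandle]int{}}
}

// snapshot duplicates only the metadata of src and bumps chunk refcounts;
// no chunk data is copied yet.
func (f *fs) snapshot(src, dst string) {
    chunks := append([]ChunkHandle(nil), f.files[src]...)
    f.files[dst] = chunks
    for _, c := range chunks {
        f.refcnt[c]++
    }
}

// beforeWrite handles the first write to chunk i of a file after a snapshot:
// if the chunk is shared, it is copied (on the chunkservers, omitted here)
// and the writer gets a fresh handle, so the snapshot stays untouched.
func (f *fs) beforeWrite(file string, i int) ChunkHandle {
    c := f.files[file][i]
    if f.refcnt[c] <= 1 {
        return c
    }
    f.refcnt[c]--
    f.next++
    newC := f.next
    f.refcnt[newC] = 1
    f.files[file][i] = newC
    return newC
}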
Review
High availability and component failure
– Fault tolerance, Master/chunk replication, HeartBeat, Operation Log,
Checkpointing, Fast recovery
TBs of Space
– 100s of chunkservers, 1000s of disks
Networking
– Clusters and racks
Scalability
– Simplicity with a single master
– Interaction between master and chunkservers is minimized
Review
Multi-GB files
– 64MB chunks
Sequential reads
– Large chunks, cached metadata, load balancing
Appending writes
– Atomic record appends
Benefits and Limitations
Conclusion
GFS Publication: https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
MIT Topic discussion:
https://pdos.csail.mit.edu/6.824/papers/gfs-faq.txt
DFS: https://www.youtube.com/watch?v=xoA5v9AO7S0&t=1s