
Distributed File System

Google File System

1
DFS

• Single-node architecture: the data fits in memory
• Advanced to: data on disk, bringing parts into memory for processing
• Now consider much larger files that need even more processing

2
Split the data into chunks and use multiple disks and CPUs.
A job that would take about 4,000,000 s on one CPU can be done by 1000 CPUs in roughly 4000 s, about an hour.
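
As a rough back-of-the-envelope check (the dataset size and per-disk bandwidth below are illustrative assumptions, not numbers from the slide), this Python sketch shows how 1000 machines turn millions of seconds of sequential scanning into about an hour:

# Back-of-the-envelope scan-time estimate (illustrative numbers only).
DATASET_BYTES = 400e12        # assume ~400 TB of data to scan
DISK_BANDWIDTH = 100e6        # assume ~100 MB/s sequential read per disk
NUM_MACHINES = 1000           # as in the slide

single_node_seconds = DATASET_BYTES / DISK_BANDWIDTH
parallel_seconds = single_node_seconds / NUM_MACHINES

print(f"single node  : {single_node_seconds:,.0f} s (~{single_node_seconds/86400:.0f} days)")
print(f"{NUM_MACHINES} machines: {parallel_seconds:,.0f} s (~{parallel_seconds/3600:.1f} hours)")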

3
4
Challenges

• Nodes fail. Even if a single node fails only once in 3 years (about 1000 days), a cluster of 1M servers sees roughly 1000 failures per day
• Data is not persistent if it is lost whenever a node fails
• Availability is compromised when nodes fail
• The network can be a bottleneck, so data should not be moved around too much
• Complexity of distributed programming

5
Solution:

• A distributed file system (DFS) stores the data, taking care of redundancy and availability
• Examples: HDFS, GFS

6
• Huge data, so sharding is needed
• When sharding over so many machines, a few will always be down, hence replicas are needed
• With replicas, care must be taken to keep them consistent
• Consistency compromises performance

7
The Google File System
Sanjay Ghemawat,
Howard Gobioff,
Shun-Tak Leung
(Google)

8
GFS Motivation

• Need for a scalable DFS
• Large distributed data-intensive applications
• High data-processing needs
• Performance, reliability, scalability and availability
• More than a traditional DFS

9
Assumptions –
Environment

• Commodity hardware
  – inexpensive
• Component failure
  – the norm rather than the exception: application bugs, OS bugs, failures of disks, memory, connectors, networking, power supplies
• TBs of space
  – must support many terabytes of storage

10
Assumptions –
Applications

• Multi-GB files
  – Common
• Workloads
  – Large streaming reads
  – Small random reads
  – Large, sequential writes that append data to a file
  – Multiple clients concurrently appending to one file
• High sustained bandwidth
  – More important than low latency

11
Architecture

• Files are divided into chunks
• Fixed-size chunks (64 MB)
• Each chunk is replicated over several chunkservers; the copies are called replicas
• Each chunk is identified by a globally unique 64-bit chunk handle
• Chunks are stored as plain Linux files

12
Architecture

• Single master
• Multiple chunkservers
  – Grouped into racks
  – Connected through switches
• Multiple clients
• Master/chunkserver coordination
  – HeartBeat messages

13
Architecture

• Contact the single master
• Obtain chunk locations
• Contact one of the chunkservers
• Obtain the data

14
Master
• Metadata
  – Three types
    • File and chunk namespaces (handles): logged
    • Mapping from files to chunks: logged
    • Locations of chunk replicas: not logged. If the master dies and restarts, it asks the chunkservers what they hold, because chunkserver disks may themselves go bad and lose chunks. The master also tracks each chunk's version number (logged), its primary replica, and the lease expiration time.
  – Replicated on multiple remote machines
  – Kept in memory
• Operations
  – Replica placement
  – New chunk and replica creation
  – Load balancing
  – Unused storage reclamation
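
As one way to picture this, here is a minimal Python sketch of how the master's in-memory metadata could be organized (names like ChunkInfo and recover_locations are hypothetical, not GFS's actual code). Only the namespace, file-to-chunk mapping and version numbers would be persisted via the operation log; replica locations are rebuilt by asking chunkservers after a restart:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ChunkInfo:
    handle: int                       # globally unique 64-bit chunk handle
    version: int = 1                  # logged; used to detect stale replicas
    replica_locations: list = field(default_factory=list)  # chunkserver IDs; NOT logged
    primary: Optional[str] = None     # chunkserver currently holding the lease
    lease_expiration: float = 0.0     # when the primary's lease runs out

@dataclass
class MasterMetadata:
    # file & chunk namespaces and the file -> chunk-handle mapping (both logged)
    files: dict = field(default_factory=dict)   # path -> [chunk handle, ...]
    chunks: dict = field(default_factory=dict)  # chunk handle -> ChunkInfo

    def recover_locations(self, reports):
        """On restart, rebuild replica locations from chunkserver reports
        instead of the log, since chunkserver disks may themselves fail."""
        for server_id, handles in reports.items():
            for h in handles:
                if h in self.chunks:
                    self.chunks[h].replica_locations.append(server_id)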
15
Flow

• Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
• It then sends the master a request containing the file name and chunk index.
• The master replies with the corresponding chunk handle and the locations of the replicas. The client caches this information using the file name and chunk index as the key.
• The client then sends a request to one of the replicas, specifying the chunk handle and a byte range within that chunk.
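
A minimal Python sketch of that client-side translation, assuming a hypothetical master.lookup_chunk RPC stub (all names are illustrative, not GFS's real API):

CHUNK_SIZE = 64 * 1024 * 1024          # fixed 64 MB chunks

chunk_cache = {}                       # (file_name, chunk_index) -> (handle, replica list)

def locate(file_name, byte_offset, master):
    """Translate (file name, byte offset) into a chunk handle plus replica locations,
    consulting the master only on a cache miss."""
    chunk_index = byte_offset // CHUNK_SIZE          # which chunk within the file
    offset_in_chunk = byte_offset % CHUNK_SIZE       # the byte range starts here inside the chunk
    key = (file_name, chunk_index)
    if key not in chunk_cache:
        # one round trip to the master; the reply is cached for repeated reads
        handle, replicas = master.lookup_chunk(file_name, chunk_index)
        chunk_cache[key] = (handle, replicas)
    handle, replicas = chunk_cache[key]
    return handle, replicas, offset_in_chunk

# The client would then ask one of `replicas` for (handle, offset_in_chunk, length).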

16
Operations
Replica placement:
Chunk replicas are spread across racks so that chunks survive even if an entire rack is damaged or offline.

Unused storage reclamation:
A file deleted by the application is marked and renamed to a hidden file. Files that have been hidden for three days are removed during the master's regular scan. Similarly, orphaned chunks (those not reachable from any file) are removed.

In the HeartBeat messages regularly exchanged with the master, each chunkserver reports the chunks it holds, and the master replies with the identities of all chunks that are no longer present in the master's metadata. The chunkserver is free to delete its replicas of such chunks.
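
A tiny sketch of that exchange (hypothetical names, not the real protocol): the master diffs the handles a chunkserver reports against its own metadata and replies with the handles the chunkserver may delete.

def heartbeat_reply(master_chunks, reported_handles):
    """master_chunks: set of chunk handles the master still knows about.
    reported_handles: handles this chunkserver says it is storing.
    Returns the orphaned handles the chunkserver is free to delete."""
    return set(reported_handles) - set(master_chunks)

# Chunkserver side: lazily reclaim space for anything the master no longer tracks.
def process_reply(local_store, deletable_handles):
    for handle in deletable_handles:
        local_store.pop(handle, None)   # delete the local replica file

# Example:
master_chunks = {101, 102, 103}
print(heartbeat_reply(master_chunks, [101, 104, 105]))   # {104, 105} are orphaned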
17
Operations:
New chunk and replica creation:

New chunks are placed on chunkservers with below-average disk space utilization, because chunks are typically created when a write is about to follow. A sketch of such a placement heuristic follows below.

Re-replication is triggered when the number of replicas drops below the target (typically 3): a replica becomes unavailable, is reported to have errors or be corrupted, or the replication requirement goes up.

Which chunks to re-replicate first depends on how many replicas are left, whether the file is live, whether the chunk is blocking a client pipeline, etc.

The master selects a new chunkserver, balancing load and choosing a different rack, and asks it to copy from a valid replica.

The master also rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk space and load balancing.
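
A toy sketch of such a placement heuristic (field names and weighting are illustrative, not the master's actual policy): prefer emptier chunkservers, avoid servers with many recent creations, and spread the chosen replicas across racks.

def place_replicas(servers, num_replicas=3):
    """servers: list of dicts like
       {"id": "cs1", "rack": "r1", "disk_used": 0.40, "recent_creations": 2}.
    Returns the ids of chunkservers chosen for a new chunk (toy heuristic)."""
    # prefer emptier disks, then servers that have not just absorbed many creations
    candidates = sorted(servers, key=lambda s: (s["disk_used"], s["recent_creations"]))
    chosen, racks_used = [], set()
    # first pass: insist on rack diversity
    for s in candidates:
        if len(chosen) < num_replicas and s["rack"] not in racks_used:
            chosen.append(s["id"])
            racks_used.add(s["rack"])
    # second pass: fill any remaining slots even if a rack repeats
    for s in candidates:
        if len(chosen) < num_replicas and s["id"] not in chosen:
            chosen.append(s["id"])
    return chosen

servers = [
    {"id": "cs1", "rack": "r1", "disk_used": 0.80, "recent_creations": 0},
    {"id": "cs2", "rack": "r1", "disk_used": 0.30, "recent_creations": 1},
    {"id": "cs3", "rack": "r2", "disk_used": 0.35, "recent_creations": 0},
    {"id": "cs4", "rack": "r3", "disk_used": 0.90, "recent_creations": 5},
]
print(place_replicas(servers))   # ['cs2', 'cs3', 'cs4']: three different racks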

18
Read

• The client asks the master with a file name and offset
• For each chunk, the master responds with the chunk handle and the chunkservers holding its replicas
• The client caches all this information for repeated use
• The client contacts one of the chunkservers to get the data

19
Implementation –
Consistency Model

• Relaxed consistency model
• Two types of mutations
  – Writes
    • Cause data to be written at an application-specified file offset
  – Record appends
    • Operations that append data to a file
    • Cause data to be appended atomically at least once
    • Offset chosen by GFS, not by the client
• States of a file region after a mutation
  – Consistent
    • All clients see the same data, regardless of which replica they read from
  – Defined
    • Consistent, and all clients see what the mutation wrote in its entirety
  – Undefined
    • Consistent, but it may not reflect what any single mutation has written
  – Inconsistent
    • Clients see different data at different times

20
Write

• The client asks the master for replica information, which it caches. The master also tells it which replica is the primary and which are secondaries.
• If there is no current primary, the master finds the up-to-date replicas (by version number) and makes one of them primary, increments the version number of all replicas, and tells the client which servers are the primary and the secondaries.
• The master informs the primary and secondaries of the new version number, then records the updated version number itself.
• The primary picks the offset and tells the secondaries to apply the mutation at the same location.
• The client pushes the data to the replicas, passing it first to the closest one, which forwards it on; each replica buffers the data in a cache.
• Once all replicas acknowledge receiving the data, the client sends a write request to the primary. The primary assigns serial numbers to all write requests so they execute in order, applies them itself, and forwards the requests to the secondaries.

21
Write

• Each secondary executes the writes in serial order and confirms back to the primary.
• If the primary gets a yes from every secondary, it returns success to the client. Otherwise it returns failure to the client (the data region is inconsistent) and the client retries the procedure.
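
A condensed Python sketch of the control flow above from the primary's point of view (class and method names are made up; the data push and failure handling are elided): the primary assigns serial numbers so that every replica applies concurrent mutations in the same order.

import itertools

class PrimaryReplica:
    """Toy model of a primary's role in the write path (not GFS's real code)."""
    def __init__(self, secondaries):
        self.secondaries = secondaries          # secondary replica objects
        self.serial = itertools.count(1)        # mutation order is decided here
        self.applied = []                       # (serial_no, data) applied locally

    def write(self, data):
        serial_no = next(self.serial)           # pick the order for this mutation
        self.applied.append((serial_no, data))  # apply locally first
        acks = [s.apply(serial_no, data) for s in self.secondaries]
        # success only if every secondary applied it; otherwise the client retries
        return all(acks)

class SecondaryReplica:
    def __init__(self):
        self.applied = []
    def apply(self, serial_no, data):
        self.applied.append((serial_no, data))  # apply strictly in serial order
        return True

secondaries = [SecondaryReplica(), SecondaryReplica()]
primary = PrimaryReplica(secondaries)
print(primary.write(b"record-1"))   # True: all replicas applied it in the same order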

22
What can go wrong

• Serial writes: defined.
  – If the primary succeeded in writing but a secondary did not: inconsistent data.
• Concurrent writes: if everyone says yes, the region is consistent, but it could still be undefined, because:
  – Each concurrent write gets its own start index, but the streamed data, though written serially at each replica, may overwrite data from other concurrent writes, making the region consistent but undefined.
  – Large appends, or data that straddles a chunk boundary, may make this happen.

23
• If a mutation is not interleaved with another concurrent mutation, the data is defined.
• If a mutation is interleaved with another concurrent mutation: mutation 1 was given start index x1 and mutation 2 was given index x2, so mutation 1 may not have had the space to write out everything it wanted before mutation 2's region began. In such a case we get undefined data, or mingled fragments.
24
• Can have duplicates (content repeated)
• Can have blank (padded) regions
• Can have data in only two of the three replicas if the client dies mid-operation

25
Implementation –
Leases and Mutation Order

• The master uses leases to maintain a consistent mutation order among replicas
• The primary is the chunkserver that has been granted the chunk lease
• All other chunkservers holding replicas are secondaries
• The primary defines a serial order among mutations
• All secondaries follow this order

26
Implementation –
Writes

Mutation Order
• Applying mutations in the same order keeps replicas identical
• A file region may still end up containing mingled fragments from different clients (consistent but undefined)

27
Implementation –
Atomic Appends

• The client specifies only the data
• Similar to writes
  – Mutation order is determined by the primary
  – All secondaries use the same mutation order
• GFS appends the data to the file at least once atomically
  – The chunk is padded if appending the record would exceed the maximum chunk size -> padding
  – If a record append fails at any replica, the client retries the operation -> record duplicates
  – The file region may be defined but interspersed with inconsistent regions

28
When data does not fit

• When the record won't fit in the last chunk:
  – The primary pads the current chunk to its end
  – The primary instructs the other replicas to do the same
  – The primary replies to the client: "retry on the next chunk"
• If a record append fails at any replica, the client retries the operation
  – So replicas of the same chunk may contain different data, even duplicates of all or part of the record data
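
A toy sketch of the primary's decision for a record append (illustrative names; this is not GFS's implementation): if the record fits in the last chunk the primary chooses the offset, otherwise it pads the chunk and asks the client to retry on the next one.

CHUNK_SIZE = 64 * 1024 * 1024

def primary_record_append(chunk, record):
    """chunk: bytearray holding the current contents of the last chunk.
    Returns ('appended', offset) or ('retry_next_chunk', None)."""
    if len(chunk) + len(record) <= CHUNK_SIZE:
        offset = len(chunk)             # the primary chooses the offset, not the client
        chunk.extend(record)            # secondaries are told to append at the same offset
        return "appended", offset
    # record will not fit: pad this chunk to its end on every replica,
    # then tell the client to retry the append on the next chunk
    chunk.extend(b"\x00" * (CHUNK_SIZE - len(chunk)))
    return "retry_next_chunk", None

chunk = bytearray(b"existing data")
print(primary_record_append(chunk, b"new record"))   # ('appended', 13)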

29
Other Issues –
Data flow

• Data is pushed in a pipelined fashion
• Data transfer is pipelined over TCP connections
• Each machine forwards the data to the "closest" machine
• Benefits
  – Avoids bottlenecks and minimizes latency
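
A minimal sketch of the pipelined push (all names made up): each machine stores the data and forwards it to the next "closest" machine in the chain, rather than the client sending a separate copy to every replica.

def push_pipelined(data, chain, stores):
    """chain: replica ids ordered by network 'closeness' from the client.
    Each machine stores the data and immediately forwards it to the next one,
    so total time is roughly one transfer plus per-hop latency (toy model)."""
    if not chain:
        return
    first, rest = chain[0], chain[1:]
    stores[first].append(data)          # buffer the data locally (an LRU cache in real GFS)
    push_pipelined(data, rest, stores)  # forward to the next-closest machine

stores = {"cs1": [], "cs2": [], "cs3": []}
push_pipelined(b"chunk-data", ["cs1", "cs2", "cs3"], stores)
print(all(stores[s] == [b"chunk-data"] for s in stores))   # True: every replica has the data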

30
Other Issues –
Garbage Collection

• Deleted files
  – The deletion operation is logged
  – The file is renamed to a hidden name, and may later be removed or recovered
• Orphaned chunks (unreachable chunks)
  – Identified and removed during a regular scan of the chunk namespace
• Stale replicas
  – Detected using chunk version numbers

31
Implementation –
Operation Log

• Contains a historical record of metadata changes
• Replicated on multiple remote machines
• Kept small by creating checkpoints
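
A small sketch of the log-plus-checkpoint idea (hypothetical structure, not GFS's on-disk format): metadata mutations are appended to a log, and a periodic checkpoint of the full state lets the log be truncated so that recovery replays only a short tail.

class OperationLog:
    """Toy append-only metadata log with checkpointing (not GFS's real format)."""
    def __init__(self):
        self.checkpoint = {}     # snapshot of metadata state at the last checkpoint
        self.records = []        # logged mutations since that checkpoint

    def append(self, op, path, value=None):
        # in a real system the record is flushed to local disk and remote replicas
        self.records.append({"op": op, "path": path, "value": value})

    def make_checkpoint(self, state):
        self.checkpoint = dict(state)   # persist the full state...
        self.records = []               # ...so the log stays small

    def recover(self):
        state = dict(self.checkpoint)   # start from the checkpoint
        for r in self.records:          # replay only the short tail of the log
            if r["op"] == "create":
                state[r["path"]] = r["value"]
            elif r["op"] == "delete":
                state.pop(r["path"], None)
        return state

log = OperationLog()
log.append("create", "/big/file", [101, 102])
print(log.recover())    # {'/big/file': [101, 102]}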

32
Other Issues –
Replica Operations

• Creation
  – Disk space utilization
  – Number of recent creations on each chunkserver
  – Spread across many racks
• Re-replication
  – Prioritized by how far a chunk is from its replication goal
  – The highest-priority chunk is cloned first by copying the chunk data directly from an existing replica
• Rebalancing
  – Performed periodically

33
Other Issues –
Fault Tolerance and Diagnosis

• Fast recovery
  – Operation log
  – Checkpointing
• Chunk replication
  – Each chunk is replicated on multiple chunkservers on different racks
• Master replication
  – The operation log and checkpoints are replicated on multiple machines
• Data integrity
  – Checksumming to detect corruption of stored data
  – Each chunkserver independently verifies the integrity of its own replicas (see the sketch after this list)
• Diagnostic logs
  – Chunkservers going up and down
  – RPC requests and replies
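
A sketch of per-block checksumming as used for data integrity (the 64 KB block size follows the GFS paper; the function names are my own): each block of a chunk gets its own checksum, and the chunkserver re-verifies the blocks a read touches before returning data.

import zlib

BLOCK_SIZE = 64 * 1024      # 64 KB checksum blocks, as in the GFS paper

def checksum_blocks(chunk_data):
    """Compute one CRC32 per 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data, checksums, offset, length):
    """Before serving a read, re-check every block the byte range overlaps.
    A mismatch means corruption: report it and read from another replica."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            return False
    return True

data = bytes(200 * 1024)                         # a 200 KB chunk of zeros
sums = checksum_blocks(data)
print(verify_read(data, sums, 70_000, 10_000))   # True: the blocks are intact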

34
Current status

• Two clusters within Google
  – Cluster A: R&D
    • Read and analyze data, write results back to the cluster
    • Much human interaction
    • Short tasks
  – Cluster B: production data processing
    • Long tasks with multi-TB data
    • Seldom any human interaction

35
Implications for Applications

• Applications can use checksums to decide which regions readers should accept
• The primary could detect that a request is an old, failed one being retried and assign it the same serial number
• Damaged secondaries can be eliminated permanently
• If the primary crashes after forwarding a mutation to only some secondaries, the replicas need to sync up
• A read can be served by any replica, including a secondary, and a secondary could be stale

36
Measurements
• Read rates are much higher than write rates
• Both clusters show heavy read activity
• Cluster A supports read rates of up to 750 MB/s; Cluster B up to 1300 MB/s
• The master was not a bottleneck

Cluster                     A          B
Read rate (last minute)     583 MB/s   380 MB/s
Read rate (last hour)       562 MB/s   384 MB/s
Read rate (since restart)   589 MB/s   49 MB/s
Write rate (last minute)    1 MB/s     101 MB/s
Write rate (last hour)      2 MB/s     117 MB/s
Write rate (since restart)  25 MB/s    13 MB/s
Master ops (last minute)    325 Ops/s  533 Ops/s
Master ops (last hour)      381 Ops/s  518 Ops/s
Master ops (since restart)  202 Ops/s  347 Ops/s

37
Implementation –
Snapshot*

• Goals
  – Quickly create branch copies of huge data sets
  – Easily checkpoint the current state
• Copy-on-write technique
  – Metadata for the source file or directory tree is duplicated
  – Reference counts for its chunks are incremented
  – Chunks are copied later, at the first subsequent write
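
A toy illustration of chunk-level copy-on-write (made-up bookkeeping, not GFS's real code): the snapshot duplicates only metadata and bumps reference counts, and a shared chunk is physically copied the first time it is written afterwards.

class ChunkStore:
    """Toy copy-on-write bookkeeping for snapshots (not GFS's real code)."""
    def __init__(self):
        self.next_handle = 1
        self.refcount = {}                  # chunk handle -> number of files referencing it
        self.files = {}                     # path -> list of chunk handles

    def snapshot(self, src, dst):
        # duplicate metadata only and bump reference counts; no data is copied yet
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refcount[h] += 1

    def write(self, path, chunk_idx):
        h = self.files[path][chunk_idx]
        if self.refcount[h] > 1:            # chunk is shared with a snapshot:
            self.refcount[h] -= 1           # copy it now, on first write
            new_h = self.next_handle = self.next_handle + 1
            self.refcount[new_h] = 1
            self.files[path][chunk_idx] = new_h
        # ... the actual mutation would then go to the (possibly new) chunk

store = ChunkStore()
store.files["/data"] = [1]; store.refcount[1] = 1
store.snapshot("/data", "/data.snap")
store.write("/data", 0)
print(store.files["/data"], store.files["/data.snap"])   # [2] [1]: the writer got its own copy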

38
Measurements

• Recovery time (of one chunkserver)
  – 15,000 chunks containing 600 GB are restored in 23.2 minutes (an effective replication rate of about 400 MB/s)

39
Review
• High availability and component failure
  – Fault tolerance, master/chunk replication, HeartBeat, operation log, checkpointing, fast recovery
• TBs of space
  – 100s of chunkservers, 1000s of disks
• Networking
  – Clusters and racks
• Scalability
  – Simplicity with a single master
  – Interaction between the master and chunkservers is minimized

40
Review
• Multi-GB files
  – 64 MB chunks
• Sequential reads
  – Large chunks, cached metadata, load balancing
• Appending writes
  – Atomic record appends
• High sustained bandwidth
  – Data pipelining
  – Chunk replication and placement policies
  – Load balancing

41
Benefits and Limitations

• Simple design with a single master
• Fault tolerance
• Custom designed
• Only viable in a specific environment
• Limited security

42
Conclusion

• Different from previous file systems
• Satisfies the needs of its applications
• Fault tolerant

43
GFS publication:
https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
MIT topic discussion:
https://pdos.csail.mit.edu/6.824/papers/gfs-faq.txt
DFS: https://www.youtube.com/watch?v=xoA5v9AO7S0&t=1s

44
