
Distributed File System

Google File System

1
DFS

• Single-node architecture: the data fits in memory
• Advanced to: data on disk, bringing parts into memory for processing
• Now consider much larger files that need even more processing

2
Split the data into chunks and use multiple disks and CPUs.
A job that would take about 4,000,000 s on one CPU can be done by 1000 CPUs in roughly 4000 s, about an hour.
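
As a rough back-of-the-envelope check (the dataset size and per-disk bandwidth below are illustrative assumptions, not numbers from the slide), this Python sketch shows how 1000 machines turn millions of seconds of sequential scanning into about an hour:

# Back-of-the-envelope scan-time estimate (illustrative numbers only).
DATASET_BYTES = 400e12        # assume ~400 TB of data to scan
DISK_BANDWIDTH = 100e6        # assume ~100 MB/s sequential read per disk
NUM_MACHINES = 1000           # as in the slide

single_node_seconds = DATASET_BYTES / DISK_BANDWIDTH
parallel_seconds = single_node_seconds / NUM_MACHINES

print(f"single node  : {single_node_seconds:,.0f} s (~{single_node_seconds/86400:.0f} days)")
print(f"{NUM_MACHINES} machines: {parallel_seconds:,.0f} s (~{parallel_seconds/3600:.1f} hours)")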

3
4
Challenges

• Nodes fail. Even if a single node fails only once in 3 years (about 1000 days), a cluster of 1M servers sees roughly 1000 failures per day
• Data is not persistent if it is lost whenever a node fails
• Availability is compromised when nodes fail
• The network can be a bottleneck, so data should not be moved around too much
• Complexity of distributed programming

5
Solution:

• A distributed file system (DFS) stores the data, taking care of redundancy and availability
• Examples: HDFS, GFS

6
• Huge data, so sharding is needed
• When sharding over so many machines, a few will always be down, hence replicas are needed
• With replicas, care must be taken to keep them consistent
• Consistency compromises performance

7
The Google File System
Sanjay Ghemawat,
Howard Gobioff,
Shun-Tak Leung
(Google)

8
GFS Motivation

• Need for a scalable DFS
• Large distributed data-intensive applications
• High data-processing needs
• Performance, reliability, scalability and availability
• More than a traditional DFS

9
Assumptions –
Environment

• Commodity hardware
  – inexpensive
• Component failure
  – the norm rather than the exception: application bugs, OS bugs, failures of disks, memory, connectors, networking, power supplies
• TBs of space
  – must support many terabytes of storage

10
Assumptions –
Applications

• Multi-GB files
  – Common
• Workloads
  – Large streaming reads
  – Small random reads
  – Large, sequential writes that append data to a file
  – Multiple clients concurrently appending to one file
• High sustained bandwidth
  – More important than low latency

11
Architecture

• Files are divided into chunks
• Fixed-size chunks (64 MB)
• Each chunk is replicated over several chunkservers; the copies are called replicas
• Each chunk is identified by a globally unique 64-bit chunk handle
• Chunks are stored as plain Linux files

12
Architecture

• Single master
• Multiple chunkservers
  – Grouped into racks
  – Connected through switches
• Multiple clients
• Master/chunkserver coordination
  – HeartBeat messages

13
Architecture

• Contact the single master
• Obtain chunk locations
• Contact one of the chunkservers
• Obtain the data

14
Master
• Metadata
  – Three types
    • File and chunk namespaces (handles): logged
    • Mapping from files to chunks: logged
    • Locations of chunk replicas: not logged. If the master dies and restarts, it asks the chunkservers what they hold, because chunkserver disks may themselves go bad and lose chunks. The master also tracks each chunk's version number (logged), its primary replica, and the lease expiration time.
  – Replicated on multiple remote machines
  – Kept in memory
• Operations
  – Replica placement
  – New chunk and replica creation
  – Load balancing
  – Unused storage reclamation
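
As one way to picture this, here is a minimal Python sketch of how the master's in-memory metadata could be organized (names like ChunkInfo and recover_locations are hypothetical, not GFS's actual code). Only the namespace, file-to-chunk mapping and version numbers would be persisted via the operation log; replica locations are rebuilt by asking chunkservers after a restart:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ChunkInfo:
    handle: int                       # globally unique 64-bit chunk handle
    version: int = 1                  # logged; used to detect stale replicas
    replica_locations: list = field(default_factory=list)  # chunkserver IDs; NOT logged
    primary: Optional[str] = None     # chunkserver currently holding the lease
    lease_expiration: float = 0.0     # when the primary's lease runs out

@dataclass
class MasterMetadata:
    # file & chunk namespaces and the file -> chunk-handle mapping (both logged)
    files: dict = field(default_factory=dict)   # path -> [chunk handle, ...]
    chunks: dict = field(default_factory=dict)  # chunk handle -> ChunkInfo

    def recover_locations(self, reports):
        """On restart, rebuild replica locations from chunkserver reports
        instead of the log, since chunkserver disks may themselves fail."""
        for server_id, handles in reports.items():
            for h in handles:
                if h in self.chunks:
                    self.chunks[h].replica_locations.append(server_id)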
15
Flow

• Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
• It then sends the master a request containing the file name and chunk index.
• The master replies with the corresponding chunk handle and the locations of the replicas. The client caches this information using the file name and chunk index as the key.
• The client then sends a request to one of the replicas, specifying the chunk handle and a byte range within that chunk.
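
A minimal Python sketch of that client-side translation, assuming a hypothetical master.lookup_chunk RPC stub (all names are illustrative, not GFS's real API):

CHUNK_SIZE = 64 * 1024 * 1024          # fixed 64 MB chunks

chunk_cache = {}                       # (file_name, chunk_index) -> (handle, replica list)

def locate(file_name, byte_offset, master):
    """Translate (file name, byte offset) into a chunk handle plus replica locations,
    consulting the master only on a cache miss."""
    chunk_index = byte_offset // CHUNK_SIZE          # which chunk within the file
    offset_in_chunk = byte_offset % CHUNK_SIZE       # the byte range starts here inside the chunk
    key = (file_name, chunk_index)
    if key not in chunk_cache:
        # one round trip to the master; the reply is cached for repeated reads
        handle, replicas = master.lookup_chunk(file_name, chunk_index)
        chunk_cache[key] = (handle, replicas)
    handle, replicas = chunk_cache[key]
    return handle, replicas, offset_in_chunk

# The client would then ask one of `replicas` for (handle, offset_in_chunk, length).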

16
Operations
Replica placement:
Chunk replicas are spread across racks so that chunks survive even if an entire rack is damaged or offline.

Unused storage reclamation:
A file deleted by the application is marked and renamed to a hidden file. Files that have been hidden for three days are removed during the master's regular scan. Similarly, orphaned chunks (those not reachable from any file) are removed.

In the HeartBeat messages regularly exchanged with the master, each chunkserver reports the chunks it holds, and the master replies with the identities of all chunks that are no longer present in the master's metadata. The chunkserver is free to delete its replicas of such chunks.
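
A tiny sketch of that exchange (hypothetical names, not the real protocol): the master diffs the handles a chunkserver reports against its own metadata and replies with the handles the chunkserver may delete.

def heartbeat_reply(master_chunks, reported_handles):
    """master_chunks: set of chunk handles the master still knows about.
    reported_handles: handles this chunkserver says it is storing.
    Returns the orphaned handles the chunkserver is free to delete."""
    return set(reported_handles) - set(master_chunks)

# Chunkserver side: lazily reclaim space for anything the master no longer tracks.
def process_reply(local_store, deletable_handles):
    for handle in deletable_handles:
        local_store.pop(handle, None)   # delete the local replica file

# Example:
master_chunks = {101, 102, 103}
print(heartbeat_reply(master_chunks, [101, 104, 105]))   # {104, 105} are orphaned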
17
Operations:
New chunk and replica creation:

New chunks are placed on chunkservers with below-average disk space utilization, because chunks are typically created when a write is about to follow. A sketch of such a placement heuristic follows below.

Re-replication is triggered when the number of replicas drops below the target (typically 3): a replica becomes unavailable, is reported to have errors or be corrupted, or the replication requirement goes up.

Which chunks to re-replicate first depends on how many replicas are left, whether the file is live, whether the chunk is blocking a client pipeline, etc.

The master selects a new chunkserver, balancing load and choosing a different rack, and asks it to copy from a valid replica.

The master also rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk space and load balancing.
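
A toy sketch of such a placement heuristic (field names and weighting are illustrative, not the master's actual policy): prefer emptier chunkservers, avoid servers with many recent creations, and spread the chosen replicas across racks.

def place_replicas(servers, num_replicas=3):
    """servers: list of dicts like
       {"id": "cs1", "rack": "r1", "disk_used": 0.40, "recent_creations": 2}.
    Returns the ids of chunkservers chosen for a new chunk (toy heuristic)."""
    # prefer emptier disks, then servers that have not just absorbed many creations
    candidates = sorted(servers, key=lambda s: (s["disk_used"], s["recent_creations"]))
    chosen, racks_used = [], set()
    # first pass: insist on rack diversity
    for s in candidates:
        if len(chosen) < num_replicas and s["rack"] not in racks_used:
            chosen.append(s["id"])
            racks_used.add(s["rack"])
    # second pass: fill any remaining slots even if a rack repeats
    for s in candidates:
        if len(chosen) < num_replicas and s["id"] not in chosen:
            chosen.append(s["id"])
    return chosen

servers = [
    {"id": "cs1", "rack": "r1", "disk_used": 0.80, "recent_creations": 0},
    {"id": "cs2", "rack": "r1", "disk_used": 0.30, "recent_creations": 1},
    {"id": "cs3", "rack": "r2", "disk_used": 0.35, "recent_creations": 0},
    {"id": "cs4", "rack": "r3", "disk_used": 0.90, "recent_creations": 5},
]
print(place_replicas(servers))   # ['cs2', 'cs3', 'cs4']: three different racks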

18
Read

• The client asks the master with a file name and offset
• For each chunk, the master responds with the chunk handle and the chunkservers holding its replicas
• The client caches all this information for repeated use
• The client contacts one of the chunkservers to get the data

19
Implementation –
Consistency Model

• Relaxed consistency model
• Two types of mutations
  – Writes
    • Cause data to be written at an application-specified file offset
  – Record appends
    • Operations that append data to a file
    • Cause data to be appended atomically at least once
    • Offset chosen by GFS, not by the client
• States of a file region after a mutation
  – Consistent
    • All clients see the same data, regardless of which replica they read from
  – Defined
    • Consistent, and all clients see what the mutation wrote in its entirety
  – Undefined
    • Consistent, but it may not reflect what any single mutation has written
  – Inconsistent
    • Clients see different data at different times

20
Write

• The client asks the master for replica information, which it caches. The master also tells it which replica is the primary and which are secondaries.
• If there is no current primary, the master finds the up-to-date replicas (by version number) and makes one of them primary, increments the version number of all replicas, and tells the client which servers are the primary and the secondaries.
• The master informs the primary and secondaries of the new version number, then records the updated version number itself.
• The primary picks the offset and tells the secondaries to apply the mutation at the same location.
• The client pushes the data to the replicas, passing it first to the closest one, which forwards it on; each replica buffers the data in a cache.
• Once all replicas acknowledge receiving the data, the client sends a write request to the primary. The primary assigns serial numbers to all write requests so they execute in order, applies them itself, and forwards the requests to the secondaries.

21
Write

• Each secondary executes the writes in serial order and confirms back to the primary.
• If the primary gets a yes from every secondary, it returns success to the client. Otherwise it returns failure to the client (the data region is inconsistent) and the client retries the procedure.
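
A condensed Python sketch of the control flow above from the primary's point of view (class and method names are made up; the data push and failure handling are elided): the primary assigns serial numbers so that every replica applies concurrent mutations in the same order.

import itertools

class PrimaryReplica:
    """Toy model of a primary's role in the write path (not GFS's real code)."""
    def __init__(self, secondaries):
        self.secondaries = secondaries          # secondary replica objects
        self.serial = itertools.count(1)        # mutation order is decided here
        self.applied = []                       # (serial_no, data) applied locally

    def write(self, data):
        serial_no = next(self.serial)           # pick the order for this mutation
        self.applied.append((serial_no, data))  # apply locally first
        acks = [s.apply(serial_no, data) for s in self.secondaries]
        # success only if every secondary applied it; otherwise the client retries
        return all(acks)

class SecondaryReplica:
    def __init__(self):
        self.applied = []
    def apply(self, serial_no, data):
        self.applied.append((serial_no, data))  # apply strictly in serial order
        return True

secondaries = [SecondaryReplica(), SecondaryReplica()]
primary = PrimaryReplica(secondaries)
print(primary.write(b"record-1"))   # True: all replicas applied it in the same order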

22
What can go wrong

• Serial writes: defined.
  – If the primary succeeded in writing but a secondary did not: inconsistent data.
• Concurrent writes: if everyone says yes, the region is consistent, but it could still be undefined, because:
  – Each concurrent write gets its own start index, but the streamed data, though written serially at each replica, may overwrite data from other concurrent writes, making the region consistent but undefined.
  – Large appends, or data that straddles a chunk boundary, may make this happen.

23
• If a mutation is not interleaved with another concurrent mutation, the data is defined.
• If a mutation is interleaved with another concurrent mutation: mutation 1 was given start index x1 and mutation 2 was given index x2, so mutation 1 may not have had the space to write out everything it wanted before mutation 2's region began. In such a case we get undefined data, or mingled fragments.
24
• Can have duplicates (content repeated)
• Can have blank (padded) regions
• Can have data in only two of the three replicas if the client dies mid-operation

25
Implementation –
Leases and Mutation Order

• The master uses leases to maintain a consistent mutation order among replicas
• The primary is the chunkserver that has been granted the chunk lease
• All other chunkservers holding replicas are secondaries
• The primary defines a serial order among mutations
• All secondaries follow this order

26
Implementation –
Writes

Mutation Order
• Applying mutations in the same order keeps replicas identical
• A file region may still end up containing mingled fragments from different clients (consistent but undefined)

27
Implementation –
Atomic Appends

• The client specifies only the data
• Similar to writes
  – Mutation order is determined by the primary
  – All secondaries use the same mutation order
• GFS appends the data to the file at least once atomically
  – The chunk is padded if appending the record would exceed the maximum chunk size -> padding
  – If a record append fails at any replica, the client retries the operation -> record duplicates
  – The file region may be defined but interspersed with inconsistent regions

28
When data does not fit

• When the record won't fit in the last chunk:
  – The primary pads the current chunk to its end
  – The primary instructs the other replicas to do the same
  – The primary replies to the client: "retry on the next chunk"
• If a record append fails at any replica, the client retries the operation
  – So replicas of the same chunk may contain different data, even duplicates of all or part of the record data
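
A toy sketch of the primary's decision for a record append (illustrative names; this is not GFS's implementation): if the record fits in the last chunk the primary chooses the offset, otherwise it pads the chunk and asks the client to retry on the next one.

CHUNK_SIZE = 64 * 1024 * 1024

def primary_record_append(chunk, record):
    """chunk: bytearray holding the current contents of the last chunk.
    Returns ('appended', offset) or ('retry_next_chunk', None)."""
    if len(chunk) + len(record) <= CHUNK_SIZE:
        offset = len(chunk)             # the primary chooses the offset, not the client
        chunk.extend(record)            # secondaries are told to append at the same offset
        return "appended", offset
    # record will not fit: pad this chunk to its end on every replica,
    # then tell the client to retry the append on the next chunk
    chunk.extend(b"\x00" * (CHUNK_SIZE - len(chunk)))
    return "retry_next_chunk", None

chunk = bytearray(b"existing data")
print(primary_record_append(chunk, b"new record"))   # ('appended', 13)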

29
Other Issues –
Data flow

• Data is pushed in a pipelined fashion
• Data transfer is pipelined over TCP connections
• Each machine forwards the data to the "closest" machine
• Benefits
  – Avoids bottlenecks and minimizes latency
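
A minimal sketch of the pipelined push (all names made up): each machine stores the data and forwards it to the next "closest" machine in the chain, rather than the client sending a separate copy to every replica.

def push_pipelined(data, chain, stores):
    """chain: replica ids ordered by network 'closeness' from the client.
    Each machine stores the data and immediately forwards it to the next one,
    so total time is roughly one transfer plus per-hop latency (toy model)."""
    if not chain:
        return
    first, rest = chain[0], chain[1:]
    stores[first].append(data)          # buffer the data locally (an LRU cache in real GFS)
    push_pipelined(data, rest, stores)  # forward to the next-closest machine

stores = {"cs1": [], "cs2": [], "cs3": []}
push_pipelined(b"chunk-data", ["cs1", "cs2", "cs3"], stores)
print(all(stores[s] == [b"chunk-data"] for s in stores))   # True: every replica has the data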

30
Other Issues –
Garbage Collection

• Deleted files
  – The deletion operation is logged
  – The file is renamed to a hidden name, and may later be removed or recovered
• Orphaned chunks (unreachable chunks)
  – Identified and removed during a regular scan of the chunk namespace
• Stale replicas
  – Detected using chunk version numbers

31
Implementation –
Operation Log

• Contains a historical record of metadata changes
• Replicated on multiple remote machines
• Kept small by creating checkpoints
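
A small sketch of the log-plus-checkpoint idea (hypothetical structure, not GFS's on-disk format): metadata mutations are appended to a log, and a periodic checkpoint of the full state lets the log be truncated so that recovery replays only a short tail.

class OperationLog:
    """Toy append-only metadata log with checkpointing (not GFS's real format)."""
    def __init__(self):
        self.checkpoint = {}     # snapshot of metadata state at the last checkpoint
        self.records = []        # logged mutations since that checkpoint

    def append(self, op, path, value=None):
        # in a real system the record is flushed to local disk and remote replicas
        self.records.append({"op": op, "path": path, "value": value})

    def make_checkpoint(self, state):
        self.checkpoint = dict(state)   # persist the full state...
        self.records = []               # ...so the log stays small

    def recover(self):
        state = dict(self.checkpoint)   # start from the checkpoint
        for r in self.records:          # replay only the short tail of the log
            if r["op"] == "create":
                state[r["path"]] = r["value"]
            elif r["op"] == "delete":
                state.pop(r["path"], None)
        return state

log = OperationLog()
log.append("create", "/big/file", [101, 102])
print(log.recover())    # {'/big/file': [101, 102]}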

32
Other Issues –
Replica Operations

• Creation
  – Disk space utilization
  – Number of recent creations on each chunkserver
  – Spread across many racks
• Re-replication
  – Prioritized by how far a chunk is from its replication goal
  – The highest-priority chunk is cloned first by copying the chunk data directly from an existing replica
• Rebalancing
  – Performed periodically

33
Other Issues –
Fault Tolerance and Diagnosis

• Fast recovery
  – Operation log
  – Checkpointing
• Chunk replication
  – Each chunk is replicated on multiple chunkservers on different racks
• Master replication
  – The operation log and checkpoints are replicated on multiple machines
• Data integrity
  – Checksumming to detect corruption of stored data
  – Each chunkserver independently verifies the integrity of its own replicas (see the sketch after this list)
• Diagnostic logs
  – Chunkservers going up and down
  – RPC requests and replies
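
A sketch of per-block checksumming as used for data integrity (the 64 KB block size follows the GFS paper; the function names are my own): each block of a chunk gets its own checksum, and the chunkserver re-verifies the blocks a read touches before returning data.

import zlib

BLOCK_SIZE = 64 * 1024      # 64 KB checksum blocks, as in the GFS paper

def checksum_blocks(chunk_data):
    """Compute one CRC32 per 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data, checksums, offset, length):
    """Before serving a read, re-check every block the byte range overlaps.
    A mismatch means corruption: report it and read from another replica."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            return False
    return True

data = bytes(200 * 1024)                         # a 200 KB chunk of zeros
sums = checksum_blocks(data)
print(verify_read(data, sums, 70_000, 10_000))   # True: the blocks are intact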

34
Current status

• Two clusters within Google
  – Cluster A: R&D
    • Read and analyze data, write results back to the cluster
    • Much human interaction
    • Short tasks
  – Cluster B: production data processing
    • Long tasks with multi-TB data
    • Seldom any human interaction

35
Implications for Applications

• Applications can use checksums to decide which regions readers should accept
• The primary could detect that a request is an old, failed one being retried and assign it the same serial number
• Damaged secondaries can be eliminated permanently
• If the primary crashes after forwarding a mutation to only some secondaries, the replicas need to sync up
• A read can be served by any replica, including a secondary, and a secondary could be stale

36
Measurements
• Read rates are much higher than write rates
• Both clusters show heavy read activity
• Cluster A supports read rates of up to 750 MB/s; Cluster B up to 1300 MB/s
• The master was not a bottleneck

Cluster                     A          B
Read rate (last minute)     583 MB/s   380 MB/s
Read rate (last hour)       562 MB/s   384 MB/s
Read rate (since restart)   589 MB/s   49 MB/s
Write rate (last minute)    1 MB/s     101 MB/s
Write rate (last hour)      2 MB/s     117 MB/s
Write rate (since restart)  25 MB/s    13 MB/s
Master ops (last minute)    325 Ops/s  533 Ops/s
Master ops (last hour)      381 Ops/s  518 Ops/s
Master ops (since restart)  202 Ops/s  347 Ops/s

37
Implementation –
Snapshot*

• Goals
  – Quickly create branch copies of huge data sets
  – Easily checkpoint the current state
• Copy-on-write technique
  – Metadata for the source file or directory tree is duplicated
  – Reference counts for its chunks are incremented
  – Chunks are copied later, at the first subsequent write
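
A toy illustration of chunk-level copy-on-write (made-up bookkeeping, not GFS's real code): the snapshot duplicates only metadata and bumps reference counts, and a shared chunk is physically copied the first time it is written afterwards.

class ChunkStore:
    """Toy copy-on-write bookkeeping for snapshots (not GFS's real code)."""
    def __init__(self):
        self.next_handle = 1
        self.refcount = {}                  # chunk handle -> number of files referencing it
        self.files = {}                     # path -> list of chunk handles

    def snapshot(self, src, dst):
        # duplicate metadata only and bump reference counts; no data is copied yet
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refcount[h] += 1

    def write(self, path, chunk_idx):
        h = self.files[path][chunk_idx]
        if self.refcount[h] > 1:            # chunk is shared with a snapshot:
            self.refcount[h] -= 1           # copy it now, on first write
            new_h = self.next_handle = self.next_handle + 1
            self.refcount[new_h] = 1
            self.files[path][chunk_idx] = new_h
        # ... the actual mutation would then go to the (possibly new) chunk

store = ChunkStore()
store.files["/data"] = [1]; store.refcount[1] = 1
store.snapshot("/data", "/data.snap")
store.write("/data", 0)
print(store.files["/data"], store.files["/data.snap"])   # [2] [1]: the writer got its own copy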

38
Measurements

• Recovery time (of one chunkserver)
  – 15,000 chunks containing 600 GB are restored in 23.2 minutes (an effective replication rate of about 400 MB/s)

39
Review
• High availability and component failure
  – Fault tolerance, master/chunk replication, HeartBeat, operation log, checkpointing, fast recovery
• TBs of space
  – 100s of chunkservers, 1000s of disks
• Networking
  – Clusters and racks
• Scalability
  – Simplicity with a single master
  – Interaction between the master and chunkservers is minimized

40
Review
• Multi-GB files
  – 64 MB chunks
• Sequential reads
  – Large chunks, cached metadata, load balancing
• Appending writes
  – Atomic record appends
• High sustained bandwidth
  – Data pipelining
  – Chunk replication and placement policies
  – Load balancing

41
Benefits and Limitations

• Simple design with a single master
• Fault tolerance
• Custom designed
• Only viable in a specific environment
• Limited security

42
Conclusion

• Different from previous file systems
• Satisfies the needs of its applications
• Fault tolerant

43
GFS publication:
https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
MIT topic discussion:
https://pdos.csail.mit.edu/6.824/papers/gfs-faq.txt
DFS: https://www.youtube.com/watch?v=xoA5v9AO7S0&t=1s

44
