
GFS

• Google File System (GFS) is designed to meet the rapidly growing demands of Google’s data
processing needs.
• GFS shares many of the same goals as previous distributed file systems such as performance,
scalability, reliability, and availability.
• The Google File System (GFS) is a proprietary DFS developed by Google. It is designed to
provide efficient, reliable access to data using large clusters of commodity hardware.
• A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients.
• Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine.
• Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation (see the sketch after this list).
• Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers.
• By default, each chunk is replicated three times.
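
The structure above can be captured in a short sketch. The snippet below is illustrative only: Chunk and allocate_chunk are made-up names, and the real master guarantees global uniqueness of chunk handles rather than drawing random bits; it simply models a 64-bit chunk handle and the default replication factor of three.

```python
# Illustrative sketch only; Chunk and allocate_chunk are hypothetical names.
import secrets
from dataclasses import dataclass, field
from typing import List

REPLICATION_FACTOR = 3  # each chunk is replicated three times by default


@dataclass
class Chunk:
    handle: int                                         # immutable, globally unique 64-bit chunk handle
    replicas: List[str] = field(default_factory=list)   # addresses of chunkservers holding replicas


def allocate_chunk(chunkservers: List[str]) -> Chunk:
    """Assign a 64-bit handle at chunk creation and choose replica holders.
    (Random bits stand in for the master's real uniqueness guarantee.)"""
    return Chunk(handle=secrets.randbits(64),
                 replicas=chunkservers[:REPLICATION_FACTOR])
```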
GFS
• Master:
• The master maintains all file system metadata. This includes the namespace, access
control information, the mapping from files to chunks, and the current locations of
chunks.
• It also controls system-wide activities such as chunk lease management, garbage
collection of orphaned chunks, and chunk migration between chunkservers.
• The master periodically communicates with each chunkserver in HeartBeat messages to
give it instructions and collect its state.
• GFS client code linked into each application implements the file system API and
communicates with the master and chunkservers to read or write data on behalf of the
application.
• Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers.
• Neither the client nor the chunkserver caches file data.
• Most applications stream through huge files or have working sets too large to be cached.
• Chunkservers need not cache file data because chunks are stored as local files, so Linux's buffer cache already keeps frequently accessed data in memory.
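
As a rough illustration of the HeartBeat exchange, the sketch below uses stand-in Chunkserver and Master classes; the names and methods are assumptions, not GFS's actual interfaces. The master polls each chunkserver, passing instructions and collecting the chunk handles it currently stores.

```python
# Stand-in classes to illustrate the HeartBeat exchange; not GFS's real RPCs.
from typing import Dict, List


class Chunkserver:
    def __init__(self, address: str, chunk_handles: List[int]):
        self.address = address
        self._chunk_handles = chunk_handles  # chunks stored on this server's local disks

    def heartbeat(self, instructions: List[str]) -> List[int]:
        # Apply the master's instructions (e.g. delete orphaned chunks) and
        # report which chunk handles this server currently stores.
        return list(self._chunk_handles)


class Master:
    def __init__(self):
        self.chunk_locations: Dict[int, List[str]] = {}  # handle -> chunkserver addresses

    def heartbeat_round(self, servers: List[Chunkserver]) -> None:
        # Periodically poll every chunkserver and refresh the location map.
        for server in servers:
            for handle in server.heartbeat(instructions=[]):
                locations = self.chunk_locations.setdefault(handle, [])
                if server.address not in locations:
                    locations.append(server.address)
```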
Chunk Size

• Chunk size is one of the key design parameters.
• The default chunk size is 64 MB, which is much larger than typical file system block sizes.
• Each chunk replica is stored as a plain Linux file on a chunkserver.
• Advantages of a large chunk size:
• It reduces clients’ need to interact with the master, because reads and writes on the same chunk require only one initial request to the master for chunk location information (see the sketch after this list).
• Since a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver.
• It reduces the size of the metadata stored on the master.
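
The first advantage follows from simple arithmetic. In the sketch below (chunk_indices is a hypothetical helper), a byte range is mapped onto chunk indices with the 64 MB chunk size; even a very large sequential read touches only a few chunks and therefore needs only a few master lookups.

```python
# Hypothetical helper showing the offset-to-chunk-index arithmetic.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB default chunk size


def chunk_indices(offset: int, length: int) -> range:
    """Chunk indices touched by the byte range [offset, offset + length)."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)


# Example: a 200 MB sequential read starting at byte 0 touches only chunks
# 0..3, so the client needs at most four location lookups from the master.
assert list(chunk_indices(0, 200 * 1024 * 1024)) == [0, 1, 2, 3]
```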
Metadata

• The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas.
• All metadata is kept in the master’s memory.
• The first two types (namespaces and the file-to-chunk mapping) are also kept persistent by logging mutations to an operation log, whereas chunk location information is not stored persistently.
• Instead, the master asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
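
A minimal sketch of the master's in-memory state (the names are illustrative, not GFS's actual data structures), showing which of the three metadata types are persisted and which are rebuilt from chunkserver reports:

```python
# Illustrative names only; not GFS's actual data structures.
from typing import Dict, List


class MasterMetadata:
    def __init__(self):
        # 1. File and chunk namespaces: kept in memory and persisted.
        self.namespace: Dict[str, dict] = {}             # path -> file attributes
        # 2. File-to-chunk mapping: kept in memory and persisted.
        self.file_chunks: Dict[str, List[int]] = {}      # path -> ordered chunk handles
        # 3. Chunk replica locations: NOT persisted; rebuilt by asking each
        #    chunkserver at master startup and whenever one joins the cluster.
        self.chunk_locations: Dict[int, List[str]] = {}  # handle -> chunkserver addresses
```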
Read Algorithm
1. Application originates the read request.
2. GFS client translates the request from (filename, byte range) -> (filename, chunk index), and sends it to the master.
3. Master responds with the chunk handle and replica locations (i.e., the chunkservers where the replicas are stored).
4. Client picks a location and sends the (chunk handle, byte range) request to
that location.
5. Chunkserver sends requested data to the client.
6. Client forwards the data to the application.
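
The steps above condense into a small client-side sketch. The master.lookup and chunkserver.read calls are assumed stand-ins for the real RPCs, and the requested byte range is assumed to fall within a single chunk.

```python
# Client-side sketch of the read path; master/chunkserver methods are stand-ins.
CHUNK_SIZE = 64 * 1024 * 1024


def gfs_read(master, filename: str, offset: int, length: int) -> bytes:
    # Step 2: translate (filename, byte range) -> (filename, chunk index).
    chunk_index = offset // CHUNK_SIZE
    # Step 3: master returns the chunk handle and replica locations.
    handle, replicas = master.lookup(filename, chunk_index)
    # Step 4: pick one replica and send it the (chunk handle, byte range) request.
    chunkserver = replicas[0]
    # Steps 5-6: the chunkserver returns the data, which goes back to the application.
    return chunkserver.read(handle, offset % CHUNK_SIZE, length)
```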
Write Algorithm

1. Application originates the write request.
2. GFS client translates the request from (filename, data) -> (filename, chunk index), and sends it to the master.
3. Master responds with chunk handle and (primary + secondary) replica locations.
4. Client pushes write data to all locations. Data is stored in chunkservers’ internal buffers.
5. Client sends write command to primary.
6. Primary determines serial order for data instances stored in its buffer and writes the
instances in that order to the chunk.
7. Primary sends serial order to the secondaries and tells them to perform the write.
8. Secondaries respond to the primary.
9. Primary responds back to client.
• Note: If the write fails at one of the chunkservers, the client is informed and retries the write.
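
The write path can be sketched the same way. The master and replica objects and their methods (lookup_for_write, push_data, write) are assumed stand-ins rather than the actual GFS interfaces, and failure handling and retries are omitted.

```python
# Client-side sketch of the write path; all objects and methods are stand-ins.
CHUNK_SIZE = 64 * 1024 * 1024


def gfs_write(master, filename: str, offset: int, data: bytes) -> bool:
    # Step 2: translate (filename, data) -> (filename, chunk index).
    chunk_index = offset // CHUNK_SIZE
    # Step 3: master returns the chunk handle plus primary and secondary replicas.
    handle, primary, secondaries = master.lookup_for_write(filename, chunk_index)
    # Step 4: push the data to every replica; each buffers it internally.
    for replica in [primary, *secondaries]:
        replica.push_data(handle, data)
    # Steps 5-8: send the write command to the primary, which picks a serial
    # order, applies it to its chunk, forwards the order to the secondaries,
    # and collects their acknowledgements.
    ok = primary.write(handle, offset % CHUNK_SIZE)
    # Step 9: the primary's reply; on failure the client is informed and retries.
    return ok
```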
