Demands of Google's Data Processing Needs. Performance, Scalability, Reliability, and Availability. A Proprietary DFS
• Google File System (GFS) is designed to meet the rapidly growing demands of Google’s data
processing needs.
• GFS shares many of the same goals as previous distributed file systems, such as performance,
scalability, reliability, and availability.
• The Google File System (GFS) is a proprietary DFS developed by Google. It is designed to
provide efficient, reliable access to data using large clusters of commodity hardware.
• A GFS cluster consists of a single master and multiple chunkservers, and is accessed by
multiple clients.
• Each of these is typically a commodity Linux machine running a user-level server process. It
is easy to run both a chunkserver and a client on the same machine.
• Files are divided into fixed-size chunks. Each chunk is identified by an immutable and
globally unique 64-bit chunk handle assigned by the master at the time of chunk creation
(chunk addressing is sketched after this list).
• Chunkservers store chunks on local disks as Linux files and read or write chunk data
specified by a chunk handle and byte range. For reliability, each chunk is replicated on
multiple chunkservers.
• By default, each chunk is replicated three times.
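To make the chunk abstraction concrete, here is a minimal sketch of how a client would translate a file byte offset into a chunk index and an offset within that chunk, assuming the 64 MB default chunk size from the GFS paper; the constant and function names are illustrative, not from GFS itself.

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the default chunk size in GFS

    def locate(byte_offset):
        """Translate a file byte offset into (chunk index, offset within chunk).

        The client does this translation locally, then asks the master which
        chunk handle and replica locations correspond to the chunk index.
        """
        return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

    # Example: byte 200,000,000 of a file falls in chunk 2 (the third chunk).
    print(locate(200_000_000))  # (2, 65782272)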
GFS
• Master:
• The master maintains all file system metadata. This includes the namespace, access
control information, the mapping from files to chunks, and the current locations of
chunks.
• It also controls system-wide activities such as chunk lease management, garbage
collection of orphaned chunks, and chunk migration between chunkservers.
• The master periodically communicates with each chunkserver in HeartBeat messages to
give it instructions and collect its state (a HeartBeat sketch follows this list).
• GFS client code linked into each application implements the file system API and
communicates with the master and chunkservers to read or write data on behalf of the
application.
• Clients interact with the master for metadata operations, but all data-bearing
communication goes directly to the chunkservers.
• Neither the client nor the chunkserver caches file data.
• Most applications stream through huge files or have working sets too large to be cached,
so client caches would offer little benefit.
• Chunkservers need not cache file data because chunks are stored as local Linux files, and
Linux's own buffer cache already keeps frequently accessed data in memory.
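As a rough picture of the HeartBeat exchange mentioned earlier, the sketch below shows a chunkserver scanning its local chunk files, reporting the handles it holds, and applying whatever instructions the master returns. The directory layout, hex file naming, and master_stub RPC are assumptions for illustration, not the actual GFS wire protocol.

    import os

    CHUNK_DIR = "/var/gfs/chunks"  # hypothetical local directory of chunk files

    def chunk_report():
        """List the 64-bit handles of all chunks held on this chunkserver.

        Assumes each chunk is a plain Linux file named by its handle in hex,
        e.g. "00000000deadbeef" -- an illustrative layout, not GFS's actual one.
        """
        return [int(name, 16) for name in os.listdir(CHUNK_DIR)]

    def heartbeat(master_stub):
        """One HeartBeat round: report held chunks, apply master instructions."""
        instructions = master_stub.heartbeat(held_chunks=chunk_report())
        for instruction in instructions:  # e.g. delete orphaned chunk, re-replicate
            instruction.apply()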
Metadata
• The master stores three major types of metadata: the file and
chunk namespaces, the mapping from files to chunks, and the locations of
each chunk's replicas.
• All metadata is kept in the master’s memory.
• The first two types (namespaces and the file-to-chunk mapping) are also kept
persistent by logging mutations to an operation log, whereas chunk location
information is not stored persistently.
• Instead, the master asks each chunkserver about its chunks at master startup and
whenever a chunkserver joins the cluster.
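One way to picture the master's three metadata tables is the sketch below: in-memory structures for the namespaces and the file-to-chunk mapping (the two persisted types), and a chunk-location table that is never persisted but rebuilt from chunkserver reports. All class and method names here are illustrative, not taken from the GFS implementation.

    class MasterMetadata:
        """In-memory metadata tables held by the single GFS master (a sketch)."""

        def __init__(self):
            # 1. File and chunk namespaces -- persisted via the operation log.
            self.namespace = set()
            # 2. File -> ordered list of chunk handles -- also persisted.
            self.file_chunks = {}
            # 3. Chunk handle -> replica locations. Deliberately not persisted:
            #    rebuilt from chunkserver reports at startup and refreshed by
            #    HeartBeat messages.
            self.chunk_locations = {}

        def on_chunk_report(self, server, handles):
            """Handle a report sent at master startup or when a server joins."""
            for handle in handles:
                servers = self.chunk_locations.setdefault(handle, [])
                if server not in servers:
                    servers.append(server)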
Read Algorithm
1. Application originates the read request.
2. GFS client translates the request from (filename, byte range) to (filename,
chunk index) and sends it to the master.
3. Master responds with chunk handle and replica locations (i.e. chunkservers
where the replicas are stored).
4. Client picks a location and sends the (chunk handle, byte range) request to
that location.
5. Chunkserver sends requested data to the client.
6. Client forwards the data to the application.
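Steps 1 through 6 map naturally onto a small client-side routine. The sketch below assumes hypothetical RPC stubs (master_stub.find_chunk and a chunkserver read() call) and the 64 MB default chunk size, and handles only a read that falls within a single chunk.

    CHUNK_SIZE = 64 * 1024 * 1024  # GFS default chunk size

    def pick_replica(locations):
        """Placeholder replica selection; a real client prefers a nearby server."""
        return locations[0]

    def gfs_read(master_stub, filename, offset, length):
        """Client-side read path following steps 1-6 above (hypothetical stubs)."""
        # Steps 1-2: translate (filename, byte range) -> (filename, chunk index).
        chunk_index = offset // CHUNK_SIZE
        # Step 3: the master replies with the chunk handle and replica locations.
        handle, locations = master_stub.find_chunk(filename, chunk_index)
        # Step 4: pick one replica and send it the (chunk handle, byte range).
        chunkserver = pick_replica(locations)
        # Steps 5-6: the chunkserver returns the data to hand to the application.
        return chunkserver.read(handle, offset % CHUNK_SIZE, length)

Note that the master never touches file data: it answers one small metadata lookup, and the bulk transfer in step 5 goes directly from a chunkserver to the client.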
Write Algorithm