
St. Francis Institute of Technology
SV Road, Borivali (West), Mumbai 400103

Department of Computer Engineering

Academic Year: 2023-2024 Semester: VIII


Subject: Distributed Computing Class / Division: BE/CMPN A2
Name: Vedant Pednekar Roll Number: 36

Experiment No.: 09

Aim: Case Study: Distributed File System

Pre-requisites: Distributed File System

Theory:
Give an overview of the Distributed File System and explain any one case study of DFS.

A Distributed File System (DFS) stores files across multiple machines in a network while presenting clients with a single, unified view, as if the files were local. It aims to provide location transparency, scalability, and fault tolerance, typically by replicating data across servers and coordinating access through dedicated metadata services. One influential case study of a DFS is the Google File System, described below.
1. Google File System
Google File System (GFS) is a distributed file system designed by Google to meet the
storage requirements of its large, data-intensive applications. Introduced in a seminal
paper by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung in 2003, GFS adopts
a chunk-based architecture, breaking files into fixed-size chunks and employing a
master-chunk server model. The master server maintains metadata and control, while
multiple chunk servers store data with fault tolerance achieved through replication. GFS
prioritizes scalability, fault tolerance, and high throughput, making it suitable for
large-scale data processing. Operations involve file creation, read/write processes, and
chunk migration, with the system designed to gracefully handle failures. Despite potential
concerns such as a single point of failure and limited POSIX compliance, GFS has
significantly influenced the development of distributed file systems, leaving an indelible
mark on the storage solutions landscape and serving as the backbone for Google's
data-intensive applications.

Key Features of GFS:

● Scalability: GFS is designed to scale horizontally, allowing it to accommodate the
ever-growing volume of data and increasing user demands. This scalability is achieved by
adding more commodity hardware to the system.
● Fault Tolerance: GFS ensures fault tolerance through the replication of data across multiple
chunk servers. Each chunk, a fixed-size unit of data, is replicated on different servers,
minimizing the impact of hardware failures and enhancing data reliability.
● High Throughput: The system is optimized for high-throughput access to large datasets.
This is particularly important for data-intensive applications that require efficient processing
of vast amounts of data, and GFS achieves this through its design and operation.
● Streaming Writes: GFS is well-suited for applications with significant write-intensive
workloads, such as those involving large-scale log processing. It efficiently handles streaming
writes, contributing to its performance in scenarios where data is continuously generated.
● Chunk-Based Architecture: GFS organizes data into fixed-size chunks (typically 64 MB).
This chunk-based architecture facilitates efficient storage and retrieval of data and simplifies
the management of large datasets (a short sketch of the offset-to-chunk mapping appears after this list).
● Master-Chunk Server Model: GFS follows a master-chunk server model, where a central
master server oversees the metadata and control of the file system, while multiple chunk
servers store and manage the actual data. This separation of responsibilities contributes to the
system's scalability and manageability.
● Relaxed Consistency Model: GFS employs a relaxed consistency model, prioritizing high
throughput and availability over strict consistency. This allows for quicker data access and
manipulation, especially in scenarios involving concurrent read and write operations.
● Data Migration and Balancing: The system supports the migration of data chunks between
servers for load balancing and optimization purposes. The master server manages this
process, ensuring that data is efficiently distributed across the available resources.
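
To make the chunk-based layout concrete, below is a minimal sketch of how a byte offset within a file maps to a chunk index and an offset inside that chunk. The 64 MB chunk size matches the GFS default; the function and variable names are illustrative assumptions, not Google's code.

# Minimal sketch of GFS-style offset-to-chunk arithmetic (illustrative only).

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS default chunk size

def locate(offset: int) -> tuple[int, int]:
    """Map a byte offset in a file to (chunk_index, offset_within_chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# Example: byte 200,000,000 of a file lands in the third chunk (index 2),
# 65,782,272 bytes into that chunk.
chunk_index, chunk_offset = locate(200_000_000)
print(chunk_index, chunk_offset)  # -> 2 65782272

The client performs this arithmetic itself before contacting the master, which helps keep the master out of the data path.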

GFS Design Requirements

● High Fault Tolerance: In the Google File System (GFS), high fault tolerance is a
fundamental design principle, motivated by the acknowledgement that hardware failures are
inevitable in large-scale distributed systems. GFS achieves fault tolerance through data
replication: each chunk of data is replicated across multiple chunk servers. In the event of a
hardware failure or node crash, the system seamlessly redirects operations to the available
replicas, ensuring continuous data accessibility and system functionality. This design choice
is crucial for maintaining the integrity and reliability of data, safeguarding against
disruptions caused by hardware or server failures (a minimal replica-placement sketch appears after this list).
● High Performance: GFS is engineered to deliver high-performance access to large datasets,
aligning with the demands of Google's data processing applications. The architecture is
optimized to support both high-speed read and write operations, making it well-suited for
scenarios where large volumes of data need to be processed efficiently. By prioritizing
performance, GFS contributes to the rapid and effective execution of tasks such as web
indexing, search operations, and other data-intensive processes. This focus on high
performance is a key factor in enhancing the overall efficiency and responsiveness of the file
system.
● Scalability: Scalability is a pivotal requirement for GFS due to the continuously expanding
volumes of data generated and processed by Google's applications. The system is designed to
scale horizontally, allowing for seamless expansion by adding more commodity hardware to
the distributed infrastructure. This scalability ensures that GFS can handle the ever-increasing
storage needs and processing demands, making it adaptable to the dynamic and growing
nature of Google's data-centric operations. The distributed nature of GFS, coupled with its
scalability, enables it to effectively manage massive datasets without sacrificing performance
or reliability.
● Simplicity: Simplicity in both design and operations is a guiding principle for GFS. This
design choice is intended to ensure easy manageability and maintenance of the distributed file
system. By favoring simplicity, GFS reduces the likelihood of errors, streamlines system
operations, and enhances overall reliability. The file system's simplicity is evident in its
architecture, file organization, and mechanisms for data access. This approach makes GFS
more robust and user-friendly, contributing to its successful integration into Google's
infrastructure and facilitating efficient management of large-scale data processing tasks.
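
As a rough illustration of the replication idea above, the following toy sketch (an assumption for illustration, not GFS's actual placement policy, which also weighs disk utilization and rack placement) assigns each new chunk to three distinct, least-loaded chunk servers:

# Toy replica placement: pick REPLICATION_FACTOR distinct chunk servers,
# preferring those that currently hold the fewest chunks.

REPLICATION_FACTOR = 3  # GFS default: three replicas per chunk

def place_replicas(chunk_counts: dict[str, int]) -> list[str]:
    """Return the three least-loaded servers and record the new replicas."""
    chosen = sorted(chunk_counts, key=chunk_counts.get)[:REPLICATION_FACTOR]
    for server in chosen:
        chunk_counts[server] += 1
    return chosen

counts = {"cs1": 10, "cs2": 4, "cs3": 7, "cs4": 2}
print(place_replicas(counts))  # -> ['cs4', 'cs2', 'cs3']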

GFS Architecture

Components of GFS
GFS runs on clusters of computers. A cluster is simply a group of connected machines, and
each cluster may contain hundreds or even thousands of them. Every GFS cluster includes
three basic entities:

GFS Clients: These are the programs or applications that request files. A request may access
or modify an existing file, or add a new file to the system.

GFS Master Server: It serves as the cluster’s coordinator. It preserves a record of the cluster’s
actions in an operation log and keeps track of the data that describes chunks, i.e., the
metadata. The metadata tells the master server which file each chunk belongs to and where it
sits within that file.

GFS Chunk Servers: They are the workhorses of GFS. They store file chunks of 64 MB each.
Chunk data never passes through the master server; instead, chunk servers deliver the
requested chunks directly to the client. To ensure reliability, GFS keeps multiple copies of
each chunk on different chunk servers, three by default; each copy is referred to as a replica.
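
The interaction between these three entities can be sketched as a simplified, in-memory mock of the read path. All class and variable names here are illustrative assumptions, not Google's API; the point is that the master serves only metadata, while chunk bytes travel directly from a chunk server to the client.

# Simplified mock of the GFS read path: the master returns metadata
# (chunk handle + replica locations); the client then reads the bytes
# straight from one of the chunk servers.

CHUNK_SIZE = 64 * 1024 * 1024

class Master:
    def __init__(self):
        # filename -> list of (chunk_handle, [replica server names])
        self.files = {"/logs/web.log": [("h42", ["cs1", "cs2", "cs3"])]}

    def lookup(self, path: str, chunk_index: int):
        return self.files[path][chunk_index]  # metadata only, never file data

class ChunkServer:
    def __init__(self, chunks: dict[str, bytes]):
        self.chunks = chunks  # chunk_handle -> chunk contents

    def read(self, handle: str, offset: int, length: int) -> bytes:
        return self.chunks[handle][offset:offset + length]

def client_read(master, servers, path, offset, length):
    handle, replicas = master.lookup(path, offset // CHUNK_SIZE)
    server = servers[replicas[0]]  # pick any replica, e.g. the closest
    return server.read(handle, offset % CHUNK_SIZE, length)

servers = {"cs1": ChunkServer({"h42": b"GET /index.html 200\n" * 100})}
print(client_read(Master(), servers, "/logs/web.log", 0, 19))

In the real system the client also caches the replica locations it receives, so repeated reads of the same chunk do not touch the master again.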

GFS mainly stores two types of data:

File Metadata: The file metadata in GFS consists of information about the structure and
properties of files. This metadata includes details such as file names, directory structures, access
permissions, and timestamps. The master server in the GFS architecture is responsible for
managing and maintaining this metadata. It keeps track of the location of chunks that make up a
file, the number of replicas for each chunk, and other relevant attributes. The file metadata is
crucial for organizing and managing the file system's structure.

File Data: The actual content of the files, referred to as file data, is stored in GFS. This data is
divided into fixed-size chunks, typically 64 megabytes in size. Each chunk is treated as a
separate unit for storage and retrieval purposes. The chunk servers are responsible for storing
and managing these chunks of file data. Data replication is used to ensure fault tolerance, with
multiple replicas of each chunk distributed across different chunk servers. This redundancy
allows for continued data access even in the event of hardware failures or other issues.
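
The split between the two kinds of data can be illustrated with a minimal sketch; the field names are assumptions for illustration, since the master's real data structures are more elaborate:

# Sketch of the two kinds of state in GFS: FileMetadata lives on the
# master, while the chunk contents live on the chunk servers.

from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: str            # globally unique chunk identifier
    replicas: list[str]    # chunk servers holding a copy of this chunk

@dataclass
class FileMetadata:
    path: str              # file name within the namespace
    permissions: str       # access permissions
    mtime: float           # last-modified timestamp
    chunks: list[ChunkInfo] = field(default_factory=list)

meta = FileMetadata("/logs/web.log", "rw-r--r--", 1700000000.0,
                    [ChunkInfo("h42", ["cs1", "cs2", "cs3"])])
print(meta.chunks[0].replicas)  # -> ['cs1', 'cs2', 'cs3']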

Advantages of GFS

● High availability: data remains accessible even if a few nodes fail, because every chunk is
replicated; as the GFS designers put it, component failures are the norm rather than the exception.
● High throughput: many nodes operate concurrently, so reads and writes proceed in parallel.
● Reliable storage: corrupted data can be detected and re-replicated from healthy copies.

Disadvantages of GFS

● Not the best fit for small files.
● The master may act as a bottleneck.
● Not efficient for random writes.
● Best suited for data that is written once and then only read or appended.
