A Distributed File System

By, Prof Ankita Mandore


What is a Distributed File System?
 A Distributed File System (DFS) is a file system that is distributed on multiple
file servers or multiple locations.
 It allows programs to access or store remote files as they do with the local
ones, so that files can be reached from any computer on the network.
 A Distributed File System (DFS) makes it convenient for the users of a
distributed system to use files in a distributed environment.
 Compared to a traditional file system used on a standalone machine, a DFS
introduces more complexity because the users and devices are physically
distributed across locations.
 Apart from permanent storage and information sharing, a DFS should provide
additional features, such as remote information sharing, user mobility,
availability, and support for diskless workstations.
 A distributed file system (DFS) is a networked architecture that allows
multiple users and applications to access and manage files across various
machines as if they were on a local storage device. Instead of storing data on
a single server, a DFS spreads files across multiple locations, enhancing
redundancy and reliability.
Functions of DFS
The major functions of a DFS include permanent storage, remote information
sharing, user mobility, availability, and support for diskless workstations.
 The first feature is remote information sharing, which implies that any file
should be transparently accessible from any node irrespective of the location
of the file.
 The second feature is user mobility, i.e., the system should be flexible such
that one can work from any node at any instant of time without the need to
relocate any storage device. This facility suits people who need to access files
while working from different locations at different instances of time.
 The third feature is availability, which implies that the files should always be
available in spite of any temporary failure. The DFS hence must maintain
multiple copies on different nodes, which are called the replicas. The number
of replicas is hidden from the users.
 A DFS should also support diskless workstations. These workstations are
economical, less noisy, and generate less heat because the disks are
separated from the workstation. A RAID array can be kept at a central
location as a repository of files. A DFS should support remote file access
capability, transparently.
Components of DFS
 The major components of a DFS are storage service, true file service, and
name service.
 Storage service is related to the allocation and management of space on the
secondary storage. It provides a logical view of storage to the users. This is
possible by providing operations for storage and retrieval of data.
 True file service provides operations on individual files, such as access,
modification, creation, deletion, etc. To carry out these operations correctly
and efficiently, issues, such as file access mechanism, file sharing semantics,
file caching, file replication, concurrency control, data consistency, etc.,
need to be considered. All these issues are discussed in subsequent sections.
 Name service is another component which enables the users to identify files
easily with text names which are mapped to internal file IDs used to locate
the files. This service is also called directory service, and performs
operations, such as create, delete, add, etc.
Desirable Features of a Good DFS
 Transparency
 Structure transparency: There is no need for the client to know about the number
or locations of file servers and the storage devices. Multiple file servers should be
provided for performance, adaptability, and dependability.
 Access transparency: Both local and remote files should be accessible in the same
manner. The file system should automatically locate the accessed file and
send it to the client’s side.
 Naming transparency: There should not be any hint in the name of the file to the
location of the file. Once a name is given to the file, it should not be changed
during transferring from one node to another.
 Replication transparency: If a file is copied on multiple nodes, both the existence
of the copies and their locations should be hidden from the clients.
 User mobility
 Another desired feature is user mobility, i.e. users should be able to view or access
the file system from any node at any time and the system should exhibit the same
performance.
 This feature can be achieved by bringing the user's environment to the node where
the user is trying to log in.
 Performance
 It is the next important feature, which is measured as the average amount of time
needed to satisfy a user's request for access to a file. In a conventional system, it
is equal to the summation of the time required to access the secondary storage
and the CPU processing time. However, in a DFS, the additional component of
network communication is added along with the time required to access that
machine's secondary storage device, if it is a remote file.
 Ease of use
 A good DFS system should be simple and easy to use. The semantics and the
interface should be easy, user friendly, and should support a large number of
applications.
 Scalability
 Most distributed systems span across locations. Hence, a good DFS should be
scalable and should support the growth of nodes and users, without disruption of
service or loss of performance.
 Availability
 A DFS system should also be available. This implies that the system should continue
to function even in case of partial failure. Some degradation may be allowed, but
the entire system should not break down. One of the solutions to tackle such a
breakdown is to replicate the files.
 Reliability
 Closely tied to these features is another desired feature: reliability. It is important to
minimize the loss of stored data, which relieves users of the burden of creating their
own backups. The system should create backups automatically, which will prove
useful in the event of loss of original files.
 Integrity of data
 A good DFS should ensure the integrity of data. Concurrency control mechanisms
must be used to allow multiple users to access files. Atomic transactions should
also be used to maintain the data integrity of files.
 Security
 Since the DFS will be used by many users, it is important to maintain security.
Appropriate security mechanisms must be implemented so that the users are
confident of the privacy of their data. Security mechanisms must be implemented
to prevent unauthorized access.
 Support for heterogeneous systems
 A distributed system comprising heterogeneous machines provides flexibility to the
users to work on different computing platforms and applications. Hence, a DFS
must support users working in a heterogeneous environment. All such machines
should be able to share files and integrate different storage media. If any DFS
possesses all the desirable features, the distributed system becomes efficient and
easy to use.
File Service Architecture in Distributed System
 File service architecture in distributed systems manages and provides access
to files across multiple servers or locations.
 It ensures efficient storage, retrieval, and sharing of files while maintaining
consistency, availability, and reliability.
 By using techniques like replication, caching, and load balancing, it addresses
data distribution and access challenges in a scalable and fault-tolerant
manner.
Importance of File Service Architecture in Distributed Systems
 Scalability: File service architectures are designed to scale horizontally,
accommodating increasing amounts of data and a growing number of clients without a
significant drop in performance.
 Fault Tolerance: By incorporating redundancy and data replication, these
architectures ensure data availability and reliability, even in the event of hardware
failures or network issues.
 Consistency and Integrity: Advanced file service systems implement consistency
models to ensure that all clients have a coherent view of the data, maintaining data
integrity across the distributed environment.
 High Availability: Through techniques like load balancing and failover mechanisms, file
service architectures provide continuous availability of data, which is crucial for
applications that require real-time access and minimal downtime.
 Performance Optimization: By utilizing caching, data partitioning, and efficient
access protocols, file service architectures enhance performance, reducing latency and
increasing throughput for data-intensive applications.
 Data Management and Organization: These systems provide structured data storage
and access, facilitating easy data management and retrieval, which is essential for
large-scale applications and big-data analytics.
 Flexibility and Adaptability: They offer flexible storage solutions that can be tailored
to various application needs, supporting diverse data types and access patterns, which
is crucial for modern, dynamic computing environments.
File Service Interface
Based on method of remote file access
 Remote access model
 The user's request is performed at the server's node, and the result is
returned to the user.
 The request and response messages are transferred across the network as data
packets, which adds communication overhead.
 The interface and communication protocols of a remote file service must be
designed carefully to minimize the number of messages required to satisfy a
request and the overhead attached to them.
 Typical examples of this type of access model are Locus and Network File System
(NFS). Remote file access always results in network traffic.
Data caching model
 Data caching model is also known as the upload/download
model
 This model uses the locality feature of data access.
 On the first request for a file from the user, the data is brought
to the user's node, cached, and subsequent requests are
satisfied locally.
 Cache space can be managed using a Least Recently Used (LRU)
replacement policy. This access model is implemented in the Sprite
distributed system.
 As compared to the remote access model, the data caching
model reduces network traffic.
 However, caching gives rise to consistency issues in case
multiple users start writing to the same file.
 Caching is commonly used in most DFS because of its
advantages, such as increased performance and greater
scalability.
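The effect of the data caching model on network traffic can be illustrated with a small sketch. This is only an illustration under stated assumptions: FileServer, CachingClient, the whole-file cache granularity, and the capacity of two entries are hypothetical and are not taken from Sprite or any other real DFS.

```python
from collections import OrderedDict


class FileServer:
    """Hypothetical stand-in for a remote file server; counts requests."""

    def __init__(self, files):
        self.files = dict(files)
        self.requests = 0

    def fetch(self, name):
        self.requests += 1  # every call models one network round trip
        return self.files[name]


class CachingClient:
    """Hypothetical client that keeps whole files in a small LRU cache."""

    def __init__(self, server, capacity=2):
        self.server = server
        self.capacity = capacity
        self.cache = OrderedDict()  # least recently used entry comes first

    def read(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)  # mark as most recently used
            return self.cache[name]
        data = self.server.fetch(name)  # cache miss: go to the server
        self.cache[name] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used
        return data


server = FileServer({"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"})
client = CachingClient(server, capacity=2)
for name in ["a.txt", "a.txt", "b.txt", "a.txt", "c.txt", "b.txt"]:
    client.read(name)
```

In this run the six reads cause four server requests, whereas a pure remote access client would cause six. The sketch does not address the consistency problem that arises when several clients write to the same cached file.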
Unit of data access
 When a user requests a file, the unit of data transfer refers to the part
of the data transferred in a single read or write operation.
 Based on the unit of data transfer, the various data transfer models are:
file-level transfer, block-level transfer, byte-level transfer, and
record-level transfer.
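As an illustration of block-level transfer, the following sketch computes which blocks have to be moved to satisfy a byte-range request. The 4096-byte block size and the function name blocks_for_range are assumptions made for the example only.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes


def blocks_for_range(offset, length, block_size=BLOCK_SIZE):
    """Return the indices of the blocks that cover [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))
```

A 10-byte read that starts at offset 4090 straddles a block boundary, so two whole blocks are transferred even though only 10 bytes were requested. File-level transfer would move the entire file instead.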
File Service Architecture
 The file service architecture provides file access by structuring the file
service as the following three components:
 A client module
 A flat file service
 A directory service
1. Flat file service
 A flat file service is used to perform operations on the contents of a file.
 A Unique File Identifier (UFID) is associated with each file in this service.
A UFID is a long sequence of bits chosen so that each file is uniquely identified among all of
the available files in the distributed system.
 When a request is received by the Flat file service for the creation of a new file
then it generates a new UFID and returns it to the requester.
 Flat File Service Model Operations:
 Read(FileId, i, n) -> Data: Reads up to n items from a file starting at item i and
returns them in Data.
 Write(FileId, i, Data): Writes a sequence of Data to a file, starting at item i and
extending the file if necessary.
 Create() -> FileId: Creates a new file of length 0 and assigns it a UFID.
 Delete(FileId): Removes the file from the file store.
 GetAttributes(FileId) -> Attr: Returns the file’s attributes.
 SetAttributes(FileId, Attr): Sets the attributes of the file.
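The operations listed above can be sketched as a small in-memory class. This is a minimal sketch, not a real file server: the class name, the use of a random 128-bit hex string as the UFID, zero-based item positions, bytes as the item type, and zero-padding of gaps are all assumptions made for the example.

```python
import uuid


class FlatFileService:
    """In-memory stand-in for a flat file service (illustrative only)."""

    def __init__(self):
        self._data = {}   # UFID -> bytearray holding the file contents
        self._attrs = {}  # UFID -> dict of attributes

    def create(self):
        ufid = uuid.uuid4().hex  # assumption: a random 128-bit identifier
        self._data[ufid] = bytearray()
        self._attrs[ufid] = {}
        return ufid

    def read(self, file_id, i, n):
        return bytes(self._data[file_id][i:i + n])  # up to n items

    def write(self, file_id, i, data):
        buf = self._data[file_id]
        if i > len(buf):
            buf.extend(b"\x00" * (i - len(buf)))  # assumption: pad gaps with zeros
        buf[i:i + len(data)] = data  # overwrite, extending the file if needed

    def delete(self, file_id):
        del self._data[file_id]
        del self._attrs[file_id]

    def get_attributes(self, file_id):
        return dict(self._attrs[file_id])

    def set_attributes(self, file_id, attr):
        self._attrs[file_id] = dict(attr)
```

Note that no operation takes a text name: the flat file service works only with UFIDs, and the mapping from names to UFIDs is left to the directory service.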
2. Directory Service
 The directory service serves the purpose of relating file text names with their
UFIDs (Unique File Identifiers).
 A client obtains the UFID of a file by supplying the text name of the file to
the directory service.
 The directory service provides operations for creating directories and adding
new files to existing directories.
 Directory Service Model Operations:
 Lookup(Dir, Name) -> FileId : Returns the relevant UFID after finding the text
name in the directory. Throws an exception if Name is not found in the directory.
 AddName(Dir, Name, File): If Name is not in the directory, adds (Name, File) to the
directory and updates the file’s attribute record. If Name already exists in
the directory, an exception is thrown.
 UnName(Dir, Name): If Name is in the directory, the directory entry containing
Name is removed. An exception is thrown if the Name is not found in the directory.
 GetNames(Dir, Pattern) -> NameSeq: Returns all the text names that match the
regular expression Pattern in the directory.
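The directory operations above can be sketched in the same style. This is an illustrative in-memory version under assumptions: directories are identified by plain strings, make_dir is a hypothetical helper, Python's built-in FileNotFoundError and FileExistsError stand in for the exceptions mentioned above, and the update of the file's attribute record is omitted.

```python
import re


class DirectoryService:
    """In-memory stand-in for a directory service (illustrative only)."""

    def __init__(self):
        self._dirs = {}  # directory id -> {text name: UFID}

    def make_dir(self, dir_id):
        # Hypothetical helper; the source does not name this operation.
        self._dirs[dir_id] = {}

    def lookup(self, dir_id, name):
        try:
            return self._dirs[dir_id][name]
        except KeyError:
            raise FileNotFoundError(name) from None

    def add_name(self, dir_id, name, file_id):
        entries = self._dirs[dir_id]
        if name in entries:
            raise FileExistsError(name)
        entries[name] = file_id

    def un_name(self, dir_id, name):
        entries = self._dirs[dir_id]
        if name not in entries:
            raise FileNotFoundError(name)
        del entries[name]

    def get_names(self, dir_id, pattern):
        regex = re.compile(pattern)
        # A name matches only if the whole name fits the pattern.
        return sorted(n for n in self._dirs[dir_id] if regex.fullmatch(n))
```

For example, after adding "a.txt" and "b.log" to a directory, get_names with the pattern r".*\.txt" returns only "a.txt".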
3. Client Module
 The client module executes on each computer and delivers an integrated
service (flat file and directory services) to application programs with the help
of a single API.
 It stores information about the network locations of the flat file server and
directory server processes. Recently used file blocks are held in a cache at the
client side, which improves performance.
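A rough sketch of how a client module might combine the two services behind one call is given below. The two remote services are represented by plain callables, which is an assumption made to keep the example short; the names ClientModule, lookup, and read_block are hypothetical, and the cache is unbounded for simplicity.

```python
class ClientModule:
    """Single API over a name lookup and a block read, with a block cache."""

    def __init__(self, lookup, read_block):
        # lookup(name) -> UFID and read_block(ufid, block_no) -> bytes are
        # hypothetical callables standing in for the two remote services.
        self._lookup = lookup
        self._read_block = read_block
        self._cache = {}  # (UFID, block number) -> bytes

    def read_block(self, name, block_no):
        ufid = self._lookup(name)  # directory service: text name to UFID
        key = (ufid, block_no)
        if key not in self._cache:
            # Cache miss: ask the flat file service for the block.
            self._cache[key] = self._read_block(ufid, block_no)
        return self._cache[key]


names = {"notes.txt": "ufid-1"}
blocks = {("ufid-1", 0): b"first block"}
remote_reads = []


def remote_read(ufid, block_no):
    remote_reads.append((ufid, block_no))  # record each remote request
    return blocks[(ufid, block_no)]


client = ClientModule(names.__getitem__, remote_read)
client.read_block("notes.txt", 0)
client.read_block("notes.txt", 0)  # second read is served from the cache
```

The application sees one read call by name; the second read of the same block does not reach the flat file service.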
Replication in Distributed System
 One of the main goals of a DFS is to improve availability.
 A replicated file has multiple copies located on separate file servers.
 The first major reason for providing replication is to increase reliability by
having independent backups of each file. If one server crashes, the copy can
be taken from another server.
 The second reason for replication is to enable file access to continue even if
one file server is down.
 The objective is that the entire system should not break down during a crash
of a file server.
 Replication allows the workload to be distributed among multiple servers if
any one of the servers is overloaded, thus improving performance.
Unit of replication
 In a DFS, the unit of replication can vary in size: a complete file or block,
a volume, or a pack.
 Complete file or block: A complete file or block is replicated, on demand, only
when the data is needed. This type of replica management is harder in terms
of locating replicas and ensuring file protection.
 Volume: The other unit of file replication is a volume (group) of files. This
method is wasteful if some files of the volume are not needed.
 Pack: In this method, a pack is a subset of the files in a user's primary pack, and all
replicas in the pack are updated together. This ensures mutual consistency
among replicas.
Replica Creation
 The replication can be carried out in any of the following three ways, namely
explicit file replication, lazy file replication, or file replication using a group.
Explicit file replication
 In this method, the entire process is controlled by the programmer. A process
first makes a copy (C) of the file on one server, and then it can make
additional copies on other servers (S1, S2, S3).
 The directory server can maintain a list of all replicas and network addresses
for the files.
 When the name is looked up in the directory, all replicas are listed and each
copy can be found.
 When a file is requested, any one of these copies can be opened.
Lazy file replication
 In this method, only one copy (C) is created on the server (S2), and later, this
server makes replicas on servers S1 and S3.
 The system can track all the replicas and retrieve any one of the copies as
required.
 These copies are actually made in the background, and there is a chance that
the file may change before the copy is made.
File replication using a group

 The third method is to carry out file replication using groups. In this method,
a Write system call is sent to all the servers (S1, S2, S3), and multiple replicas
are created when the original is made.
 In lazy replication, only one server is addressed, not the entire group, and the
copying happens in the background when the server is free, while in the group
mechanism, all the copies are made at the same time.
 Each of these three methods has its own advantages and disadvantages but all
the methods provide transparency.
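The difference between lazy replication and group replication can be sketched as follows. The classes and function names are hypothetical, servers are modelled as in-memory dictionaries, and the background copy is triggered by an explicit call rather than by a real background process.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.files = {}


def group_write(servers, filename, data):
    """Group replication: the write is sent to every server at once."""
    for s in servers:
        s.files[filename] = data


class LazyReplicator:
    """Lazy replication: write one copy now, copy to the others later."""

    def __init__(self, first, others):
        self.first = first
        self.others = others
        self.pending = []  # filenames still waiting to be copied

    def write(self, filename, data):
        self.first.files[filename] = data
        self.pending.append(filename)

    def run_background_copy(self):
        # Copies whatever the first server holds *now*, which may differ
        # from what was written when the file was queued.
        while self.pending:
            filename = self.pending.pop(0)
            for s in self.others:
                s.files[filename] = self.first.files[filename]


s1, s2, s3 = Server("S1"), Server("S2"), Server("S3")
lazy = LazyReplicator(s2, [s1, s3])
lazy.write("f", b"v1")   # only S2 holds the file at this point
lazy.write("f", b"v2")   # the file changes before the copy is made
lazy.run_background_copy()
```

After the background copy, S1 and S3 hold b"v2" and never saw b"v1", which is the window the lazy method leaves open. With group_write all three servers receive the same data in one step.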
Update Protocols
 Consider the scenario shown in the figure, where a client is connected to three file
servers, each storing a replica of file f. Suppose the client appends to the file in
its memory. Now the file updates have to be sent to all servers so that the
replicas are consistent. To do this, the client sends a write f message to all
servers S1, S2, and S3, which hold replicas of file f. If the client crashes after
sending the first two writes to S1 and S2, the replicas are no longer consistent.
Hence, read operations on these servers will now give different values.
Primary copy algorithm
 In this algorithm, one server is designated as primary and all other
servers are termed as secondary, as depicted in Figure 9-18(b).
When a replicated file is to be updated, the changes are sent only
to the primary server. It makes the changes locally and writes them
to stable storage, which avoids problems in case of a primary crash,
and then sends an update command to all the secondaries to make the
corresponding changes. Read operations can now be done either
from the primary or from the secondary servers, thus balancing load.
There is a possibility that the primary server crashes before it has
propagated the changes to all the secondaries. Hence, the updates
are always written to stable storage. When a server reboots,
checks are made to see whether any updates were in progress at
the time of the crash. If so, the updates are carried out, and
later, all secondaries are updated. The main disadvantage is that no
updates can be performed while the primary is down.
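A minimal sketch of the primary copy algorithm is shown below. It is an illustration under assumptions: a Python list stands in for stable storage, the crash is simulated with a flag, and the update is logged before it is applied locally (a write-ahead ordering chosen for the example). All class and method names are hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.log = []  # stands in for stable storage; survives the "crash"

    def update(self, key, value, crash_before_propagate=False):
        self.log.append((key, value))  # 1. record the update durably
        self.data[key] = value         # 2. apply it locally
        if crash_before_propagate:
            return                     # simulated crash: secondaries are stale
        self.propagate()

    def propagate(self):
        # 3. push every logged update to all secondaries, then clear the log
        for key, value in self.log:
            for s in self.secondaries:
                s.data[key] = value
        self.log.clear()

    def recover(self):
        """On reboot, finish any updates that were in progress."""
        for key, value in self.log:
            self.data[key] = value
        self.propagate()


sec1, sec2 = Replica("S2"), Replica("S3")
primary = Primary("S1", [sec1, sec2])
primary.update("f", "v1")
primary.update("f", "v2", crash_before_propagate=True)
stale_value = sec1.data["f"]   # still "v1" until the primary recovers
primary.recover()
```

After recover() the secondaries hold "v2" again. The sketch has no way to accept updates while the primary is down, which mirrors the disadvantage noted above.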
Voting algorithm
 A voting algorithm is proposed to overcome the problems
with a single primary. It requires that the client
request permission from more than one server
before performing a Read or Write operation on files and
their replicas. In the simple majority voting algorithm,
only votes from the most current replicas are valid. As shown
in Figure 9-19, consider a client and a five-server
distributed system. Each server, S1, S2, S3, S4, and S5,
stores a copy of file f with a version number, f2 (S1, S2,
S3) or f3 (S4, S5). Now the client wants to write to file
f2. The following steps are carried out:
1. The client sends a message asking for the version number to a majority of
the servers (S1, S2, S3).
2. Those servers send a reply message specifying the current file version, f2.
The client then knows that the file is up to date on a majority of the
servers.
3. Now the client can update the file stored on each
server: S1, S2, S3, S4, and S5.
Such voting is carried out for both Read and Write operations by the client.
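The idea of majority voting can be sketched as follows. This is a simplified illustration and not the exact procedure of Figure 9-19: it assumes the client polls every reachable server, refuses to proceed without a majority, treats the highest version number among the replies as current, and writes only to the reachable servers. All names are hypothetical.

```python
class ReplicaServer:
    def __init__(self, name, version, data):
        self.name = name
        self.version = version
        self.data = data
        self.up = True  # set to False to simulate an unreachable server


def _quorum(servers):
    """Return the reachable servers, or fail if they are not a majority."""
    reachable = [s for s in servers if s.up]
    if len(reachable) <= len(servers) // 2:
        raise RuntimeError("no majority of servers reachable")
    return reachable


def voted_write(servers, data):
    reachable = _quorum(servers)
    newest = max(s.version for s in reachable)  # most current replica counts
    for s in reachable:
        s.version, s.data = newest + 1, data
    return newest + 1


def voted_read(servers):
    reachable = _quorum(servers)
    # Any two majorities overlap, so at least one reply is the latest version.
    return max(reachable, key=lambda s: s.version).data


servers = [ReplicaServer(f"S{i}", 2, "old") for i in range(1, 6)]
servers[3].up = servers[4].up = False      # S4 and S5 are unreachable
new_version = voted_write(servers, "new")  # S1, S2, S3 form a majority
servers[3].up = servers[4].up = True
servers[0].up = servers[1].up = False      # now S1 and S2 are unreachable
latest = voted_read(servers)               # S3 still holds the new version
```

Because the write majority (S1, S2, S3) and the read majority (S3, S4, S5) share S3, the read returns the new data even though S4 and S5 are stale.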
Two Phase Commit Protocol (Distributed Transaction Management)
 The Two-Phase Commit Protocol (2PC) is a widely used approach to ensure
atomic transactions in distributed systems.
 In 2PC, a coordinator oversees multiple participants and ensures that either
all participants commit the transaction or, if any participant cannot
commit, all participants roll back.
How 2PC Works
 1. Preparation Phase:
 The coordinator sends a Prepare request to all participants.
 Each participant checks if it can complete the transaction and responds with either
a Yes (prepared) or No (abort).
 If any participant responds with a No, the process halts, and the transaction is
rolled back.
 2. Commit Phase:
 If all participants respond with Yes, the coordinator sends a Commit command, and
all participants commit the transaction.
 If any participant responds with No, the coordinator sends an Abort command to
roll back the transaction across all participants.
Example Scenario for 2PC
 Imagine an e-commerce website handling an order transaction:
 When a customer places an order, the e-commerce system needs to check
the inventory (whether items are in stock), payment processing (funds
availability), and delivery system (delivery partner availability).
 The coordinator (order system) sends a Prepare request to each component.
 If all components confirm they are ready, the coordinator proceeds with
Commit. If any component indicates an issue (e.g., insufficient stock), the
coordinator aborts the transaction to maintain consistency.
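The two phases and the order example can be sketched in a few lines. This is a single-process illustration under assumptions: participants are in-memory objects, there is no network and no failure handling, and the coordinator collects all votes before deciding rather than halting at the first No. The names Participant and two_phase_commit are hypothetical.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Vote Yes (True) only if the local part of the work can complete.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: send Prepare to every participant and collect the votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous Yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


order_ok = two_phase_commit(
    [Participant("inventory"), Participant("payment"), Participant("delivery")]
)
order_failed = two_phase_commit(
    [Participant("inventory", can_commit=False),  # e.g. insufficient stock
     Participant("payment"), Participant("delivery")]
)
```

In the second call the payment and delivery participants voted Yes but still end in the aborted state, so no component applies the order on its own.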
Limitations of 2PC
 2PC does not handle network failures very well:
 If the coordinator fails during the Commit Phase, some participants may not
receive the final decision.
 It can result in blocking (participants waiting indefinitely), especially if the
coordinator fails to send a response.
