
Unit – V

Distributed File Systems


Essay Questions
1. What is meant by a file? Explain the file models.
Two main purposes of using files:
1. Permanent storage of information on a secondary storage media.
2. Sharing of information between applications.
File Models:
(a) Unstructured and Structured files: In the unstructured model, a file is an
unstructured sequence of bytes. The interpretation of the meaning and structure of the data stored
in the files is up to the application (e.g. UNIX and MS-DOS). Most modern operating systems use the
unstructured file model.
In structured files (rarely used now) a file appears to the file server as an ordered sequence
of records. Records of different files of the same file system can be of different sizes.
(b) Mutable and immutable files: Based on the modifiability criterion, files are of two types,
mutable and immutable. Most existing operating systems use the mutable file model. An update
performed on a file overwrites its old contents to produce the new contents.
In the immutable model, rather than updating the same file, a new version of the file is created each
time a change is made to the file contents and the old version is retained unchanged. The problems
in this model are increased use of disk space and increased disk activity.
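The immutable model described above can be sketched in a few lines of Python. This is a minimal illustration, not any real file system's interface; all names are hypothetical:

```python
# Minimal sketch of the immutable file model: every update creates a new
# version, and the old versions are retained unchanged.
class ImmutableFileStore:
    def __init__(self):
        self.versions = {}  # filename -> list of byte-string versions

    def write(self, name, data):
        # A "change" never overwrites: it appends a new version.
        self.versions.setdefault(name, []).append(bytes(data))

    def read(self, name, version=-1):
        # By default, read the latest version; old ones stay readable.
        return self.versions[name][version]

store = ImmutableFileStore()
store.write("report.txt", b"draft 1")
store.write("report.txt", b"draft 2")
assert store.read("report.txt") == b"draft 2"      # latest version
assert store.read("report.txt", 0) == b"draft 1"   # old version retained
```

The sketch also makes the stated drawbacks visible: every update consumes additional disk space, since all versions are kept.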

2. Explain the File Caching Schemes?


Every distributed file system uses some form of caching. The reasons are:
1. Better performance, since repeated accesses to the same information can be handled locally, avoiding additional network accesses and disk transfers. This is due to locality in file access patterns.
2. It contributes to the scalability and reliability of the distributed file system, since data can be cached remotely on the client node.
Key decisions to be made in file – caching scheme for distributed systems:
1. Cache location.
2. Modification Propagation.
3. Cache Validation.
1. Cache Location: This refers to the place where the cached data is stored. Assuming that the original location of a file is on its server's disk, there are three possible cache locations in a distributed file system:
i. Server's main memory:
In this case a cache hit costs one network access. It does not contribute to the scalability and reliability of the distributed file system, since every cache hit requires accessing the server.
Advantages:
1. Easy to implement.
2. Totally transparent to clients.
3. Easy to keep the original file and the cached data consistent.
ii. Client's disk:
In this case a cache hit costs one disk access. This is somewhat slower than having the cache in the server's main memory, which is also simpler to implement.
Advantages:
1. Provides reliability against crashes, since modifications to cached data would be lost in a crash if the cache were kept in main memory.
2. Large storage capacity.
3. Contributes to scalability and reliability because on a cache hit the access request can be
serviced locally without the need to contact the server.
iii. Client's main memory:
Eliminates both the network access cost and the disk access cost. This technique is not preferred to a client's disk cache when a large cache size and increased reliability of cached data are desired.
Advantages:
1. Maximum performance gain.
2. Permits workstations to be diskless.
3. Contributes to reliability and scalability.
2. Modification Propagation:
When the cache is located on client nodes, a file's data may simultaneously be cached on multiple nodes. Caches can become inconsistent when the file data is changed by one of the clients and the corresponding data cached at other nodes is not changed or discarded.
There are two design issues involved:
1. When to propagate modifications made to cached data to the corresponding file server.
2. How to verify the validity of cached data.
The modification propagation scheme used has a critical effect on the system's performance and reliability. Techniques used include:
(a) Write-through scheme: When a cache entry is modified, the new value is immediately sent to the server to update the master copy of the file.
Advantage:
1. High degree of reliability and suitability for UNIX-like semantics, since the risk of updated data being lost in the event of a client crash is very low: every modification is immediately propagated to the server holding the master copy.
Disadvantage:
1. This scheme is only suitable where the ratio of read-to-write accesses is fairly large; it does not reduce network traffic for writes, because every write access has to wait until the data is written to the master copy on the server. Hence data caching benefits only read accesses, since the server is involved in all write accesses.
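The write-through scheme can be sketched as follows. This is an illustrative toy model (the server is a plain in-memory object, and all names are hypothetical), not a real protocol implementation:

```python
# Sketch of a write-through client cache: every modification is
# immediately propagated to the server's master copy.
class Server:
    def __init__(self):
        self.master = {}  # filename -> contents

class WriteThroughCache:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def write(self, name, data):
        self.cache[name] = data
        self.server.master[name] = data  # propagate immediately

    def read(self, name):
        if name not in self.cache:                    # cache miss
            self.cache[name] = self.server.master[name]
        return self.cache[name]

srv = Server()
client = WriteThroughCache(srv)
client.write("a.txt", "hello")
assert srv.master["a.txt"] == "hello"  # master copy updated at once
```

Note how every `write` touches the server, which is exactly why the scheme does not reduce network traffic for writes.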
(b) Delayed-write scheme: To reduce network traffic for writes, the new data value is only written to the cache, and all updated cache entries are sent to the server at a later time. There are three commonly used delayed-write approaches:
(i) Write on ejection from cache:
Modified data in the cache is sent to the server only when the cache-replacement policy has decided to eject it from the client's cache. This can result in good performance, but there can be a reliability problem since some server data may be outdated for a long time.
(ii) Periodic write:
The cache is scanned periodically and any cached data that has been modified since the last scan is sent to the server.
(iii) Write on close:
Modifications to cached data are sent to the server when the client closes the file. This does not help much in reducing network traffic for files that are open for very short periods or are rarely modified.
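The delayed-write idea can be sketched with a dirty-entry set that is flushed later (here, on close; all names are hypothetical and the server is modeled as a plain dict):

```python
# Sketch of the delayed-write scheme: writes go only to the cache, and
# dirty entries are flushed to the server at a later time.
class DelayedWriteCache:
    def __init__(self, server_store):
        self.server = server_store   # dict standing in for the server's master copies
        self.cache = {}
        self.dirty = set()

    def write(self, name, data):
        self.cache[name] = data
        self.dirty.add(name)         # not sent to the server yet

    def flush(self):
        # "Periodic write" or "write on close": send all dirty entries.
        for name in self.dirty:
            self.server[name] = self.cache[name]
        self.dirty.clear()

server = {}
dw = DelayedWriteCache(server)
dw.write("t.txt", "v1")
dw.write("t.txt", "v2")          # the intermediate value v1 is never sent
assert "t.txt" not in server     # update not yet propagated
dw.flush()                       # e.g. triggered by closing the file
assert server["t.txt"] == "v2"
```

The two assertions show both the benefit (one flushed update instead of two writes) and the risk (server data is stale until the flush).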

3. Cache Validation Schemes: The modification propagation policy only specifies when the master copy of a file on the server node is updated upon modification of a cache entry. It does not say anything about when the file data residing in the caches of other nodes is updated.

File data may simultaneously reside in the caches of multiple nodes. A client's cache entry becomes stale as soon as some other client modifies the data corresponding to that cache entry in the master copy of the file on the server.

It becomes necessary to verify if the data cached at a client node is consistent with
the master copy. If not, the cached data must be invalidated and the updated version of the
data must be fetched again from the server.

There are two approaches to verify the validity of cached data: the client-initiated approach and the server-initiated approach.

Client-initiated approach: The client contacts the server and checks whether its locally cached data is consistent with the master copy. Two approaches may be used:
1. Checking before every access.
This defeats the purpose of caching because the server needs to be contacted on every access.
2. Periodic checking.
A check is initiated every fixed interval of time.

Server-initiated approach:

A client informs the file server when opening a file, indicating whether the file is being opened for reading, writing, or both. The file server keeps a record of which client has which file open and in what mode.

The server thus monitors the file usage modes of different clients and reacts whenever it detects a potential for inconsistency. For example, if a file is open for reading, other clients may be allowed to open it for reading, but opening it for writing cannot be allowed. Similarly, a new client cannot open a file in any mode if the file is already open for writing.

When a client closes a file, it notifies the server, along with any modifications made to the file. The server then updates its record of which client has which file open in which mode.

When a new client requests to open an already open file and the server finds that the new open mode conflicts with the current open mode, the server can deny the request, queue the request, or disable caching by asking all clients having the file open to remove that file from their caches.
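The server-side bookkeeping described above can be sketched as a table of open files and a conflict check (a simplified model with hypothetical names; real servers would also handle queuing and cache-disable messages):

```python
# Sketch of server-initiated validation: the server records which client
# has which file open and in what mode, and denies conflicting opens.
class FileServer:
    def __init__(self):
        self.open_files = {}  # filename -> list of (client, mode)

    def open(self, client, name, mode):
        existing = self.open_files.get(name, [])
        # A writer conflicts with any other open; readers may share.
        if mode == "w" and existing:
            return False
        if any(m == "w" for _, m in existing):
            return False
        self.open_files.setdefault(name, []).append((client, mode))
        return True

    def close(self, client, name):
        self.open_files[name] = [
            (c, m) for c, m in self.open_files.get(name, []) if c != client
        ]

fs = FileServer()
assert fs.open("c1", "f", "r")       # first reader allowed
assert fs.open("c2", "f", "r")       # concurrent readers allowed
assert not fs.open("c3", "f", "w")   # write conflicts with open readers
fs.close("c1", "f"); fs.close("c2", "f")
assert fs.open("c3", "f", "w")       # now the writer may open
```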

3. Explain the Atomic Transactions?


An atomic transaction is a sequence of operations that performs a single logical function, separate from all other transactions.
Examples:
1. Withdrawing money from your account
2. Making an airline reservation
3. Making a credit – card purchase
A transaction that happens completely or not at all
 No partial results
Example:
1. Cash machine hands you cash and deducts the amount from your account.
2. Airline confirms your reservation and:
a. Reduces the number of free seats.
b. Charges your credit card.
c. (Sometimes) increases the number of meals loaded onto the flight.
Fundamental principles – A C I D:
1. Atomicity – to the outside world, the transaction happens indivisibly.
2. Consistency – the transaction preserves system invariants.
3. Isolation – transactions do not interfere with each other.
4. Durability – once a transaction “commits”, the changes are permanent.
Programming in a Transaction System:
1. Begin_transaction: Mark the start of a transaction.
2. End_transaction: Mark the end of a transaction and try to “commit”.
3. Abort_transaction: Terminate the transaction and restore old values.
4. Read: Read data from a file, table, etc., on behalf of the transaction.
5. Write: Write data to file, table, etc., on behalf of the transaction.
6. Nested Transactions: One or more transactions inside another transaction.
May individually commit, but may need to be undone.
Example: Planning a trip involving three flights:
1. The reservation for each flight “commits” individually.
2. Each reservation must be undone if the entire trip cannot commit.
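The primitives listed above can be sketched with an in-memory store in which writes are buffered between begin and end, so an abort simply discards them (all names are hypothetical, for illustration only):

```python
# Sketch of the transaction primitives: writes are buffered between
# begin_transaction and end_transaction; abort restores the old values.
class TransactionStore:
    def __init__(self):
        self.data = {}
        self.pending = None

    def begin_transaction(self):
        self.pending = {}          # mark the start of a transaction

    def write(self, key, value):
        self.pending[key] = value  # buffered, not yet visible

    def read(self, key):
        if self.pending and key in self.pending:
            return self.pending[key]
        return self.data.get(key)

    def end_transaction(self):
        self.data.update(self.pending)  # commit: apply all-or-nothing
        self.pending = None

    def abort_transaction(self):
        self.pending = None        # discard changes, restoring old values

db = TransactionStore()
db.data["balance"] = 100
db.begin_transaction()
db.write("balance", 50)
db.abort_transaction()
assert db.data["balance"] == 100   # no partial results after abort
db.begin_transaction()
db.write("balance", 50)
db.end_transaction()
assert db.data["balance"] == 50    # committed atomically
```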
Distributed atomic transactions span multiple sites and/or systems, with the same semantics as atomic transactions on a single system:
 ACID
Failure modes:
1. Crash or other failure of one site or system
2. Network failure or partition
3. Byzantine failures

4. Explain the Authentication?


Authentication is the process of determining whether someone or something is, in fact, who
or what it is declared to be.
Logically, authentication precedes authorization (although they may often seem to be
combined). The two terms are often used synonymously but they are two different processes.
Message Authentication: This addresses the threat that the receiver is not sure about the originator of a message. Message authentication can be provided using cryptographic techniques that use secret keys, as is done in the case of encryption.
Message Authentication Code (MAC): A MAC algorithm is a symmetric-key cryptographic technique to provide message authentication. To establish the MAC process, the sender and receiver share a symmetric key K.
Essentially, a MAC is an encrypted checksum generated on the underlying message that is
sent along with a message to ensure message authentication.
The process of using a MAC for authentication works as follows:
1. The sender uses some publicly known MAC algorithm, inputs the message and the secret key K, and produces a MAC value.
2. Similar to a hash, a MAC function also compresses an arbitrarily long input into a fixed-length output. The major difference between a hash and a MAC is that the MAC uses a secret key during the compression.
3. The sender forwards the message along with the MAC. Here, we assume that the message is sent in the clear, as we are concerned with providing message origin authentication, not confidentiality. If confidentiality is required, then the message needs encryption.
4. On receipt of the message and the MAC, the receiver feeds the received message and the shared secret key K into the MAC algorithm and re-computes the MAC value.
5. The receiver now checks equality of freshly computed MAC with the MAC received from
the sender. If they match, then the receiver accepts the message and assures himself
that the message has been sent by the intended sender.
6. If the computed MAC does not match the MAC sent by the sender, the receiver cannot determine whether it is the message that has been altered or the origin that has been falsified. As a bottom line, the receiver safely assumes that the message is not genuine.
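The MAC process above can be demonstrated with Python's standard `hmac` module, using HMAC-SHA256 as the "publicly known MAC algorithm" (the key and messages are made-up examples):

```python
import hmac, hashlib

# The sender and receiver share a symmetric key K.
K = b"shared-secret-key"

# Sender: compute the MAC over the message and send both in the clear.
message = b"transfer 100 to alice"
mac = hmac.new(K, message, hashlib.sha256).digest()

# Receiver: recompute the MAC with the shared key and compare.
# compare_digest performs a constant-time comparison.
expected = hmac.new(K, message, hashlib.sha256).digest()
assert hmac.compare_digest(expected, mac)   # message accepted as authentic

# A tampered message fails verification, since its MAC differs.
tampered = b"transfer 900 to mallory"
forged = hmac.new(K, tampered, hashlib.sha256).digest()
assert not hmac.compare_digest(forged, mac)
```

Note that the message itself is sent in the clear, matching step 3: the MAC provides origin authentication, not confidentiality.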
SHORT ANSWER QUESTIONS
1. Discuss the features of a distributed file system?
(i) Transparency:
a. Structure transparency: Clients should not know the number or locations of file
servers and the storage devices.
Note: multiple file servers provided for performance, scalability, and reliability.
b. Access transparency: Both local and remote files should be accessible in the same way. The file system should automatically locate an accessed file and transport it to the client's site.
c. Naming transparency: The name of the file should give no hint as to the location of
the file. The name of the file must not be changed when moving from one node to
another.
d. Replication transparency: If a file is replicated on multiple nodes, both the
existence of multiple copies and their locations should be hidden from the clients.
(ii) User mobility:
Automatically bring the user's environment (e.g. the user's home directory) to the node where the user logs in.
(iii) Performance:
Performance is measured as the average amount of time needed to satisfy client
requests. This time includes CPU time + time for accessing secondary storage + network
access time. It is desirable that the performance of a distributed file system be
comparable to that of a centralized file system.
(iv) Simplicity and ease of use:
The user interface to the file system should be simple, and the number of commands should be as small as possible.
(v) Scalability:
Growth of nodes and users should not seriously disrupt service.
(vi) High availability:
A distributed file system should continue to function in the face of partial failures such
as link failure, a node failure, or a storage device crash.
A highly reliable and scalable distributed file system should have multiple and
independent file servers controlling multiple and independent storage devices.
(vii) High reliability:
Probability of loss of stored data should be minimized. System should automatically
generate backup copies of critical files.
(viii) Data integrity:
Concurrent access requests from multiple users who are competing to access the file
must be properly synchronized by the use of some form of concurrency control
mechanism. Atomic transactions can also be provided.
(ix) Security:
Users should be confident of the privacy of their data.
(x) Heterogeneity:
There should be easy access to shared data on diverse platforms (e.g. Unix workstation,
Wintel platform etc).

2. Explain the functions of a distributed file system?


A file system is a subsystem of the operating system that performs file management
activities such as organization, storing, retrieval, naming, sharing and protection of files.
A file system frees the programmer from concerns about the details of space allocation and
layout of the secondary storage device.
The design and implementation of a distributed file system is more complex than a
conventional file system due to the fact that the users and storage devices are physically dispersed.
In addition to the functions of the file system of a single processor system, the distributed
file system supports the following:
(a) Remote information sharing: Any node, irrespective of the physical location of the file, can access the file.
(b) User mobility: Users should be permitted to work on different nodes.
(c) Availability: For better fault-tolerance, files should be available for use even in the
event of temporary failure of one or more nodes of the system. Thus the system should maintain
multiple copies of the files, the existence of which should be transparent to the user.

3. Explain the File Accessing Models?


This depends on the method used for accessing remote files and the unit of data access.
(a) Accessing remote files: A distributed file system may use one of the following models to service a client's file access request when the accessed file is remote:
(b) Remote service model: Processing of a client's request is performed at the server's node. Thus, the client's request for file access is delivered across the network as a message to the server, the server machine performs the access request, and the result is sent to the client. The number of messages sent and the overhead per message need to be minimized.

(c) Data-caching model: This model attempts to reduce the network traffic of the previous model by caching the data obtained from the server node. It takes advantage of the locality found in file access patterns. A replacement policy such as LRU (least recently used) is used to keep the cache size bounded.
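The data-caching model with an LRU replacement policy can be sketched using Python's `collections.OrderedDict` (the server is modeled as a plain dict; names are hypothetical):

```python
from collections import OrderedDict

# Sketch of the data-caching model: fetched data is cached locally, and
# an LRU policy keeps the cache size bounded.
class LRUFileCache:
    def __init__(self, server, capacity):
        self.server = server
        self.capacity = capacity
        self.cache = OrderedDict()

    def read(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)      # mark as most recently used
            return self.cache[name]
        data = self.server[name]              # cache miss: fetch from server
        self.cache[name] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # eject least recently used
        return data

server = {"a": 1, "b": 2, "c": 3}
lru = LRUFileCache(server, capacity=2)
lru.read("a"); lru.read("b"); lru.read("c")   # "a" is ejected
assert "a" not in lru.cache and "c" in lru.cache
```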

While this model reduces network traffic, it has to deal with the cache coherency problem during writes: the local cached copy of the data needs to be updated, the original file at the server node needs to be updated, and copies in any other caches need to be updated.

(d) Diskless workstations: A distributed file system, with its transparent remote-file accessing capability, allows the use of diskless workstations in a system.

4. Explain the File Sharing Semantics?


The UNIX semantics is implemented in file systems for single CPU systems because it is the
most desirable semantics and because it is easy to serialize all read/write requests. Implementing
UNIX semantics in a distributed file system is not easy. One may think that this can be achieved in a
distributed system by disallowing files to be cached at client nodes and allowing a shared file to be
managed by only one file server that processes all read and write requests for the file strictly in the
order in which it receives them. However, even with this approach, there is a possibility that, due to
network delays, client requests from different nodes may arrive and get processed at the server
node in an order different from the actual order in which the requests were made.

Also, having all file access requests processed by a single server and disallowing caching on client
nodes is not desirable in practice due to poor performance, poor scalability, and poor reliability of
the distributed file system.

Hence distributed file systems implement a more relaxed semantics of file sharing. Applications
that need to guarantee UNIX semantics should provide mechanisms (e.g. mutex lock etc)
themselves and not rely on the underlying semantics of sharing provided by the file system.

5. Write the advantages of the delayed-write scheme?


(a) Write accesses complete more quickly because the new value is written only to the client cache. This results in a performance gain.
(b) Modified data may be deleted before it is time to send it to the server (e.g. temporary data). Since such modifications need not be propagated to the server, this results in a major performance gain.
(c) Gathering all file updates and sending them together to the server is more efficient than sending each update separately.

6. Explain the Replication Transparency?


Replication of files should be transparent to the users so that multiple copies of a replicated
file appear as a single logical file to its users. This calls for the assignment of a single
identifier/name to all replicas of a file.

In addition, replication control should be transparent, i.e., the number and locations of replicas of a replicated file should be hidden from the user. Thus replication control must be handled automatically in a user-transparent manner.

7. What are the tools for atomic transactions?


(a) Begin_transaction:
Place a begin entry in log
(b) Write:
Write updated data to log
(c) Abort_transaction:
Place abort entry in log
(d) End_transaction (i.e., commit):
i. Place commit entry in log
ii. Copy logged data to files
iii. Place done entry in log
(e) Crash recovery – search the log:
i. If a begin entry, look for matching entries.
ii. If done, do nothing (all files have been updated).
iii. If abort, undo any permanent changes that the transaction may have made.
iv. If commit but not done, copy updated blocks from the log to the files, then add a done entry.
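The recovery search over the log can be sketched as a single pass that replays one transaction's entries (a simplified redo-log model with hypothetical names; it handles the commit-but-not-done, done, and abort cases above):

```python
# Sketch of crash recovery over the transaction log: redo a committed-
# but-not-done transaction, and leave the files alone otherwise.
def recover(log, files):
    writes, state = {}, "none"
    for entry in log:
        op = entry[0]
        if op == "begin":
            writes, state = {}, "begun"   # start collecting logged writes
        elif op == "write":
            _, key, value = entry
            writes[key] = value           # updated data recorded in the log
        elif op in ("commit", "abort", "done"):
            state = op
    if state == "commit":                 # committed, but crashed before "done"
        files.update(writes)              # copy updated values from log to files
        log.append(("done",))             # then add the done entry
    # "done": nothing to do; "abort"/"begun": buffered writes are discarded
    return files

files = {"x": 1}
log = [("begin",), ("write", "x", 2), ("commit",)]   # crash before "done"
recover(log, files)
assert files["x"] == 2 and log[-1] == ("done",)
```

In this model an aborted transaction never touched the files, so "undo" amounts to discarding its logged writes; a real undo/redo log would also record old values.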

8. Explain the File Replication?


High availability is a desirable feature of a good distributed file system and file replication is the
primary mechanism for improving file availability.

A replicated file is a file that has multiple copies, with each copy on a separate file server.

Difference between Replication and Caching:

1. A replica of a file is associated with a server, whereas a cached copy is normally associated with a client.
2. The existence of a cached copy is primarily dependent on the locality in file access
patterns, whereas the existence of a replica normally depends on availability and
performance requirements.
3. As compared to a cached copy, a replica is more persistent, widely known, secure,
available, complete, and accurate.
4. A cached copy is contingent upon a replica. Only by periodic revalidation with respect to a replica can a cached copy be useful.

Advantages of Replication:

1. Increased availability: Alternative copies of replicated data can be used when the primary copy is unavailable.
2. Increased Reliability: Due to the presence of redundant data files in the system,
recovery from catastrophic failure (e.g. hard drive crash) becomes possible.
3. Improved response time: It enables data to be accessed either locally or from a node whose access time is lower than the primary copy access time.
4. Reduced network traffic: If a file's replica is available with a file server that resides on the client's node, the client's access request can be serviced locally, resulting in reduced network traffic.
5. Improved system throughput: Several clients' requests for access to a file can be serviced in parallel by different servers, resulting in improved system throughput.
6. Better scalability: Multiple file servers are available to service client requests due to file replication. This improves scalability.
