Unit 2 AOS

The document discusses Distributed Operating Systems (DOS), highlighting their structure, types, features, advantages, and disadvantages. It covers communication models such as message passing and remote procedure calls, as well as issues related to system failures and their causes. Additionally, it provides examples of DOS applications and design considerations for implementing these systems.


ADVANCED OPERATING SYSTEM

Unit-2
DISTRIBUTED OPERATING SYSTEMS

Distributed Operating Systems: Issues – Communication Primitives – Lamport's Logical Clocks – Deadlock handling strategies – Issues in deadlock detection and resolution – Distributed file systems – Design issues – Case studies – The Sun Network File System – Coda.

Distributed Operating System

A distributed operating system (DOS) is an important class of operating system.

Distributed systems use many central processors to serve multiple real-time applications and users. As a result, data processing jobs are distributed among the processors.

A DOS connects multiple computers via a single communication channel. Each of these systems has its own processor and memory, and the CPUs communicate via high-speed buses or telephone lines.

This operating system consists of numerous computers, nodes, and sites joined together via LAN/WAN lines. It enables full systems to be distributed over a handful of central processors, and it supports many real-time applications and different users. Distributed operating systems can share their computing resources and I/O files while providing users with a virtual machine abstraction.

Types of Distributed Operating System

There are various types of distributed operating systems. Some of them are as follows:

1. Client-Server Systems
2. Peer-to-Peer Systems
3. Middleware
4. Three-tier
5. N-tier

Client-Server System

➢ In this type of system, the client requests a resource and the server provides it. When a client connects to a server, the server may serve multiple clients at the same time.
➢ Client-server systems are also referred to as "tightly coupled operating systems". This model is primarily intended for multiprocessors and homogeneous multicomputers.

Server systems can be divided into two parts:

1. Compute Server System

This system provides an interface to which the client sends requests to be executed as actions. After completing the activity, it sends back a response and transfers the result to the client.

2. File Server System

It provides a file-system interface for clients, allowing them to perform actions such as file creation, updating, deletion, and more.

Peer-to-Peer System

➢ The nodes play an important role in this system. The task is evenly distributed among the nodes, and the nodes can share data and resources as needed. Once again, they require a network to connect.
➢ The peer-to-peer system is known as a "loosely coupled system". This concept is used in computer network applications that contain a large number of processors which do not share memory or clocks.
➢ Each processor has its own local memory, and the processors interact with one another via a variety of communication media such as telephone lines or high-speed buses.

Middleware

Middleware enables interoperability between applications running on different operating systems. Using middleware services, those applications can exchange data with one another.

Three-tier

Information about the client is stored in the intermediate tier rather than in the client itself, which simplifies development. This type of architecture is most commonly used in web applications.

N-tier

N-tier systems are used when a server or application has to forward requests to other enterprise services on the network.

Features of Distributed Operating System

There are various features of a distributed operating system. Some of them are as follows:

Openness

Openness means that the system's services are exposed through well-defined interfaces. These interfaces specify only the syntax of a service: the kind of operation, its return type, its parameters, and so on. Such interfaces are written in an Interface Definition Language (IDL).

Scalability

Scalability means that the system's efficiency should not degrade as new nodes are added. Ideally, the performance of a system with 100 nodes should be comparable to that of a system with 1,000 nodes.
Resource Sharing

Its most essential feature is that it allows users to share resources, and to do so in a secure and controlled manner. Printers, files, data, storage, and web pages are examples of shared resources.

Flexibility

A DOS's flexibility comes from its modular design, which allows it to deliver a more advanced range of high-level services. The quality and completeness of the kernel/microkernel simplify the implementation of such services.

Transparency

Transparency is the most important feature of a distributed operating system. Its primary purpose is to hide the fact that resources are shared: the user should be unaware that the resources being accessed are distributed across several machines.

Heterogeneity

The components of a distributed system may differ in operating systems, networks, programming languages, computer hardware, and implementations by different developers.

Fault Tolerance

Fault tolerance means that users can continue their work even if some software or hardware component fails.

Examples of Distributed Operating System

✓ Solaris: It is designed for SUN multiprocessor workstations.

✓ OSF/1: It is compatible with Unix and was designed by the Open Software Foundation.

✓ MICROS: The MICROS operating system ensures a balanced data load while allocating jobs to all nodes in the system.

✓ DYNIX: It was developed for the Symmetry multiprocessor computers.

✓ LOCUS: It allows local and remote files to be accessed at the same time without any location hindrance.

✓ Mach: It provides multithreading and multitasking features.

Applications of Distributed Operating System

➢ Network Applications: DOS underlies many network applications, including the Web, peer-to-peer networks, multiplayer web-based games, and virtual communities.

➢ Telecommunication Networks: DOS is useful in telephone and cellular networks. A DOS can be found in networks like the Internet, wireless sensor networks, and routing algorithms.

➢ Parallel Computation: DOS is the basis of scientific computing, including cluster computing, grid computing, and a variety of volunteer computing projects.

➢ Real-Time Process Control: A real-time process-control system operates with a deadline; examples include aircraft control systems.

Advantages and Disadvantages of Distributed Operating System


Advantages:

1. It may share all resources (CPU, disk, network interface, nodes, computers, and so on) from one site to another, increasing data availability across the entire system.
2. It reduces the probability of data loss, because data is replicated across all sites; if one site fails, the user can access data from another operational site.
3. The sites operate independently of one another, so if one site crashes, the entire system does not halt.
4. It increases the speed of data exchange from one site to another.
5. It is an open system, since it may be accessed from both local and remote locations.
6. It helps reduce data processing time.

Disadvantages

1. It is hard to implement adequate security in a DOS, since both the nodes and the connections must be secured.
2. The database connected to a DOS is relatively complicated and hard to manage compared with a single-user system.
3. The underlying software is extremely complex and not as well understood as other systems.
4. The more widely distributed a system is, the more communication latency can be expected. As a result, teams and developers must trade off availability, consistency, and latency.
5. Gathering, processing, presenting, and monitoring hardware-usage metrics for big clusters can be a real issue.

ISSUES in DS:

Causes of Operating System Failure:

✓ A system failure may occur due to a hardware failure or a significant software problem, causing the system to freeze, reboot, or stop working completely. An error may or may not be displayed on the screen when a system fails.

✓ The computer may shut down without a warning or error message. If an error message is presented on a Windows PC, it is frequently displayed as a Blue Screen of Death error.

There are mainly two reasons behind the operating system failure.
These reasons are as follows:

1. Software Problems
2. Hardware Problems

Software Problems:

Various software problems can cause operating system failure. Some of them are as follows:

1. Improper Drivers

❖ You need drivers to use additional hardware; they can typically be downloaded from the internet. These drivers may contain bugs, and such flaws can cause the operating system to crash.
❖ Most modern operating systems include a "Safe Mode Boot" option, used for troubleshooting and locating faulty drivers. In Safe Mode, only the most critical drivers are loaded, not all of them.

2. Thrashing

❖ Deadlock happens when two running programs each need control over a particular resource. The OS may attempt to switch back and forth between the two programs during a deadlock.

❖ This eventually leads to thrashing, in which the hard disk is overworked by excessively shifting information between system memory and virtual memory, causing a system crash.

3. Corrupt Registry
❖ The registry is a small database that stores all the details about the kernel, drivers, and programs. The OS searches its registry before starting any app.

❖ Registry corruption may occur as a result of erroneous application removal, careless registry changes, or having too many installed applications, among other things.

4. Virus

❖ A virus may replicate itself on the system. Viruses are particularly dangerous since they can modify and delete user files and cause machines to crash.

❖ A virus is a small piece of code embedded in system software. As the user interacts with infected programs, the virus embeds itself in other files and programs, potentially rendering the system unworkable.

5. Trojan Horse

A Trojan horse is an application that captures the user's login details and transfers them to a rogue user, who can subsequently log in and access system resources.

6. Slow System Performance

❖ The system's performance becoming very slow is one of the clearest signs of operating system failure.

❖ Check whether you have installed the latest version of Windows on the system; security fixes must also be kept up to date. After updating, the system should resume normal operation.

7. Failure to Boot

❖ If the system fails to boot, the OS may have been damaged, or the system's boot order may have been changed. Examine the booting process and boot-sequence setup.

❖ In the case of an OS failure, you may have to reinstall the Windows OS. Keep in mind that the problem might be significant; failure to boot is one of the best indicators of operating system failure.

8. Compatibility Error

❖ This type of issue commonly arises when old Windows apps no longer work after an upgrade. When you encounter it, you know it is one of the operating system breakdown symptoms, but you can usually address it easily.

❖ In most cases, Windows offers a built-in capability that allows applications to be made compatible with the new version. If you are a computer expert, you can run the software in compatibility mode.

Hardware Problems:

Various hardware problems can cause operating system failure. Some of them are as follows:

1. Power Problem

Improper operation of the system power supply can result in the system shutting down immediately.

2. Overheating

Overheating is an important hardware cause of operating system failure, and a simple one to rule out. A fan is built into a computer's CPU to keep it cool; the fan may become worn and inefficient over time, or it may simply be unable to handle the computer's workload.

3. Motherboard Failure

A failed motherboard can cause a system failure, as the computer is then unable to process requests or function in general.

4. RAM

A faulty RAM chip can cause system failures, as the OS is unable to access the data stored on it.
5. Bad Processor

A faulty processor can, and typically does, cause a system failure, since the system may not function if the CPU is not working properly.

Communication Primitives:

Models For Distributed Applications:

There are two widely accepted models for developing distributed applications:

1. Message Passing
2. Remote procedure call.

1. Message Passing Model:

1. The message passing model provides two basic communication primitives: send and receive.
2. The send primitive has two parameters: a message and its destination.
3. The receive primitive also has two parameters: the source of a message and a buffer for storing the message.
4. An application of these primitives can be found in the client-server computation model.

2. Blocking vs. Non-Blocking Primitives:

1. In the standard message passing model, messages are copied three times:
   • from the user buffer to the kernel buffer on the sending computer;
   • from the kernel buffer on the sending computer to the kernel buffer on the receiving computer;
   • from the kernel buffer to the user buffer on the receiving computer.

2. With non-blocking primitives, the send primitive returns control to the user process as soon as the message is queued.

3. The receive primitive responds by signaling and providing a buffer into which the message is copied.

4. A significant disadvantage of non-blocking primitives is that programming becomes difficult.

5. In the unbuffered option, data is copied from one user buffer directly into the other user buffer.

6. With blocking primitives, the send primitive does not return control to the user program until the message has been sent (an unreliable blocking primitive) or until an acknowledgment has been received (a reliable blocking primitive).

7. In both cases, the user buffer can be reused once send returns.

Synchronous vs. Asynchronous Primitives:

• With synchronous primitives, a send primitive blocks until a corresponding receive primitive is executed at the receiving computer.

• With asynchronous primitives, messages are buffered. A send primitive does not block even if there is no corresponding execution of a receive primitive.

Remote Procedure Call:

1. A more natural way to communicate is through a procedure call:

   • every language supports it;
   • its semantics are well defined and understood;
   • it is natural for programmers to use.

2. Without RPC, a programmer using message passing must handle the following details:

   o pairing of responses with request messages;
   o data representation;
   o knowing the address of the remote machine and the server;
   o taking care of communication and system failures.
Basic RPC Operation:

1. The RPC mechanism is based on the observation that a procedure call is a well-known mechanism for transferring control and data within a program running on a single machine.
2. On invoking a remote procedure, the calling process is suspended. Any parameters are passed to the remote machine where the procedure will execute.
3. On completion, the results are passed back from the server to the client, which resumes execution as if it had called a local procedure.

Design Issues in RPC:

o The RPC mechanism is based on the concept of stub procedures.

o The server writer writes the server and links it with the server-side stubs; the client writes her program and links it with the client-side stub.

o The stubs are responsible for managing all details of the remote communication between client and server.
Structure:

• When a program (client) makes a remote procedure call, say p(x, y), it actually makes a local call on a dummy procedure, the client-side stub procedure p.

• The client-side stub procedure constructs a message containing the identity of the remote procedure and its parameters, and then sends it to the remote machine.

• A stub procedure on the server side receives the message and makes a local call to the procedure specified in the message.

• After execution, control returns to the server-side stub procedure, which returns control (and results) to the client-side stub. The stub procedures can be generated at compile time or linked at run time.

Binding:

• Binding is the process that determines the remote procedure, and the machine on which it will be executed.

• It may also check the compatibility of the parameters passed and the type of procedure called.

• A binding server essentially stores the server machines along with the services they provide.

• In another approach, the client specifies the machine and the service required, and the binding server returns the port number to use for communication.

RPC Problems:

1. Procedures reside on different machines.
   • This means we cannot simply jump to the start of the procedure.
   • We need to use network communication techniques to interact with the remote machine.

2. Procedures reside in different address spaces.
   • This means that we cannot pass pointers from caller to callee, because a pointer is only valid in one address space.

3. Parameters and results need to be passed across the network.

4. Machines can crash.
   • What happens if you call a remote procedure and the remote machine crashes before returning?
Marshalling:

• RPC uses the message passing paradigm to communicate over the network.

• The client and server stubs are required to pack and unpack parameters to form messages. This task is known as parameter marshalling.

Marshalling is a relatively complex activity:

• The server may implement multiple procedures, and the client must specify which procedure is to be invoked.

• Different machines may use different character representations.

• Machines may represent integers differently, and floating-point formats may also differ.

• We must therefore send data from one machine to another in some canonical form, for example the Sun XDR standard.

• Worst of all, pointer-based data structures cannot be passed as-is, because client-side pointers will not be valid on the server.

• There are two main solutions to this problem:
   1. Forbid pointer-based data structures.
   2. Encode the data.

• When encoding the data, we need to be careful to avoid losing referential integrity.

RPC Tools:

• Fortunately, we don't have to write the client and server stub (skeleton) code ourselves. Instead, we use tools that do most of the work for us.

• These tools are known as interface generators or stub generators:
   1. The most common one is RPCGEN.
   2. Another is the Common Object Request Broker Architecture (CORBA).

• Stub generators use an interface definition language (IDL) to describe the client and server code.

Paradigms for Building Distributed Programs:

There are two approaches to building a distributed application.

• Communication-Oriented Design
   1. Begins with the communication protocol.
   2. Defines message format and syntax.
   3. Designs the client and server components by specifying how they interact with incoming and outgoing messages.

• Application-Oriented Design
   1. Begins with the application.
   2. Builds and tests a working version that operates on a single machine.
   3. Divides the program into two or more pieces and adds a communication protocol so that each piece can execute on a separate machine.

Client/Server and RPC:

• Like a conventional procedure call, a remote procedure call transfers control to the called procedure.

• The system suspends execution of the calling procedure during the call and only allows the called procedure to execute.

• When a remote program issues a response, it corresponds to the execution of a return in a conventional procedure call: control flows back to the caller and the called procedure ceases to execute.

• One remote procedure may call another remote procedure.

• Conventional procedures typically accept a few arguments and return only a few results, whereas a server can accept or return an arbitrary amount of data.

• It would be ideal if local and remote procedure calls behaved identically, but they differ in:
   a. network delays;
   b. address spaces (the caller and callee do not share one).

• A remote procedure call cannot pass pointers as arguments, and RPC cannot share the caller's environment.

Middleware and Object-Oriented Middleware:

• A variety of commercial tools are available that use the RPC paradigm to construct client-server software.
• These tools are generally called middleware because they fit between a conventional application program and the network software.
• The role of middleware is to ease the task of designing, programming, and managing distributed applications by providing a simple, consistent, and integrated distributed programming environment.
RPC Tools (examples):

MSRPC:

1. Microsoft Remote Procedure Call.
2. Derived from DCE RPC.
   • Has its own IDL and protocol that the client and server stubs use to communicate.

CORBA:

• The Common Object Request Broker Architecture permits an entire object to be placed at the server.
• It is more dynamic: in conventional RPC, programmers use a tool to create stub procedures, while in CORBA the software creates proxies at run time.

MSRPC2:

• Microsoft developed a second generation of MSRPC.

COM/DCOM:

• Microsoft also developed COM and DCOM.
• The standard defines a binary package referred to as a packaging scheme.
   ✓ All objects are given globally unique names, and all object references use the global scheme.
• The Distributed Component Object Model (DCOM) extends COM by creating application-level protocols.
• The combination of COM and DCOM is called COM+.

Lamport’s Logical Clock:

Lamport's logical clock was created by Leslie Lamport. It is a procedure to determine the order in which events occur, and it provides a basis for the more advanced vector clock algorithm. Because there is no global clock in a distributed operating system, Lamport's logical clock is needed.
Algorithm:

• Happened-before relation (->): a -> b means 'a' happened before 'b'.
• Logical clock: the criteria for the logical clocks are:
• [C1]: Ci(a) < Ci(b) [Ci -> logical clock; if 'a' happened before 'b', then the time of 'a' will be less than that of 'b' within a particular process.]
• [C2]: Ci(a) < Cj(b) [the clock value of Ci(a) is less than Cj(b), where 'a' is the sending of a message in process Pi and 'b' is its receipt in process Pj.]

Reference:

• Process: Pi
• Event: Eij, where i is the process number and j is the jth event in the ith process.
• tm: vector time span for message m.
• Ci: the vector clock associated with process Pi; the jth element is Ci[j] and contains Pi's latest value for the current time in process Pj.
• d: drift time; generally d is 1.

Implementation Rules [IR]:

• [IR1]: If a -> b ['a' happened before 'b' within the same process], then Ci(b) = Ci(a) + d.
• [IR2]: Cj = max(Cj, tm + d) [when process Pj receives message m: tm is the clock value Ci(a) carried by the message, and Cj becomes the maximum of Cj and tm + d.]

For Example:
• Take the starting value as 1, since it is the first event and there is no incoming message at the starting point:

e11 = 1

e21 = 1

• The value of each subsequent event increases by d (d = 1) if there is no incoming message, i.e., follow [IR1]:

e12 = e11 + d = 1 + 1 = 2

e13 = e12 + d = 2 + 1 = 3

e14 = e13 + d = 3 + 1 = 4

e15 = e14 + d = 4 + 1 = 5

e16 = e15 + d = 5 + 1 = 6

e22 = e21 + d = 1 + 1 = 2

e24 = e23 + d = 3 + 1 = 4

e26 = e25 + d = 6 + 1 = 7

• When there is an incoming message, follow [IR2], i.e., take the maximum between Cj and tm + d:

e17 = max(7, 5) = 7, [e16 + d = 6 + 1 = 7, e24 + d = 4 + 1 = 5; the maximum of 7 and 5 is 7]

e23 = max(3, 3) = 3, [e22 + d = 2 + 1 = 3, e12 + d = 2 + 1 = 3; the maximum of 3 and 3 is 3]

e25 = max(5, 6) = 6, [e24 + d = 4 + 1 = 5, e15 + d = 5 + 1 = 6; the maximum of 5 and 6 is 6]

Limitation:

• If a -> b, then C(a) < C(b) is guaranteed to be true.
• The converse does not hold: C(a) < C(b) does not imply a -> b, because two causally unrelated (concurrent) events may still end up with ordered clock values.

Below is a C++ program to implement Lamport's logical clock:

C++

#include <bits/stdc++.h>
using namespace std;

int max1(int a, int b)
{
    if (a > b)
        return a;
    else
        return b;
}

void display(int e1, int e2, int p1[], int p2[])
{
    int i;
    cout << "\nThe time stamps of events in P1:\n";
    for (i = 0; i < e1; i++)
        cout << p1[i] << " ";
    cout << "\nThe time stamps of events in P2:\n";
    for (i = 0; i < e2; i++)
        cout << p2[i] << " ";
}

void lamportLogicalClock(int e1, int e2, int m[5][3])
{
    int i, j, k, p1[e1], p2[e2];

    // IR1: with no incoming message, each event is one more than
    // the previous event in the same process (d = 1).
    for (i = 0; i < e1; i++)
        p1[i] = i + 1;
    for (i = 0; i < e2; i++)
        p2[i] = i + 1;

    // Print the message matrix.
    cout << "\t";
    for (i = 0; i < e2; i++)
        cout << "\te2" << i + 1;
    for (i = 0; i < e1; i++) {
        cout << "\n e1" << i + 1 << "\t";
        for (j = 0; j < e2; j++)
            cout << m[i][j] << "\t";
    }

    // IR2: m[i][j] == 1 means e1's (i+1)th event sends to e2's (j+1)th;
    // m[i][j] == -1 means e2's (j+1)th event sends to e1's (i+1)th.
    for (i = 0; i < e1; i++) {
        for (j = 0; j < e2; j++) {
            if (m[i][j] == 1) {
                p2[j] = max1(p2[j], p1[i] + 1);
                for (k = j + 1; k < e2; k++)
                    p2[k] = p2[k - 1] + 1;
            }
            if (m[i][j] == -1) {
                p1[i] = max1(p1[i], p2[j] + 1);
                for (k = i + 1; k < e1; k++)
                    p1[k] = p1[k - 1] + 1;
            }
        }
    }

    display(e1, e2, p1, p2);
}

int main()
{
    int e1 = 5, e2 = 3, m[5][3];

    // Message matrix: 1 = P1 sends to P2, -1 = P2 sends to P1, 0 = none.
    m[0][0] = 0; m[0][1] = 0;  m[0][2] = 0;
    m[1][0] = 0; m[1][1] = 0;  m[1][2] = 1;
    m[2][0] = 0; m[2][1] = 0;  m[2][2] = 0;
    m[3][0] = 0; m[3][1] = 0;  m[3][2] = 0;
    m[4][0] = 0; m[4][1] = -1; m[4][2] = 0;

    lamportLogicalClock(e1, e2, m);
    return 0;
}

Deadlock handling strategies:

The following strategies are used for deadlock handling in a distributed system:

1. Deadlock Prevention
2. Deadlock Avoidance
3. Deadlock Detection and Recovery

Deadlock Prevention

If we picture deadlock as a table standing on four legs, then the four legs correspond to the four conditions which, when they occur simultaneously, cause deadlock. If we break one of the table's legs, the table will definitely fall. The same happens with deadlock: if we can violate one of the four necessary conditions and keep them from occurring together, we can prevent the deadlock.

1. Mutual Exclusion

• Mutual exclusion, from the resource point of view, means that a resource can never be used by more than one process simultaneously. That is fair enough, but it is the main reason behind deadlock.

• If a resource could be used by more than one process at the same time, no process would ever have to wait for it.

Spooling

• For a printer, spooling can work. There is a memory buffer associated with the printer which stores jobs from each process. The printer then collects all the jobs and prints each of them in FCFS order.

• With this mechanism, the process doesn't have to wait for the printer and can continue whatever it was doing; later, it collects the output when it is produced.

However:
1. This cannot be applied to every resource.
2. After some point in time, a race condition may arise between processes competing for space in the spool.

We cannot force a resource to be used by more than one process at the same time, since it would not be fair and serious performance problems could arise. Therefore, we cannot practically violate mutual exclusion.

2. Hold and Wait

• The hold-and-wait condition arises when a process holds one resource while waiting for another resource to complete its task. Deadlock occurs when several processes each hold one resource and wait for others in a cyclic order.

• To break it, we need a mechanism by which a process either doesn't hold any resource or doesn't wait. That means a process must be assigned all the necessary resources before its execution starts, and must not wait for any resource once execution has begun.

!(Hold and wait) = !hold or !wait (the negation of hold-and-wait is: either you don't hold or you don't wait)

• This could be implemented if a process declared all its resources initially. However, this can't be done in practice, because a process can't determine all the resources it will need in advance.
• A process is a set of instructions executed by the CPU, and each instruction may demand multiple resources at multiple times; the need cannot be fixed in advance by the OS.

The problems with this approach are:

1. It is practically not possible.
2. The possibility of starvation increases, because some process may hold a resource for a very long time.

3. No Preemption

• Deadlock arises partly because a resource can't be taken away from a process once allocated. If we take the resource away from the process that is causing the deadlock, we can prevent the deadlock.

• This is not a good approach in general: if we take away a resource that is being used, all the work the process has done so far can become inconsistent.

• Consider a printer being used by a process. If we take the printer away from that process and assign it to another, the data already printed can become inconsistent and ineffective; moreover, the process can't resume printing from where it left off, which causes performance inefficiency.

4. Circular Wait

To violate circular wait, we can assign a priority number to each resource. A process cannot request a resource with a lower priority number than one it already holds. This ensures that no cycle of processes waiting on each other's resources can ever form.

Deadlock avoidance
• In deadlock avoidance, a request for a resource is granted only if the resulting state of the system does not lead to deadlock. The state of the system is continuously checked for safe and unsafe states.
• To avoid deadlocks, each process must tell the OS the maximum number of resources it may request to complete its execution.

• The simplest and most useful approach states that a process should declare the maximum number of resources of each type it may ever need. The deadlock avoidance algorithm examines the resource allocations so that a circular wait condition can never occur.

Safe and Unsafe States

The resource allocation state of a system is defined by the instances of available and allocated resources, and by the maximum instances of the resources demanded by the processes.

Resources Assigned

Process  Type 1  Type 2  Type 3  Type 4
A        3       0       2       2
B        0       0       1       1
C        1       1       1       0
D        2       1       4       0

Resources Still Needed

Process  Type 1  Type 2  Type 3  Type 4
A        1       1       0       0
B        0       1       1       2
C        1       2       1       0
D        2       1       1       2

1. E = (7 6 8 4)
2. P = (6 2 8 3)
3. A = (1 4 0 1)

• The tables above and the vectors E, P, and A describe the resource allocation state of a system. There are 4 processes and 4 types of resources. Table 1 shows the instances of each resource assigned to each process.
• Table 2 shows the instances of the resources each process still needs. Vector E represents the total instances of each resource in the system.
• Vector P represents the instances of resources that have been assigned to processes. Vector A represents the number of resources that are not in use (A = E - P).
• A state of the system is called safe if the system can allocate
all the resources requested by all the processes without
entering into deadlock.
• If the system cannot fulfill the request of all processes then
the state of the system is called unsafe.
• The key of Deadlock avoidance approach is when the request
is made for resources then the request must only be approved
in the case if the resulting state is also a safe state.
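The safety test can be run against the exact state in the tables above, in the style of the Banker's algorithm: repeatedly find a process whose remaining need can be met from what is currently available, let it finish, and reclaim its allocation. The function itself is a minimal sketch, not taken from any particular OS.

```python
# Allocation and remaining-need matrices copied from the tables above,
# with the available vector A = (1, 4, 0, 1).
allocation = {"A": [3, 0, 2, 2], "B": [0, 0, 1, 1],
              "C": [1, 1, 1, 0], "D": [2, 1, 4, 0]}
need = {"A": [1, 1, 0, 0], "B": [0, 1, 1, 2],
        "C": [1, 2, 1, 0], "D": [2, 1, 1, 2]}
available = [1, 4, 0, 1]

def safe_sequence(allocation, need, available):
    """Return a safe completion order of processes, or None if unsafe."""
    work = list(available)
    remaining = set(allocation)
    order = []
    while remaining:
        # Find a process whose remaining need can be satisfied right now.
        runnable = [p for p in remaining
                    if all(n <= w for n, w in zip(need[p], work))]
        if not runnable:
            return None          # no process can finish: unsafe state
        p = sorted(runnable)[0]
        # p runs to completion and then releases everything it holds.
        work = [w + a for w, a in zip(work, allocation[p])]
        remaining.remove(p)
        order.append(p)
    return order

print(safe_sequence(allocation, need, available))  # ['A', 'B', 'C', 'D']
```

With A = (1, 4, 0, 1), only process A can finish first (B, C, and D each need a Type 3 instance, and none is available); once A releases its resources, B, C, and D can all complete, so the state is safe.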

Deadlock Detection and Recovery

The OS doesn't apply any mechanism to avoid or prevent deadlocks; the system assumes that a deadlock will eventually occur. To get rid of deadlocks, the OS periodically checks the system for any deadlock. If it finds one, the OS recovers the system using recovery techniques.
The main task of the OS is detecting the deadlocks. The OS can
detect the deadlocks with the help of Resource allocation graph.

For single-instance resource types, if a cycle forms in the graph then there is definitely a deadlock. For multiple-instance resource types, detecting a cycle is not enough; we have to apply the safety algorithm by converting the resource allocation graph into an allocation matrix and a request matrix.
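For the single-instance case, the cycle check can be done with a depth-first search over the wait-for graph. This is a minimal, generic sketch; the process names are illustrative.

```python
def has_cycle(wait_for):
    """Detect a cycle in a wait-for graph given as {process: set of
    processes it is waiting on}. With single-instance resources,
    a cycle means the processes on it are deadlocked."""
    WHITE, GREY, BLACK = 0, 1, 2       # unvisited / on stack / done
    colour = {p: WHITE for p in wait_for}

    def dfs(p):
        colour[p] = GREY
        for q in wait_for.get(p, ()):
            if colour.get(q, WHITE) == GREY:   # back edge: cycle found
                return True
            if colour.get(q, WHITE) == WHITE and dfs(q):
                return True
        colour[p] = BLACK
        return False

    return any(colour[p] == WHITE and dfs(p) for p in wait_for)

# P1 waits for P2, P2 for P3, P3 for P1: circular wait, so deadlock.
print(has_cycle({"P1": {"P2"}, "P2": {"P3"}, "P3": {"P1"}}))  # True
print(has_cycle({"P1": {"P2"}, "P2": {"P3"}, "P3": set()}))   # False
```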

For Resource

Preempt the resource

We can preempt one of the resources from its owner (a process) and give it to another process, expecting that the other process will complete its execution and release the resource sooner. Choosing which resource to preempt, however, is difficult.

Rollback to a safe state

The system passes through various states before getting into the deadlock state. The operating system can roll back the system to the previous safe state. For this purpose, the OS needs to implement checkpointing at every state.

The moment we get into deadlock, we roll back all the allocations to return to the previous safe state.
For Process

Kill a process

Killing a process can solve the problem, but the bigger concern is deciding which process to kill. Generally, the operating system kills the process that has done the least amount of work so far.

Kill all processes

This is not an advisable approach, but it can be used if the problem becomes very serious. Killing all processes leads to inefficiency because every process must then execute again from the beginning.

Issues in deadlock detection and resolution:

Deadlock
Deadlock is a fundamental problem in distributed systems.
• A process may request resources in any order, which may not be known a priori, and a process can request a resource while holding others.
• If the sequence of allocations of resources to processes is not controlled, deadlocks can occur.
• A deadlock is a state where a set of processes request resources that are held by other processes in the set.
DEADLOCK DETECTION:

1. Resource Allocation Graph (RAG) Algorithm:

• Deadlock detection typically involves constructing a resource allocation graph based on the current resource allocation and request status.
• The RAG algorithm identifies cycles in the graph, indicating the presence of a potential deadlock.
• However, the RAG algorithm suffers from scalability issues in large systems due to the overhead of maintaining the graph.
2. Resource-Requesting Algorithms:

• Another approach is to periodically check the state of resource requests and allocations to identify potential deadlocks.
• This approach involves tracking the resource allocation state and examining resource requests to detect circular waits.
• However, this method may have high overhead and can only identify deadlocks that exist during the detection phase.
DEADLOCK RESOLUTION:

1. Deadlock Prevention:

• Prevention involves ensuring that at least one of the necessary conditions for deadlock (mutual exclusion, hold and wait, no preemption, circular wait) is not satisfied.
• By carefully managing resource allocation and enforcing certain policies, deadlocks can be avoided altogether.
• However, prevention methods can be complex, restrictive, and may limit system performance or resource utilization.
2. Deadlock Avoidance:

• Avoidance involves dynamically analyzing resource requests and allocations to ensure that the system avoids entering an unsafe state where a deadlock can occur.
• Resource allocation is made based on resource requirement forecasts and resource availability to prevent circular waits.
• Avoidance requires a safe-state detection algorithm to determine if a resource allocation will lead to a deadlock.
• However, avoidance techniques may suffer from increased overhead and may limit system responsiveness.
3. Deadlock Detection with Recovery:

• Deadlock detection algorithms can be used to periodically check the system's state for potential deadlocks.
• Once a deadlock is detected, recovery mechanisms can be employed to resolve it.
• Recovery may involve aborting one or more processes, rolling back their progress, and reallocating resources to allow the system to continue.
• However, recovery mechanisms can be complex and may result in data loss or system instability.
Distributed File System:

This section discusses the distributed file system in the operating system along with its features, components, advantages, and disadvantages.

What is Distributed File System?

A distributed file system (DFS) is a file system that is distributed across various file servers and locations. It permits programs to access and store remote data in the same way as local files. It also permits the user to access files from any system. It allows network users to share information and files in a regulated and permitted manner.

DFS's primary goal is to enable users of physically distributed systems to share resources and information through the Common File System (CFS). It is a file system that runs as a part of the operating system. Its configuration is a set of workstations and mainframes connected by a LAN. The process of creating a namespace in DFS is transparent to the clients.

DFS has two components in its services, and these are as follows:

1. Local Transparency
2. Redundancy

Local Transparency

It is achieved via the namespace component.

Redundancy

• It is achieved via a file replication component.
• In the case of failure or heavy load, these components work together to increase data availability by allowing data from multiple places to be logically combined under a single folder known as the "DFS root".
• It is not required to use both DFS components simultaneously; the namespace component can be used without the file replication component, and the file replication component can be used between servers without the namespace component.

Features

There are various features of the DFS. Some of them are as follows:

Transparency

There are mainly four types of transparency. These are as follows:

1. Structure Transparency

The client does not need to be aware of the number or location of file servers and storage devices. For structure transparency, multiple file servers should be provided for adaptability, dependability, and performance.

2. Naming Transparency

There should be no hint of the file's location in the file's name. When the file is transferred from one node to another, the file name should not change.

3. Access Transparency

Local and remote files must be accessible in the same method. The
file system must automatically locate the accessed file and deliver
it to the client.

4. Replication Transparency

When a file is replicated across various nodes, the copies and their locations must be hidden from the clients.

Scalability

The distributed system will inevitably grow over time as more machines are added to the network or two networks are linked together. A good DFS must be designed to scale rapidly as the number of nodes and users in the system increases.

Data Integrity

Many users usually share a file system. The file system needs to
secure the integrity of data saved in a transferred file. A
concurrency control method must correctly synchronize
concurrent access requests from several users who are competing
for access to the same file. A file system commonly provides users
with atomic transactions that are high-level concurrency
management systems for data integrity.

High Reliability

The risk of data loss must be limited as much as feasible in an effective DFS. Users must not feel compelled to make backups of their files due to the system's unreliability. Instead, the file system should back up key files so that they can be restored if the originals are lost. As a high-reliability strategy, many file systems use stable storage.
High Availability

A DFS should be able to continue functioning in the case of a partial failure, such as a node failure, a storage device crash, or a link failure.

Ease of Use

The user interface of a file system in multiprogramming must be simple, and the number of commands in the file system should be minimal.

Performance

The average time it takes to satisfy client requests is used to assess performance. It must perform comparably to a centralized file system.

Distributed File System Replication

Initial versions of DFS used Microsoft's File Replication Service (FRS), enabling basic file replication among servers. FRS detects new or altered files and distributes the most recent version of the full file to all servers.

Windows Server 2003 R2 introduced "DFS Replication" (DFSR). It improves on FRS by copying only the parts of files that have changed and by reducing network traffic with data compression. It also gives users the ability to control network traffic on a configurable schedule using flexible configuration options.

History of Distributed File System

The DFS server component was first introduced as an add-on feature. When it was incorporated into Windows NT 4.0 Server, it was called "DFS 4.1". Later, it was declared a standard component of all Windows 2000 Server editions. Windows NT 4.0 and later versions of Windows have client-side support.

Working of Distributed File System

There are two ways in which DFS namespaces can be implemented:

1. Standalone DFS namespace
2. Domain-based DFS namespace
Standalone DFS namespace

It does not use Active Directory and only permits DFS roots that exist on the local system. A standalone DFS may only be accessed on the system that created it. It offers no fault tolerance and may not be linked to any other DFS.

Domain-based DFS namespace

It stores the DFS configuration in Active Directory, creating the namespace root at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>.

DFS namespace

Traditional file shares, linked to a single server, use SMB paths of the form:

\\<SERVER>\<path>\<subpath>

Domain-based DFS file share paths use the domain name in place of the server's name, in the form:

\\<DOMAIN.NAME>\<dfsroot>\<path>

When users access such a share, either directly or by mapping a drive, their computer connects to one of the available servers associated with that share, based on rules defined by the network administrator. For example, the default behavior is for users to access the server nearest to them; however, this can be changed to prefer a certain server.

Applications of Distributed File System

There are several applications of the distributed file system. Some of them are as follows:

Hadoop

Hadoop is a collection of open-source software services. It is a software framework that uses the MapReduce programming style to allow distributed storage and management of large amounts of data. Hadoop's storage component is known as the Hadoop Distributed File System (HDFS), and its processing component is based on the MapReduce programming model.
NFS (Network File System)

NFS uses a client-server architecture that enables a computer user to store, update, and view files remotely. It is one of several DFS standards for network-attached storage.

SMB (Server Message Block)

IBM developed the SMB protocol for file sharing. It was designed to permit systems to read and write files on a remote host across a LAN. The remote host's directories accessible through SMB are known as "shares".

Design Issues of Distributed System:

A distributed system is a collection of autonomous computer systems that are physically separated but connected by a computer network and equipped with distributed system software. Distributed systems are used in numerous applications, such as online gaming, web applications, and cloud computing. However, creating a distributed system is not simple, and there are a number of design considerations to take into account. The following are some of the major design issues of distributed systems:

Design Issues of Distributed Systems


1. Heterogeneity
2. Openness
3. Security
4. Synchronization
5. Absence of global clock
6. Partial failures
7. Scalability
8. Transparency
1.Heterogeneity:
• The distributed system contains many different kinds of
hardware and software working together in cooperative
fashion to solve problems.
• There may be many different representations of data in the
system this might include different representations for
integers, byte streams, floating point numbers and character
sets.
• There may be many different instruction sets. An application compiled for one instruction set cannot easily run on a computer with another instruction set unless an instruction set interpreter is provided.
• Components in the distributed system have different capabilities, such as faster clock cycles, larger memory capacity, bigger disk farms, printers and other peripherals, and different services.
High Degree of node heterogeneity:
o High-performance parallel systems (multiprocessors as
well as multicomputer)
o High-end PCs and workstations (servers)
o Simple network computers (offer users only network
access)
o Mobile computers (palmtops, laptops)
o Multimedia workstations
High degree of network heterogeneity:
o Local area gigabit networks
o Wireless connections
o Long-haul, high-latency connections

2.Openness:
• The openness of a computer system is the characteristic
that determines whether the system can be extended and
reimplemented in various ways
• The challenge to designers is to tackle the complexity of
distributed systems consisting of many components
engineered by different people
• Open systems are characterized by the fact that their key interfaces are published
• Open distributed systems are based on the provision of a
uniform communication mechanism and published
interfaces for access to shared resources
• Open distributed systems can be constructed from
heterogeneous hardware and software, possibly from
different vendors
3.Security
• Shared data must be protected
❖ Privacy - avoid unintentional disclosure of private data
❖ Security – data is not revealed to unauthorized parties
❖ Integrity – protect data and system state from corruption
• Denial of service attacks – put significant load on the
system, prevent users from accessing it
Security in detail concerned in the following areas:
❖ Authentication, Authorization/Access control: are the
means to identify the right user and user right.

❖ Critical Infrastructure Protection: CIP is the protection of information systems for critical infrastructures, including the telecommunications, energy, financial services, manufacturing, water, transportation, health care and emergency services sectors

❖ Distributed Trust and Policy Management: designed to address the authorization needs of next-generation distributed systems. A trust management system is a unified framework for the specification of security policies, the representation of credentials, and the evaluation and enforcement of policy compliance.

❖ Multicasting security and IPR Protection: defines the common architecture for multicast security (MSEC) key management protocols to support a variety of application, transport, and network layer security protocols, and protects intellectual property rights
❖ Multimedia Security: focuses on security in advanced multimedia applications. There are two major areas of concern: ensuring secure use of multimedia data, and using multimedia data for security applications

❖ Risk analysis, Assessment, Management: a security policy framework is necessary to support the security infrastructure required for the secure movement of sensitive information across and within national boundaries
Synchronization:
1.Concurrent cooperating tasks need to synchronize
• When accessing shared data
• When performing a common task
2. Synchronization must be done correctly to prevent data
corruption:
• Example: two account owner; one deposits the money,
the other one withdraws; they act concurrently
• How to ensure the bank account is in “correct” state
after these actions?
3. Synchronization implies communication
4. Communication can take a long time
5. Excessive synchronization can limit effectiveness and
scalability of distribute system
Absence of Global Clock:
• Cooperating task need to agree on the order of events
• Each task its own notion of time
• Clocks cannot be perfectly synchronized
• How to determine which event occurred first?

Example:
Bank account, starting balance = $100
Client at bank machine A makes a deposit of $150
Client at bank machine B makes a withdrawal of $100
Which event happened first?
Should the bank charge the overdraft fee?
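One standard way to order causally related events without a global clock is Lamport's logical clocks, covered elsewhere in this unit. A minimal sketch applied to the bank-machine example above (the class and method names are illustrative, not from any particular system):

```python
class LamportClock:
    """Minimal Lamport logical clock: a counter incremented on every
    local event and fast-forwarded past any timestamp received in a
    message, so causally related events get increasing timestamps."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time              # timestamp carried by the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_deposit = a.local_event()           # machine A: deposit of $150
t_msg = a.send()                      # A notifies the bank server
t_withdraw = b.receive(t_msg)         # B learns of it, then withdraws
assert t_deposit < t_withdraw         # causally related events are ordered
```

Note that Lamport clocks only order events connected by messages; for the two independent bank machines with no communication, they cannot say which event "really" happened first, which is exactly the difficulty the example illustrates.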

Partial Failures:
• Detection of failures - may be impossible
• Has a component crashed? Or is it just slow?
• Is the network down? Or is it just slow?
• If it’s slow – how long should we wait?
• Handling of failures
• Re-transmission
• Tolerance for failures
• Roll back partially completed task
• Redundancy against failures
• Duplicate network routes
• Replicated databases

Scalability
• Does the system remain effective as it grows?
• As you add more components:
• More synchronization
• More communication → the system runs more slowly.
• Avoiding performance bottlenecks:
• Everyone is waiting for a single shared resource
• In a centrally coordinated system, everyone waits for the coordinator
Transparency:

Distributed systems designers must hide the complexity of the system as much as they can. Adding an abstraction layer is particularly useful in distributed systems.
Example: When users hit search on google.com, they never notice that their query goes through a complex process before Google shows them a result.
• Concealing the heterogeneous and distributed nature of the
system so that it appears to the user like one system

Transparency categories:

Access: access local and remote resources using identical operations (NFS or Samba-mounted file systems)
Location: access without knowledge of location of a resource
(URL’s, e-mail)
Concurrency: allow several processes to operate concurrently
using shared resources in a consistent fashion (two users
simultaneously accessing the bank account)

Transparency categories

Mobility: allow resources to move around
Performance: adaptation of the system to varying load situations without the user noticing it
Scaling: allow system and applications to expand without the need to change the structure of applications or algorithms

Case Study of a Distributed Operating System

Introduction to Amoeba

• Originated at the Vrije Universiteit in Amsterdam, the Netherlands, in 1981


• Currently used in various EU countries

• Built from the ground up. UNIX emulation added later


• Goal was to build a transparent distributed operating system

• Resources, regardless of their location, are managed by the


system, and the user is unaware of where processes are
actually run

The Amoeba System Architecture:

• Assumes that a large number of CPUs are available and that each CPU has tens of megabytes of memory

• CPUs are organised into processor pools

• CPUs do not need to be of the same architecture (can mix SPARC, Motorola PowerPC, 680x0, Intel, Pentium, etc.)

• When a user types a command, the system determines which CPU(s) to execute it on. CPUs can be timeshared.
• Terminals are X-terminals or PCs running X emulators

• The processor pool doesn't have to be composed of CPU boards enclosed in a cabinet; they can be on PCs, etc., in different rooms, countries, ...

• Some servers (e.g., file servers) run on dedicated processors, because they need to be available all the time
The Amoeba Microkernel:

• The Amoeba microkernel is used on all terminals (with an on-board processor), processors, and servers

• The microkernel

o manages processes and threads

o provides low-level memory management support

o supports interprocess communication (point-to-point and group)

o handles low-level I/O for the devices attached to the machine

The Amoeba Servers: Introduction:

• OS functionality not provided by the microkernel is performed by Amoeba servers
• To use a server, the client calls a stub procedure which marshalls parameters, sends the message, and blocks until the result comes back

Server Basics:

• Amoeba uses capabilities


• Every OS data structure is an object, managed by a server
• To perform an operation on an object, a client performs an RPC
with the appropriate server, specifying the object, the operation
to be performed and any parameters needed.
• The operation is transparent (client does not know where server
is, nor how the operation is performed)
• Capabilities

• To create an object the client performs an RPC with the server

• Server creates the object and returns a capability

• To use the object in the future, the client must present the
correct capability

• The check field is used to protect the capability against forgery

Object protection:
• When an object is created, server generates random check
field, which it stores both in the capability and in its own
tables

• The rights bits in the capability are set to on

• The server sends the owner capability back to the client


Creating a capability with restricted rights
• Client can send this new capability to another process

Process Management:

• All processes are objects protected by capabilities

• Processes are managed at 3 levels:
  o by process servers, part of the microkernel
  o by library procedures, which act as interfaces
  o by the run server, which decides where to run the processes

• Process management uses process descriptors, which contain the platform description, the process owner's capability, etc.

Memory Management:

• Designed with performance, simplicity, and economics in mind
• Process occupies contiguous segments in memory
• All of a process is constantly in memory
• Process is never swapped out or paged

Communication:
• Point-to-point (RPC) and Group
The Amoeba Servers:
The File System

• Consists of the Bullet (File) Server, the Directory Server, and the Replication Server

The Bullet Server

• Designed to run on machines with large amounts of RAM and huge local disks
• Used for file storage
• A client process creates a file using the create call
• The Bullet server returns a capability that can be used to read the file

• Files are immutable, and file size is known at file creation time, so contiguous allocation policies are used
The Directory Server:

• Used for file naming


• Maps from ASCII names to capabilities
• Directories also protected by capabilities
• Directory server can be used to name ANY object, not just
files and directories

The Replication Server:

• Used for fault tolerance and performance
• The replication server creates copies of files when it has time

Other Amoeba Servers

The Run Server

• When a user types a command, two decisions have to be made:

  o On which architecture should the process be run?

  o Which processor should be chosen?

• The run server manages the processor pools

• It uses the process's process descriptor to identify the appropriate target architecture

• It checks which of the available processors have sufficient memory to run the process, and estimates which of the remaining processors has the most available compute power

The Boot Server:

• Provides a degree of fault tolerance

• Ensures that servers are up and running

• If it discovers that a server has crashed, it attempts to restart it; otherwise it selects another processor to provide the service

• The boot server can be replicated to guard against its own failure

SUN Network File System (NFS) :


The Sun Network File System (NFS) has become a common standard
for distributed UNIX file access
• NFS runs over LANs (even over WANs – slowly)
• Basic idea
1. allow a remote directory to be "mounted" (spliced) onto a local directory
2. gives access to that remote directory and all its descendants as if they were part of the local hierarchy
• Pretty much exactly like a "local mount" or "link" on UNIX
1. except for implementation and performance ...
2. no, we didn't really learn about these, but they're obvious
• For instance:
• I mount /u4/lazowska on Node1 onto /students/foo on Node2
• users on Node2 can then access this directory as /students/foo
• if I had a file /u4/lazowska/myfile, users on Node2 see it as /students/foo/myfile
• Just as, on a local system, I might link /cse/www/education/courses/451/06wi/ as /u4/lazowska/451 to allow easy access to my web data from my home directory
NFS implementation

• NFS defines a set of RPC operations for remote file access:
– searching a directory
– reading directory entries
– manipulating links and directories
– reading/writing files
• Every node may be both a client and a server
• NFS defines new layers in the Unix file system, below the system call interface: the virtual file system (VFS) provides a standard interface, using v-nodes as file handles, where a v-node describes either a local or a remote file. Local files go through UFS and the buffer cache / i-node table, while remote files go through the NFS layer, which issues RPCs to other (server) nodes and services RPC requests from remote clients.

NFS caching / sharing :


• On an open, the client asks the server whether its cached blocks
are up to date.
• Once a file is open, different clients can write it and get inconsistent
data.
• Modified data is flushed back to the server every 30 seconds.

Example: CMU’s Andrew File System (AFS)


• Developed at CMU to support all of its student computing
• Consists of workstation clients and dedicated file server machines
(differs from NFS)
• Workstations have local disks, used to cache files being used locally
(originally whole files, subsequently 64K file chunks) (differs from
NFS)
• Andrew has a single name space – your files have the same names
everywhere in the world (differs from NFS)
• Andrew is good for distant operation because of its local disk
caching: after a slow startup, most accesses are to local disk.

AFS caching/sharing :

• The need to scale required reducing client-server message traffic
• Once a file is cached, all operations are performed locally
• On close, if the file has been modified, it is replaced on the server
• The client assumes that its cache is up to date, unless it receives a
callback message from the server saying otherwise
– on file open, if the client has received a callback on the file,
it must fetch a new copy; otherwise it uses its locally-cached copy
(differs from NFS)
Example: Berkeley Sprite File System:
• Unix file system developed for diskless workstations with large
memories at UCB (differs from NFS, AFS)
• Considers memory as a huge cache of disk blocks – memory is
shared between file system and VM
• Files are permanently stored on servers – servers have a large
memory that acts as a cache as well
• Several workstations can cache blocks for read-only files
• If a file is being written by more than 1 machine, client caching is
turned off – all requests go to the server (differs from NFS, AFS)
Summary :
• There are a number of issues to deal with:
– what is the basic abstraction
– naming
– caching
– sharing and coherency
– replication
– performance
• No right answer! Different systems make different tradeoffs!
• Performance is always an issue
– always a tradeoff between performance and the semantics of file
operations (e.g., for shared files).
• Caching of file blocks is crucial in any file system
– maintaining coherency is a crucial design issue.
• Newer systems are dealing with issues such as disconnected
operation for mobile computers.
CODA:

Introduction:

The Coda distributed file system is a state-of-the-art experimental file system developed in the group of M. Satyanarayanan at Carnegie Mellon University. Numerous people contributed to Coda, which now incorporates many features not found in other systems:

Mobile Computing

• disconnected operation for mobile clients


o reintegration of data from disconnected clients
o bandwidth adaptation
• Failure Resilience
o read/write replication servers
o resolution of server/server conflicts
o handles of network failures which partition the servers
o handles disconnection of clients
• Performance and scalability
o client side persistent caching of files, directories and
attributes for high performance
o write back caching
• Security
o kerberos like authentication
o access control lists (ACL's)
• Well defined semantics of sharing
• Freely available source code

Distributed file systems:


• A distributed file system stores files on one or more
computers called servers, and makes them accessible to
other computers called clients, where they appear as normal
files.

• There are several advantages to using file servers: the files are more widely available since many computers can access the servers, and sharing the files from a single location is easier than distributing copies of files to individual clients.

• Backups and safety of the information are easier to arrange since only the servers need to be backed up.

• The servers can provide large storage space, which might be costly or impractical to supply to every client.

• There are many problems facing the design of a good distributed file system. Transporting many files over the network can easily create sluggish performance and latency; network bottlenecks and server overload can result.
• A few new features are being implemented (write-back caching and cells, for example), and components of Coda are being reorganized in several areas. We have already received very generous help from users on the net, and we hope that this will continue.

Coda on a client:

• If Coda is running on a client, which we shall take to be a Linux workstation, typing mount will show a file system of type ``Coda'' mounted under /coda. All the files which any of the servers may provide to the client are available under this directory, and all clients see the same name space.
• A client connects to ``Coda'' and not to individual servers,
which come into play invisibly. This is quite different from
mounting NFS file systems which is done on a per server,
per export basis.
• In the most common Windows systems (Novell and
Microsoft's CIFS) as well as with Appleshare on the
Macintosh, files are also mounted per volume.

• To understand how Coda can operate when the network connections to the server have been severed, let's analyze a simple file system operation. Suppose we type ``cat /coda/tmp/foo'' to display the contents of a Coda file.
• When the kernel passes the open request to Venus for the
first time, Venus fetches the entire file from the servers,
using remote procedure calls to reach the servers. It then
stores the file as a container file in the cache area
(currently /usr/coda/venus.cache/).
• The file is now an ordinary file on the local disk, and read-
write operations to the file do not reach Venus but are
(almost) entirely handled by the local file system (ext2 for
Linux).
• Coda read-write operations take place at the same speed as
those to local files. If the file is opened a second time, it will
not be fetched from the servers again, but the local copy will
be available for use immediately.
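The whole-file caching behaviour described above can be sketched in a few lines. This is a toy model, not the real Venus: the in-memory `SERVER_FILES` dictionary stands in for the remote procedure calls to the servers, and the cache-file naming scheme is invented.

```python
import os
import tempfile

SERVER_FILES = {"/coda/tmp/foo": b"hello from the server\n"}  # RPC stand-in
CACHE_DIR = tempfile.mkdtemp(prefix="venus.cache.")
fetch_count = 0

def fetch_from_server(path):
    """Stand-in for the remote procedure call Venus makes to a server."""
    global fetch_count
    fetch_count += 1
    return SERVER_FILES[path]

def open_coda_file(path):
    """On a cache miss, fetch the ENTIRE file and keep it as a local
    container file; later opens are served from the local disk."""
    container = os.path.join(CACHE_DIR, path.strip("/").replace("/", "_"))
    if not os.path.exists(container):        # cache miss: fetch whole file
        with open(container, "wb") as f:
            f.write(fetch_from_server(path))
    return open(container, "rb")             # reads now hit the local disk

with open_coda_file("/coda/tmp/foo") as f:
    first = f.read()
with open_coda_file("/coda/tmp/foo") as f:   # second open: cache hit
    second = f.read()
print(fetch_count)  # the server was contacted only once
```

After the first open, read and write operations run at local-disk speed, which is exactly why the second open in the sketch never touches the "server".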
From caching to disconnected operation:
• The origin of disconnected operation in Coda lies in one of
the original research aims of the project: to provide a file
system with resilience to network failures.
• AFS, which supported thousands of clients on the CMU
campus in the late 80s, had grown so large that network
outages and server failures, happening somewhere almost
every day, became a nuisance.
• It turned out to be a well-timed effort: with the rapid
advent of mobile clients (viz. laptops), Coda's support for
failing networks and servers applied equally well to mobile
clients.
• Normally, when an update is complete on the client it has also
been made on the server. If a server is unavailable, or if the
network connection between client and server fails, such an
operation will incur a time-out error and fail.
• To support disconnected computers, and to operate in the
presence of network failures, Venus will not report such a
failure to the user when an update incurs a time-out; instead
it logs the update locally and replays it to the servers once
the connection is restored, a process known as reintegration.
• A further issue is that during reintegration it may appear
that another client modified the file during the
disconnection and shipped its version to the server. This is
called a local/global conflict and needs to be repaired.
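A minimal sketch of this behaviour, with invented class and method names (the real Venus keeps a client modification log and uses version vectors, which are simplified here to a per-file version counter): while connected, writes go through to the server; while disconnected, they are logged and the user sees no time-out; at reintegration, a file whose server version changed during the disconnection is flagged as a local/global conflict.

```python
class Server:
    def __init__(self):
        self.files, self.versions = {}, {}

    def store(self, name, data):
        self.files[name] = data
        self.versions[name] = self.versions.get(name, 0) + 1

class Client:
    def __init__(self, server):
        self.server = server
        self.connected = True
        self.log = []               # modification log kept while disconnected
        self.cached_versions = {}

    def disconnect(self):
        # remember the server versions last seen; needed to detect
        # local/global conflicts at reintegration time
        self.cached_versions = dict(self.server.versions)
        self.connected = False

    def write(self, name, data):
        if self.connected:
            self.server.store(name, data)   # update reaches the server too
        else:
            self.log.append((name, data))   # user sees success, not a time-out

    def reintegrate(self):
        """Replay logged updates; report files changed by others meanwhile."""
        self.connected = True
        conflicts = []
        for name, data in self.log:
            if self.server.versions.get(name, 0) != self.cached_versions.get(name, 0):
                conflicts.append(name)       # another client updated it
            else:
                self.server.store(name, data)
        self.log = []
        return conflicts

srv = Server()
a, b = Client(srv), Client(srv)
a.write("foo", b"v1")             # connected write, reaches the server
a.disconnect()
a.write("foo", b"local edit")     # logged locally, no failure reported
b.write("foo", b"remote edit")    # another client updates foo on the server
conflicts = a.reintegrate()
print(conflicts)                  # foo is a local/global conflict
```

In real Coda the conflicting file is marked for repair rather than silently overwritten, which is what the sketch's `conflicts` list is standing in for.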

Volumes, Servers and Server Replication:

• Files on Coda servers are not stored in traditional file systems.
Partitions on the servers are made available to the Coda file
server process; these partitions contain files which are grouped
into volumes. Each volume has a directory structure like a file
system: i.e. a root directory for the volume and a tree below it.

• Typically a single server would have some hundreds of volumes,
perhaps with an average size of approximately 10MB. A volume is a
manageable amount of file data, which is a very natural unit from the
perspective of system administration and has proven to be quite flexible.

• Coda holds volume and directory information, access control lists and file
attribute information in raw partitions. These are accessed through a
log-based recoverable virtual memory package (RVM) for speed and
consistency.

• Only file data resides in the files in server partitions. RVM has built-in
support for transactions; this means that in the case of a server crash the
system can be restored to a consistent state without much effort.
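The transactional guarantee that RVM provides can be illustrated with a toy model. The API below (`begin`/`commit`/`abort`) is invented and far simpler than the real RVM package, but it shows the essential point: metadata updates are grouped in a transaction, and a crash before commit leaves the consistent pre-transaction state recoverable.

```python
class ToyRVM:
    """Toy recoverable memory: an undo snapshot stands in for RVM's log."""

    def __init__(self):
        self.data = {}      # "raw partition" holding metadata
        self.undo = None    # rollback state for the open transaction

    def begin(self):
        self.undo = dict(self.data)   # snapshot before any modification

    def set(self, key, value):
        self.data[key] = value        # in-transaction update

    def commit(self):
        self.undo = None              # changes become durable

    def abort(self):
        """Recovery after a crash mid-transaction: roll back."""
        self.data = self.undo
        self.undo = None

rvm = ToyRVM()
rvm.begin()
rvm.set("acl:/coda/tmp", "rw")
rvm.commit()                          # committed: survives crashes

rvm.begin()
rvm.set("acl:/coda/tmp", "corrupt")   # server "crashes" before commit...
rvm.abort()                           # ...recovery restores consistency
print(rvm.data["acl:/coda/tmp"])      # the committed value survives
```

The real RVM writes a redo log to disk instead of keeping an in-memory snapshot, but the visible property is the same: after recovery, only committed metadata updates are present.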

• Coda identifies a file by a triple of 32-bit integers called a Fid: a
VolumeId, a VnodeId and a Uniquifier. The VolumeId identifies the volume
in which the file resides, the VnodeId is the ``inode'' number of the file,
and the Uniquifier is needed for resolution. The Fid is unique in a cluster
of Coda servers.
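As a sketch, the Fid triple can be modelled as three 32-bit fields packed into 12 bytes. The field names follow the text; the big-endian wire layout chosen here is an assumption for illustration, not Coda's actual on-the-wire encoding.

```python
import struct
from collections import namedtuple

# Three 32-bit integers, as described in the text.
Fid = namedtuple("Fid", ["volume_id", "vnode_id", "uniquifier"])

def pack_fid(fid):
    """Pack a Fid into 12 bytes: three unsigned 32-bit big-endian words."""
    return struct.pack(">III", *fid)

def unpack_fid(raw):
    """Inverse of pack_fid."""
    return Fid(*struct.unpack(">III", raw))

fid = Fid(volume_id=0x7F000001, vnode_id=42, uniquifier=7)
raw = pack_fid(fid)
print(len(raw))                  # 12 bytes = 3 x 32 bits
print(unpack_fid(raw) == fid)    # round-trips losslessly
```

Because the VolumeId is part of the Fid, locating a file reduces to looking up which servers hold that volume, which is what makes the Fid unique across a whole cluster rather than a single machine.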

• Volumes can be stored on a group of servers called the VSG (Volume
Storage Group). The advantage of this replication is higher availability
of data: if one server fails, others take over without a client noticing
the failure.

Coda in action:

Coda is in constant active use at CMU. Several dozen clients use it for
development work (on Coda itself), as a general-purpose file system and for
specific disconnected applications. These scenarios have exploited the
features of Coda very successfully.

There are a number of compelling future applications where Coda could provide
significant benefits.

• WWW replication servers should be Coda clients. Many ISPs struggle
with a handful of replicated WWW servers: there is too much traffic for
a single http server to handle. Using NFS to share the documents to be
served has proven problematic due to performance problems, so files are
frequently copied manually to the individual servers.

• Network computers could exploit Coda as a cache to dramatically improve
performance. Updates to the network computer would automatically be
made as they become available on servers, and for the most part the
computer would operate without network traffic, even after restarts.

Getting Coda:
Coda is available via ftp from ftp.coda.cs.cmu.edu. You will find RPM packages
for Linux as well as tarballs with source. Kernel support for Coda will come
with the Linux 2.2 kernels. On the WWW site www.coda.cs.cmu.edu you will find
additional resources such as mailing lists, manuals and research papers.
