
5. Distributed File System
Contents
 Introduction
 File Service Architecture
 Napster and its Legacy
 Peer-to-Peer Systems
 Peer-to-Peer Middleware
 Routing Overlays
 Coordination and Agreement
 Distributed Mutual Exclusion
 Elections
 Multicast Communication

Storage Models

Manual file system (paper ledgers) → Centralized file system → Distributed file system → Decentralized file system
Computing Evolution

(Figure: levels of processing as data size grows from small to large)
 Data size small: Pipelined – instruction level – single-core
 Concurrent – thread level – multi-core
 Service – object level – cluster
 Indexed – file level – grid of clusters
 Mega – block level – embarrassingly parallel processing
 Data size large: Virtual – system level – MapReduce, distributed file system

Hardware configurations span: single-core single processor; single-core multi-processor; multi-core single processor; multi-core multi-processor; clusters of (single- or multi-core) processors with shared memory; and clusters of processors with distributed memory. Cloud computing examples: Google, Box, AWS S3.
Traditional Storage Solutions

 Offline / tertiary storage / secondary memory
 On-system / online memory
 File system abstraction / DFS / Databases

 RAID: Redundant Array of Inexpensive Disks
 NAS: Network-Attached Storage
 SAN: Storage Area Networks
What is a DFS?

 A DFS enables programs to store and access remote files/storage exactly
as they do local ones.
 The performance and reliability of such access should be comparable to
that for files stored locally.
 Recent advances in the bandwidth of switched local networks and in disk
organization have led to high-performance and highly scalable file systems.
 Functional requirements: open, close, read, write, access control,
directory organization, ...
 Non-functional requirements: scalability, fault tolerance, security, ...
File system modules

Directory module: relates file names to file IDs

File module: relates file IDs to particular files

Access control module: checks permission for operation requested

File access module: reads or writes file data or attributes

Block module: accesses and allocates disk blocks

Device module: disk I/O and buffering

Best practice #1: When designing systems think in terms of modules of functionality.

Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5
© Pearson Education 2012
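To make the layering concrete, here is a minimal in-memory sketch (an illustration, not an implementation from the text) of three of the modules above, each depending only on the layer below it:

```python
class BlockModule:
    """Accesses and allocates 'disk' blocks (here, an in-memory dict)."""
    def __init__(self):
        self.blocks, self.next_id = {}, 0

    def allocate(self, data):
        self.next_id += 1
        self.blocks[self.next_id] = data
        return self.next_id

    def read(self, block_id):
        return self.blocks[block_id]


class FileModule:
    """Relates file IDs to particular files (sequences of block IDs)."""
    def __init__(self, block_module):
        self.bm, self.files, self.next_fid = block_module, {}, 0

    def create(self):
        self.next_fid += 1
        self.files[self.next_fid] = []
        return self.next_fid

    def append(self, fid, data):
        self.files[fid].append(self.bm.allocate(data))

    def read_all(self, fid):
        return b"".join(self.bm.read(b) for b in self.files[fid])


class DirectoryModule:
    """Relates file names to file IDs."""
    def __init__(self, file_module):
        self.fm, self.names = file_module, {}

    def create(self, name):
        self.names[name] = self.fm.create()
        return self.names[name]

    def lookup(self, name):
        return self.names[name]
```

Each class talks only to the module directly beneath it, so any layer (say, the block module) could be replaced by a networked block store without touching the others.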
File attribute record structure
File length
Creation timestamp
Read timestamp
Write timestamp
Attribute timestamp
Reference count
Owner
File type
Access control list
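As a sketch, the attribute record above could be represented like this (field types and the ACL representation are assumptions; the field names follow the list):

```python
from dataclasses import dataclass, field
import time

@dataclass
class FileAttributes:
    file_length: int = 0
    creation_timestamp: float = field(default_factory=time.time)
    read_timestamp: float = 0.0
    write_timestamp: float = 0.0
    attribute_timestamp: float = 0.0
    reference_count: int = 1
    owner: str = ""
    file_type: str = "regular"
    access_control_list: dict = field(default_factory=dict)  # principal -> rights
```

In the file service interface that follows, only some of these fields may be updated directly by clients; the lengths and timestamps are maintained by the file service itself.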

Different file storage services

(Figure)

UNIX file system operations

(Figure)
Distributed File System Requirements
 Many of the requirements of distributed services were lessons
learned from the distributed file service.
 The first needs were access transparency and location transparency.
 Later, performance, scalability, concurrency control, fault
tolerance and security requirements emerged and were met in later
phases of DFS development.
 A distributed file system is typically implemented using the client/server model.
Transparency
 Access transparency: Client programs should be unaware of the distribution of
files.
 Location transparency: Client programs should see a uniform namespace. Files
should be able to be relocated without changing their path names.
 Symbolic links
 Cygwin is an example of a Unix-like interface to Windows; it uses symbolic links extensively.
 Mobility transparency: Neither client programs nor system administration
tables in the client nodes should need to change when files are moved, either
automatically or by the system admin.
 Performance transparency: Client programs should continue to perform well
while the load varies within a specified range.
 Scaling transparency: Growth in storage capacity and network size should be
transparent.
Other Requirements
 Concurrent file updates must be controlled (e.g., record locking).
 File replication, to improve performance and availability.
 Hardware and operating system heterogeneity.
 Fault tolerance.
 Consistency: Unix uses one-copy update semantics. This may be
difficult to achieve in a DFS.
 Security.
 Efficiency.
General File Service Architecture
 The responsibilities of a DFS are typically distributed among three modules:
 a client module, which emulates the conventional file system interface, and
 two server modules, which perform operations for clients on directories and
on files.
 Most importantly, this architecture enables a stateless implementation of the
server modules.
 Our approach to the design of a distributed system:
 Architecture
 API
 Protocols
 Implementation
File service architecture model
An architecture that offers a clear separation of the main concerns in providing access to files is obtained by
structuring the file service as three components:
• A flat file service
• A directory service
• A client module

(Figure: application programs on the client computer invoke the client module, which communicates with the directory service and the flat file service on the server computer.)

File service architecture model

• The flat file service is concerned with implementing operations on the contents of files. Unique file
identifiers (UFIDs) are used to refer to files in all requests for flat file service operations. With the use
of UFIDs, each file has an identifier that is unique among all of the files in the distributed system.
The table contains a definition of the interface to the flat file service.
• The directory service provides a mapping between text names for files and their UFIDs. Clients
may obtain the UFID of a file by quoting its text name to the directory service. The directory service
supports the functions needed to generate directories and to add new files to directories.
The table contains a definition of the interface to the directory service.
• The client module runs on each client computer and provides the integrated service (flat file and directory) as a
single API to application programs. For example, on UNIX hosts, a client module emulates the full set
of Unix file operations.
The client module also holds information about the network locations of the flat file and directory server
processes, and achieves better performance through a cache of recently used file
blocks at the client.
Flat file service Interface

Read(FileId, i, n) -> Data      If 1 ≤ i ≤ Length(File): reads a sequence of up to n items
  — throws BadPosition          from the file starting at item i and returns it in Data.
Write(FileId, i, Data)          If 1 ≤ i ≤ Length(File)+1: writes a sequence of Data to the
  — throws BadPosition          file, starting at item i, extending the file if necessary.
Create() -> FileId              Creates a new file of length 0 and delivers a UFID for it.
Delete(FileId)                  Removes the file from the file store.
GetAttributes(FileId) -> Attr   Returns the file attributes for the file.
SetAttributes(FileId, Attr)     Sets the file attributes (only those attributes that are not
                                shaded in the attribute record).

Primary operations are reading and writing. What's missing? How about Open and
Close?
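An in-memory sketch of this interface in Python (the class and exception names, 1-based item positions as in the table, and the use of byte items are assumptions for illustration):

```python
import uuid

class BadPosition(Exception):
    pass

class FlatFileService:
    """Toy flat file service: files are byte sequences named by UFIDs."""
    def __init__(self):
        self.files = {}                      # UFID -> bytearray

    def create(self):
        ufid = uuid.uuid4().hex              # unique among all files
        self.files[ufid] = bytearray()
        return ufid

    def read(self, ufid, i, n):
        f = self.files[ufid]
        if not 1 <= i <= len(f):
            raise BadPosition(i)
        return bytes(f[i - 1:i - 1 + n])     # up to n items from item i

    def write(self, ufid, i, data):
        f = self.files[ufid]
        if not 1 <= i <= len(f) + 1:
            raise BadPosition(i)
        f[i - 1:i - 1 + len(data)] = data    # extends the file if necessary

    def delete(self, ufid):
        del self.files[ufid]
```

Note that there is no open/close and no read-write pointer: every request carries the UFID and the position explicitly, which is what makes a stateless server implementation possible.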
File service architecture model
Access control: In distributed implementations, access rights checks have to be performed
at the server because the server RPC interface is an otherwise unprotected point of access
to files.
Directory service interface: Figure contains a definition of the RPC interface to a directory
service. The primary purpose of the directory service is to provide a service for translating
text names to UFIDs.

Directory service Interface

Lookup(Dir, Name) -> FileId        Locates the text name in the directory and returns the
  — throws NotFound                relevant UFID. If Name is not in the directory, throws an
                                   exception.
AddName(Dir, Name, File)           If Name is not in the directory, adds (Name, File) to the
  — throws NameDuplicate           directory and updates the file's attribute record.
                                   If Name is already in the directory, throws an exception.
UnName(Dir, Name)                  If Name is in the directory, the entry containing Name is
  — throws NotFound                removed from the directory.
                                   If Name is not in the directory, throws an exception.
GetNames(Dir, Pattern) -> NameSeq  Returns all the text names in the directory that match the
                                   regular expression Pattern.

Primary purpose is to provide a service for translating text names to UFIDs.
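A matching sketch of the directory service (the dict representation of a directory and the helper for creating one are assumptions for illustration):

```python
import re

class NotFound(Exception): pass
class NameDuplicate(Exception): pass

class DirectoryService:
    """Toy directory service: maps text names to UFIDs within directories."""
    def __init__(self):
        self.dirs = {}                     # directory id -> {name: ufid}

    def new_dir(self):                     # helper, not part of the interface
        did = len(self.dirs)
        self.dirs[did] = {}
        return did

    def lookup(self, d, name):
        if name not in self.dirs[d]:
            raise NotFound(name)
        return self.dirs[d][name]

    def add_name(self, d, name, ufid):
        if name in self.dirs[d]:
            raise NameDuplicate(name)
        self.dirs[d][name] = ufid

    def un_name(self, d, name):
        if name not in self.dirs[d]:
            raise NotFound(name)
        del self.dirs[d][name]

    def get_names(self, d, pattern):
        return [n for n in self.dirs[d] if re.fullmatch(pattern, n)]
```

A client module resolves a pathname such as /a/b by iterating lookup once per component, then uses the resulting UFID in requests to the flat file service.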
Network File System
 The Network File System (NFS) was developed to allow machines to
mount a disk partition on a remote machine as if it were on a local
hard drive. This allows for fast, seamless sharing of files across a
network.

NFS architecture

(Figure: on the client computer, application programs issue UNIX system calls to the UNIX kernel, whose virtual file system layer dispatches local requests to the UNIX file system and remote ones to the NFS client; the NFS client communicates with the NFS server on the server computer via the NFS protocol, and the server's virtual file system layer passes requests to its local UNIX file system.)
NFS server operations (simplified) – 1

lookup(dirfh, name) -> fh, attr           Returns the file handle and attributes for the file name in the directory dirfh.
create(dirfh, name, attr) -> newfh, attr  Creates a new file name in directory dirfh with attributes attr and returns the new file handle and attributes.
remove(dirfh, name) -> status             Removes file name from directory dirfh.
getattr(fh) -> attr                       Returns the file attributes of file fh. (Similar to the UNIX stat system call.)
setattr(fh, attr) -> attr                 Sets the attributes (mode, user id, group id, size, access time and modify time) of a file. Setting the size to 0 truncates the file.
read(fh, offset, count) -> attr, data     Returns up to count bytes of data from a file starting at offset. Also returns the latest attributes of the file.
write(fh, offset, count, data) -> attr    Writes count bytes of data to a file starting at offset. Returns the attributes of the file after the write has taken place.
rename(dirfh, name, todirfh, toname) -> status  Changes the name of file name in directory dirfh to toname in directory todirfh.
link(newdirfh, newname, dirfh, name) -> status  Creates an entry newname in the directory newdirfh which refers to the file name in the directory dirfh.

NFS server operations (simplified) – 2

symlink(newdirfh, newname, string) -> status  Creates an entry newname in the directory newdirfh of type symbolic link with the value string. The server does not interpret the string but makes a symbolic link file to hold it.
readlink(fh) -> string                    Returns the string that is associated with the symbolic link file identified by fh.
mkdir(dirfh, name, attr) -> newfh, attr   Creates a new directory name with attributes attr and returns the new file handle and attributes.
rmdir(dirfh, name) -> status              Removes the empty directory name from the parent directory dirfh. Fails if the directory is not empty.
readdir(dirfh, cookie, count) -> entries  Returns up to count bytes of directory entries from the directory dirfh. Each entry contains a file name, a file handle, and an opaque pointer to the next directory entry, called a cookie. The cookie is used in subsequent readdir calls to start reading from the following entry. If the value of cookie is 0, reads from the first entry in the directory.
statfs(fh) -> fsstats                     Returns file system information (such as block size, number of free blocks and so on) for the file system containing the file fh.
NFS Overview
 Remote Procedure Calls (RPCs) are used for communication between client and server.
 Client implementation
 Provides transparent access to the NFS file system
 UNIX contains a Virtual File System (VFS) layer
 Vnode: the interface for procedures on an individual file
 Translates vnode operations into NFS RPCs
 Server implementation
 Stateless: the server must not keep anything only in memory
 Implication: all modified data is written to stable storage before returning control to the
client
 Servers often add NVRAM to improve performance
Mapping UNIX System Calls to NFS Operations
 Unix system call: fd = open("/dir/foo")
 Traverse the pathname to get the file handle for foo
 ▪ dirfh = lookup(rootdirfh, "dir");
 ▪ fh = lookup(dirfh, "foo");
 Record the mapping from file descriptor fd to NFS file handle fh
 Set the initial file offset to 0 for fd
 Return the file descriptor fd
 Unix system call: read(fd, buffer, bytes)
 Get the current file offset for fd
 Map fd to the NFS file handle fh
 Call data = read(fh, offset, bytes) and copy data into buffer
 Increment the file offset by bytes
 Unix system call: close(fd)
 Free resources associated with fd
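The client-side bookkeeping above can be sketched as follows (the FakeNFSServer stand-in and all names here are assumptions; a real client issues the lookup and read calls as RPCs):

```python
class NFSClient:
    """Client-side bookkeeping: maps UNIX file descriptors to NFS file handles."""
    def __init__(self, nfs, root_fh):
        self.nfs = nfs                      # RPC stub for the NFS server
        self.root_fh = root_fh
        self.fd_table = {}                  # fd -> [file handle, offset]
        self.next_fd = 2                    # 0-2 are reserved, as in UNIX

    def open(self, path):
        fh = self.root_fh
        for component in path.strip("/").split("/"):
            fh, _attr = self.nfs.lookup(fh, component)   # one RPC per component
        self.next_fd += 1
        self.fd_table[self.next_fd] = [fh, 0]            # offset starts at 0
        return self.next_fd

    def read(self, fd, nbytes):
        fh, offset = self.fd_table[fd]
        _attr, data = self.nfs.read(fh, offset, nbytes)
        self.fd_table[fd][1] = offset + len(data)        # advance the offset
        return data

    def close(self, fd):
        del self.fd_table[fd]               # purely local: server keeps no state


class FakeNFSServer:
    """Tiny in-memory stand-in for an NFS server, for illustration only."""
    def __init__(self, tree):
        self.tree = tree                    # fh -> {name: fh} (dir) or bytes (file)

    def lookup(self, dirfh, name):
        return self.tree[dirfh][name], {}   # returns (fh, attributes)

    def read(self, fh, offset, count):
        return {}, self.tree[fh][offset:offset + count]
```

The file offset lives entirely in the client's fd table, which is why close() needs no server interaction.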
NFS Overview
 The file identifiers used in NFS are called file handles. In UNIX implementations
of NFS, the file handle is derived from the file's i-node number by adding two
extra fields (the i-node number of a UNIX file identifies the file within its file
system).

Sun Network File System

 All implementations of NFS support the NFS protocol – a set of remote
procedure calls that provide the means for clients to perform operations on a
remote file store. The NFS protocol is operating system-independent but was
originally developed for use in networks of UNIX systems, so we discuss the UNIX
implementation of the NFS protocol.
NFS Overview
 Sun's RPC system was developed for use in NFS. It can be configured to use
either UDP or TCP, and the NFS protocol is compatible with both. A port
mapper service is included to enable clients to bind to services in a given host
by name.
 The RPC interface to the NFS server is open: any process can send requests to
an NFS server; if the requests are valid and they include valid user credentials,
they will be acted upon.
 The submission of signed user credentials can be required as an optional
security feature, as can the encryption of data for privacy and integrity.
NFS Overview
Virtual File System
The integration is achieved by a virtual file system (VFS) module,
which has been added to the UNIX kernel to distinguish between
local and remote files and to translate between the UNIX-
independent file identifiers used by NFS and the internal file
identifiers normally used in UNIX and other file systems.
 In addition, VFS keeps track of the file systems that are currently
available both locally and remotely, and it passes each request to the
appropriate local system module.

Client-side Caching
 Caching needed to improve performance
 Reads: Check local cache before going to server
 Writes: Only periodically write-back data to server
 Avoid contacting server
▪ Avoid slow communication over network
▪ Server becomes scalability bottleneck with more clients
 Two client caches
 data blocks
 attributes (metadata)

Cache Consistency
 Problem: Consistency across multiple copies (server and multiple
clients)
 How to keep data consistent between client and server?
▪ If file is changed on server, will client see update?
▪ Determining factor: Read policy on clients

 How to keep data consistent across clients?


▪ If write file on client A and read on client B, will B see update?
▪ Determining factor: Write and read policy on clients

NFS Consistency: Reads
 Reads: How does the client keep current with server state?
 Attribute cache: used to determine when the file changes
 On file open, the client checks with the server to see whether the attributes have
changed – but only if it has not checked in the past T seconds (configurable, e.g. T=3)
 Cached entries are discarded every N seconds (configurable, e.g. N=60)
 Data cache
 ▪ Discard all blocks of a file if the attributes show the file has been modified
 Example: the client cache holds file A's attributes and blocks 1, 2, 3
 Client opens A
 Client reads block 1
 Client waits 70 seconds
 Client reads block 2
 Block 3 is changed on the server
 Client reads block 3
 Client reads block 4
 Client waits 70 seconds
 Client reads block 1
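The timeline above follows from a freshness check like this sketch (the value of T, the attribute names and the server stub interface are assumptions):

```python
T = 3.0   # attribute-cache timeout in seconds (configurable)

class CachedFile:
    """Client-side cache for one file: attributes plus data blocks."""
    def __init__(self, server, fh):
        self.server, self.fh = server, fh
        self.attrs = None
        self.last_checked = float("-inf")
        self.blocks = {}                       # block number -> data

    def read_block(self, n, now):
        if now - self.last_checked > T:        # revalidate at most every T seconds
            fresh = self.server.getattr(self.fh)
            if self.attrs is not None and fresh["mtime"] != self.attrs["mtime"]:
                self.blocks.clear()            # file changed: discard all blocks
            self.attrs, self.last_checked = fresh, now
        if n not in self.blocks:               # miss: fetch the block from the server
            self.blocks[n] = self.server.read_block(self.fh, n)
        return self.blocks[n]
```

Between checks (up to T seconds), a client can keep reading stale cached blocks even after the server copy has changed, which is exactly the weak consistency the example illustrates.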
NFS Consistency: Writes
 Writes: How does client update server?
 Files
 Write-back from client cache to server every 30 seconds
 Also, Flush on close()
 Directories
 Synchronously write to server
 Example: Client X and Y have file A (blocks 1,2,3) cached
 Clients X and Y open file A
 Client X writes to blocks 1 and 2
 Client Y reads block 1
 30 seconds later...
 Client Y reads block 2
 40 seconds later...
 Client Y reads block 1
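A sketch of this write policy (the interval comes from the slide; the structure and names are assumptions):

```python
FLUSH_INTERVAL = 30.0   # seconds between write-backs of file data

class WritebackFile:
    """Client-side write buffering: periodic write-back plus flush-on-close."""
    def __init__(self, server, fh, now=0.0):
        self.server, self.fh = server, fh
        self.dirty = {}                      # offset -> data not yet on the server
        self.last_flush = now

    def write(self, offset, data, now):
        self.dirty[offset] = data            # buffer locally, no RPC yet
        if now - self.last_flush >= FLUSH_INTERVAL:
            self.flush(now)

    def flush(self, now):
        for offset, data in sorted(self.dirty.items()):
            self.server.write(self.fh, offset, data)   # one write RPC per chunk
        self.dirty.clear()
        self.last_flush = now

    def close(self, now):
        self.flush(now)                      # flush-on-close
```

Until a flush happens, another client reading the same file from the server cannot see the buffered writes – the source of the inconsistency window in the example above.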
NFS Architecture
 Allows an arbitrary collection of clients and servers to share a
common file system.
 In many cases all servers and clients are on the same LAN but this is
not required.
 NFS allows every machine to be a client and server at the same time
 Each NFS server exports one or more directories for access by
remote clients.

NFS Protocol
 One of the goals of NFS is to support a heterogeneous system, with
clients and servers running different operating systems on different
hardware. It is essential that the interface between clients and servers be
well defined.
 NFS accomplishes this goal by defining two client-server protocols:
one for handling mounting and another for directory and file access.
 The protocol defines requests by clients and responses by servers.
Local and remote file systems accessible on an NFS client

(Figure: Server 1 exports the sub-tree at /export/people; Server 2 exports the sub-tree at /nfs/users; the client remote-mounts these at /usr/students and /usr/staff respectively.)

Note: The file system mounted at /usr/students in the client is actually the sub-tree located at /export/people in
Server 1; the file system mounted at /usr/staff in the client is actually the sub-tree located at /nfs/users in Server 2.
Mounting

 The client requests a directory structure to be mounted; if the path is
legal, the server returns a file handle to the client.
 Alternatively, mounting can be automatic, by placing the directories to be
mounted in /etc/rc: automounting.
File Access
 NFS supports most Unix operations except open and close. This is
to satisfy "statelessness" on the server end: the server need not
keep a list of open connections. (By contrast, consider a
database connection: you create an object, a connection is opened, etc.)
Implementation
 Below the usual system-call layer, the NFS-specific Virtual File
System (VFS) layer maintains an entry, called a vnode (virtual i-node),
for every open file.
 A vnode indicates whether a file is local or remote.
 For remote files, extra information is provided.
 For local files, the file system and i-node are specified.
 Let's see how vnodes are used in the mount, open and read system calls from a
client application.
Vnode use
 To mount a remote file system, the sys admin (or /etc/rc) calls the
mount program, specifying the remote directory, the local directory on
which it is to be mounted, and other info.
 If the remote directory exists and is available for mounting, the mount
system call is made.
 The kernel constructs a vnode for the remote directory and asks the NFS
client code to create an r-node (remote i-node) in its internal tables. The
vnode in the client VFS points either to a local i-node or to this r-node.
Remote File Access
 When a remote file is opened by the client, it locates the r-node.
 It then asks the NFS client to open the file. The NFS client looks up the path
in the remote file system and returns the file handle to the VFS tables.
 The caller (application) is given a file descriptor for the remote
file. No table entries are made on the server side.
 Subsequent reads invoke the remote file, and for efficiency's
sake the transfers are usually in large chunks (8 KB).
Server Side of File Access

 When the request message arrives at the NFS server, it is passed to
the VFS layer, where the file is identified as a local or remote file.
 Usually an 8 KB chunk is returned. Read-ahead and caching are used to
improve efficiency.
 Caches: on the server side for disk accesses; on the client side, one for
i-nodes and another for file data.
 Of course this leads to cache-consistency and security problems, which
tie into other topics we are discussing.
Peer-to-Peer Systems
 Peer to peer is an approach to computer networking where all
computers share equivalent responsibility for processing data. Peer-
to-peer networking (also known simply as peer networking) differs
from client-server networking, where certain devices have
responsibility for providing or "serving" data and other devices
consume or otherwise act as "clients" of those servers.
 The goal of peer-to-peer systems is to enable the sharing of
data and resources on a very large scale by eliminating any
requirement for separately managed servers and their
associated infrastructure.
Peer-to-Peer Systems
Goal Of Peer-to-peer Systems

 Peer-to-peer systems aim to support useful distributed services and applications using data and
computing resources available in the personal computers and workstations that are present in the
Internet and other networks in ever-increasing numbers.
 An alternative to the client/server model of distributed computing is the peer-to-peer model.
 Client/server is naturally hierarchical, with resources centralized on a limited number of servers.
 In peer-to-peer networks, both resources and control are widely distributed among nodes that are
theoretically equals. (A node with more information, better information, or more power may be “more
equal,” but that is a function of the node, not the network controllers.)
 Robustness, availability of information and fault tolerance tend to come from redundancy and shared
responsibility instead of planning, organization and the investment of a controlling authority.
 Peer-to-peer applications provide better communication for ‘the applications which exploit resources
available at the edges of the Internet such as storage, cycles, content, human presence’.
Client-Server vs. Peer-to-Peer

(Figure)
Characteristics

1. Each user contributes resources to the system.
2. All the nodes in a peer-to-peer system have the same functional
capabilities and responsibilities.
3. Correct operation does not depend on the existence of any centrally
administered systems.
4. A limited degree of anonymity is offered to the providers and users of
resources.
5. Efficient operation depends on the choice of algorithm for placing data
across many hosts and for accessing it.
Advantages of peer-to-peer networking over client-server networking:

 It is easy to install, and so is the configuration of computers on this network.
 All resources and contents are shared by all the peers, unlike client-server
architecture, where the server shares all the contents and resources.
 P2P is more reliable, as central dependency is eliminated. Failure of one peer
doesn't affect the functioning of other peers. In a client-server
network, if the server goes down the whole network is affected.
 There is no need for a full-time system administrator. Every user is the
administrator of his machine, and users can control their shared resources.
 The overall cost of building and maintaining this type of network is
comparatively low.
Disadvantages (drawbacks) of peer-to-peer architecture over client-server:

 In this network the whole system is decentralized, so it is difficult to administer:
no one person can determine the accessibility settings of the whole network.

 Security is weak: viruses, spyware, trojans and other malware can easily be
transmitted over a P2P architecture.

 Data recovery or backup is very difficult. Each computer should have its own
backup system.

 Lots of movies, music and other copyrighted files are transferred using this type of
file transfer. P2P is the technology used in torrents.
Applications

 Theory
 Dynamic discovery of information
 Better utilization of bandwidth, processor, storage, and other resources
 Each user contributes resources to network
 Practical examples
 Sharing browser cache over 100Mbps lines
 Disk mirroring using spare capacity
 Deep search beyond the web

Features
☻Peer-to-peer middleware: The third generation is characterized by the
emergence of middleware layers for the application-independent
management of distributed resources on a global scale. Several research
teams have now completed the development, evaluation and refinement
of peer-to-peer middleware platforms and demonstrated or deployed
them in a range of application services.
☻Routing Overlays: Routing overlays share many characteristics with the
IP packet routing infrastructure that constitutes the primary
communication mechanism of the Internet. It is therefore legitimate to
ask why an additional application-level routing mechanism is required in
peer-to-peer systems.
Napster and its legacy

 The downloading of digital music files was the first application for which
a globally scalable storage and retrieval service was needed. Both the need
for and the feasibility of a peer-to-peer solution were first demonstrated by
the Napster file-sharing system, which let users share files. Napster
became very popular for music exchange soon after its launch in 1999.
 Napster's architecture included centralized indexes, but users
supplied the files, which were stored and accessed on their personal
computers. Napster's method of operation is illustrated by the
sequence of steps shown in the figure.
Napster's architecture

(Figure)
Lessons learned from Napster
Napster demonstrated the feasibility of building a useful large-
scale service that depends almost wholly on data and computers
owned by ordinary Internet users.
To avoid swamping the computing resources of individual users (for
example, the first user to offer a chart-topping song) and their
network connections, Napster took account of network locality –
the number of hops between the client and the server – when
allocating a server to a client requesting a song.
This simple load distribution mechanism enabled the service to
scale to meet the needs of large numbers of users.
Limitations
Napster used a (replicated) unified index of all available music files. For the
application in question, the requirement for consistency between the
replicas was not strong, so this did not hamper performance, but for many
applications it would constitute a limitation.
Application dependencies: Napster took advantage of the special
characteristics of the application for which it was designed in other ways:
 Music files are never updated, avoiding any need to make sure all the replicas of files
remain consistent after updates.
 No guarantees are required concerning the availability of individual files – if a music
file is temporarily unavailable, it can be downloaded later. This reduces the
requirement for dependability of individual computers and their connections to the
Internet.
Peer-to-Peer middleware
 Peer-to-Peer Middleware is to provide mechanism to access data
resources anywhere in network. A key problem in Peer-to-Peer
applications is to provide a way for clients to access data resources
efficiently.
 Similar needs in client/server technology led to solutions like NFS.
However, NFS relies on pre- configuration and is not scalable enough
for peer-to-peer.

Functional & Non-Functional Requirements
Functional requirements:
 Simplify the construction of services across many hosts in a wide network.
 Add and remove resources at will.
 The interface offered to application programmers should be simple and independent of the types of
distributed resources.
Non-functional requirements:
 Global scalability: peer-to-peer applications aim to exploit the hardware resources of very
large numbers of hosts connected to the Internet.
 Load balancing: the performance of any system designed to exploit a large number of
computers depends upon the balanced distribution of workload across them. This is
achieved by random placement of resources together with the use of replicas of heavily
used resources.
Functional & Non-Functional Requirements
 Optimization for local interactions between neighboring peers: The 'network
distance' between nodes that interact has a substantial impact on the latency of
individual interactions, such as client requests for access to resources. The middleware
should aim to place resources close to the nodes that access them the most.
 Accommodation to highly dynamic host availability: Most peer-to-peer systems are
constructed from host computers that are free to join or leave the system at any time.
 Security of data in an environment with heterogeneous trust: Data must be protected
even though the participating hosts are owned and trusted to varying degrees.
 Anonymity, deniability and resistance to censorship.
Routing Overlays
 Routing overlays share many characteristics with the IP packet routing
infrastructure that constitutes the primary communication mechanism
of the Internet. It is therefore legitimate to ask why an additional
application-level routing mechanism is required in peer-to-peer systems.
1. A routing overlay is a distributed algorithm for a middleware layer responsible for routing
requests from any client to a host that holds the object to which the request is addressed.
2. Any node can access any object by routing each request through a sequence of nodes,
exploiting knowledge at each of them to locate the destination object.
3. Globally unique identifiers (GUIDs), also known as opaque identifiers, are used as names, but do not
contain location information.
4. A client wishing to invoke an operation on an object submits a request including the object's
GUID to the routing overlay, which routes the request to a node at which a replica of the
object resides.
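Steps 1–4 can be made concrete with a toy routing overlay (this sketch is an assumption in the style of consistent hashing, not any particular system): node IDs and object GUIDs share one hash space, and a request is routed to the node whose ID most closely follows the object's GUID on a ring.

```python
import hashlib
from bisect import bisect_left, insort

def guid(name):
    """Opaque identifier: a hash, carrying no location information."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

class RoutingOverlay:
    def __init__(self):
        self.node_ids = []                  # sorted ring of node GUIDs

    def join(self, node_name):
        nid = guid(node_name)
        insort(self.node_ids, nid)          # keep the ring sorted
        return nid

    def leave(self, node_name):
        self.node_ids.remove(guid(node_name))

    def route(self, object_name):
        """Node responsible for the object: first node at or after its GUID."""
        g = guid(object_name)
        i = bisect_left(self.node_ids, g)
        return self.node_ids[i % len(self.node_ids)]   # wrap around the ring
```

When a node leaves, the objects it was responsible for are simply routed to the next node on the ring; no client tables need updating, which is the point of the overlay.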
IP vs. Peer-to-Peer Routing Overlay

(Figure)
Coordination and Agreement

 A collection of algorithms is used to enable peer processes in a distributed system to
coordinate their actions or to agree on one or more values.

 The reason for avoiding fixed master-slave relationships is that we often require our systems to keep working
correctly even if failures occur, so we need to avoid single points of failure, such as fixed masters.

 A failure model is another important element in a distributed system. We begin by considering some algorithms
that tolerate no failures, progress through benign failures, and then explore how to tolerate arbitrary failures.

 Coordination and agreement relates to group communication: the ability to multicast a message to a
group is a very useful communication paradigm, with applications from locating resources to coordinating the
updates to replicated data. We examine multicast reliability and ordering semantics, and give algorithms to
achieve the variations.
Failure assumptions and failure detectors
Failure assumptions
 Failures of the fundamental network components can be masked by a reliable
communication protocol – for example, by retransmitting missing or corrupted
messages. Also, for the sake of simplicity, we assume that no process failure
implies a threat to the other processes' ability to communicate. This means that
none of the processes depends upon another to forward messages.

 In any particular interval of time, communication between some processes may
succeed while communication between others is delayed. For example, the failure of
a router between two networks may mean that a collection of four processes is
split into two pairs, such that intra-pair communication is possible over their
respective networks but inter-pair communication is not possible while the router
has failed. This is known as a network partition.
Failure assumptions and failure detectors
 Failure detectors: A failure detector is a service that processes queries about whether a
particular process has failed. It is often implemented by an object local to each process (on the
same computer) that runs a failure-detection algorithm in conjunction with its counterparts at
other processes. The object local to each process is called a local failure detector.
 Some properties of failure detectors: a failure detector is categorized into two types, namely
 1. unreliable failure detectors and 2. reliable failure detectors.
 An unreliable failure detector may produce one of two values for a process: Unsuspected or Suspected. Either result may or may not accurately reflect whether the process has actually failed.
 A result of Unsuspected signifies that the detector has recently received evidence suggesting that the process has not failed – for example, a message was recently received from it. But of course, the process may have failed since then.
62
Failure assumptions and failure detectors
 A result of Suspected signifies that the failure detector has some indication that
the process may have failed. For example, it may be that no message from the
process has been received for more than a nominal maximum length of silence
(even in an asynchronous system, practical upper bounds can be used as hints).
A reliable failure detector is one that is always accurate in detecting
a process’s failure. A result of Failed means that the detector has
determined that the process has crashed.
 Thus, a failure detector may sometimes give different responses to
different processes, since communication conditions vary from
process to process.
63
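The timeout-based behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the class and method names are mine, and the timeout value is arbitrary.

```python
import time

class UnreliableFailureDetector:
    """Suspect a process if nothing has been heard from it within `timeout`
    seconds. Suspicion is only a hint: a Suspected process may merely be slow."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_heard = {}                  # process id -> time of last message

    def heard_from(self, pid, now=None):
        # Record evidence that `pid` was alive at time `now`.
        self.last_heard[pid] = time.monotonic() if now is None else now

    def query(self, pid, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_heard.get(pid)
        if last is not None and now - last <= self.timeout:
            return "Unsuspected"              # recent evidence of life
        return "Suspected"                    # silence: may have failed

d = UnreliableFailureDetector(timeout=5.0)
d.heard_from("p1", now=100.0)
print(d.query("p1", now=103.0))   # Unsuspected: heard from 3s ago
print(d.query("p1", now=110.0))   # Suspected: silent for 10s
```

Note how the same query can flip from Unsuspected to Suspected purely through the passage of time, without the process having failed – exactly the unreliability the slide describes.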
Distributed mutual exclusion
 When a collection of processes share a resource or collection of resources, mutual exclusion is often required to prevent interference and ensure consistency when accessing the resources. This is the critical section problem familiar from the domain of operating systems.
 In a distributed system we require a solution to distributed mutual exclusion that is based solely on message passing.
 In some cases shared resources are managed by servers that also provide mechanisms for mutual exclusion, but in other practical cases a separate mechanism for mutual exclusion is required. 64
Algorithms for mutual exclusion
The application-level protocol for executing a critical section is as follows:
 enter() // enter critical section – block if necessary
 resourceAccesses() // access shared resources in critical section
 exit() // leave critical section – other processes may now enter
Our essential requirements for mutual exclusion are as follows:
 ME1 (safety): At most one process may execute in the critical section (CS) at a time.
 ME2 (liveness): Requests to enter and exit the critical section eventually succeed.
 ME3 (ordering): If one request to enter the CS happened-before another, then entry to the CS is granted in that order (a desirable fairness property, referred to when evaluating the algorithms below).
 Condition ME2 implies freedom from both deadlock and starvation.
65
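Within a single machine, the enter()/resourceAccesses()/exit() protocol is exactly what a lock provides. The sketch below uses Python threads to show ME1 in action: with the lock, the read-modify-write on the shared counter cannot interleave, so the final count is exact.

```python
import threading

# enter()/exit() realized with a lock. Eight threads each increment a shared
# counter 1000 times inside the critical section; because ME1 holds, the
# read-modify-write never interleaves and the final count is exactly 8000.
lock = threading.Lock()
count = 0

def worker(iterations):
    global count
    for _ in range(iterations):
        lock.acquire()        # enter(): block if necessary
        tmp = count           # resourceAccesses(): read-modify-write
        count = tmp + 1
        lock.release()        # exit(): other threads may now enter

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(count)  # 8000
```

The distributed algorithms that follow must achieve this same guarantee with message passing only – there is no shared lock object to acquire.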
Performance of Mutual Exclusion algorithms
The performance of algorithms for mutual exclusion according to
the following criteria:
 Bandwidth consumption, which is proportional to the number of messages sent in each entry and exit operation.
 The client delay incurred by a process at each entry and exit
operation.
 Throughput of the system. Rate at which the collection of processes
as a whole can access the critical section. Measure the effect using the
synchronization delay between one process exiting the critical
section and the next process entering it; the shorter the delay is, the
greater the throughput. 66
The central server algorithm
The simplest way to achieve mutual exclusion is to employ a server that grants permission to enter the critical section.
A process sends a request message to the server and awaits a reply from it.
The reply constitutes a token signifying permission to enter the critical section.
If no other process has the token at the time of the request, then the server replies immediately with the token.
If the token is currently held by another process, the server does not reply but queues the request.
On exiting the critical section, the client sends a release message to the server, returning the token.
67
The central server algorithm
ME1 (safety) and ME2 (liveness) are satisfied, but ME3 (ordering) is not.
Bandwidth: entering takes two messages (a request followed by a grant), delayed by the round-trip time; exiting takes one release message, and does not delay the exiting process.
Throughput is measured by the synchronization delay: the round trip of a complete release–grant cycle.
68
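The server's behaviour can be sketched as a small simulation (names are illustrative; a real server would receive these calls as network messages):

```python
from collections import deque

class CentralServer:
    """Sketch of the central server algorithm: the server owns a token and a
    FIFO queue of waiting requests."""

    def __init__(self):
        self.holder = None        # process currently holding the token
        self.queue = deque()      # queued requests, in arrival order

    def request(self, pid):
        """Handle a request message; return True if granted immediately."""
        if self.holder is None:
            self.holder = pid     # token free: grant at once
            return True
        self.queue.append(pid)    # token busy: queue, do not reply yet
        return False

    def release(self, pid):
        """Handle a release message; pass the token to the next waiter (if
        any) and return the new holder."""
        assert self.holder == pid
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder

server = CentralServer()
print(server.request("p1"))   # True: token free, granted at once
print(server.request("p2"))   # False: queued behind p1
print(server.request("p3"))   # False: queued behind p2
print(server.release("p1"))   # p2: token passed to next in queue
```

Because grants follow queue order at the server, requests are served in the order they *arrive* at the server, which is not necessarily happened-before order – hence ME3 fails.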
Ring-based Algorithm
 Simplest way to arrange mutual exclusion between N processes without
requiring an additional process is arrange them in a logical ring.
 Each process pi has a communication channel to the next process in the
ring, p(i+1)/mod N.
 The unique token is in the form of a message passed from process to
process in a single direction clockwise.
 If a process does not require entering the CS when it receives the token,
then it immediately forwards the token to its neighbor.
 A process requires the token waits until it receives it, but retains it.
 To exit the critical section, the process sends the token on to its neighbor
69
Ring-based Algorithm
[Figure: processes p1, p2, …, pN arranged in a ring, passing the token in one direction.]
ME1 (safety) and ME2 (liveness) are satisfied, but ME3 (ordering) is not.
Bandwidth: the token continuously consumes bandwidth, except when a process is inside the CS; exit requires only one message.
Delay: the entry delay experienced by a process ranges from 0 messages (it has just received the token) to N messages (it has just passed the token on).
Throughput: the synchronization delay between one exit and the next entry is anywhere from 1 to N message transmissions.
70
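One circulation of the token can be simulated in a few lines (a sketch only; the function name and return values are mine):

```python
def ring_token_order(n, requesters, token_at=0):
    """Simulate token circulation in a ring of n processes. `requesters` is
    the set of processes wanting the CS; the token starts at `token_at` and
    moves to (i+1) mod n. Returns the order of CS entries and the hop count."""
    entered, hops, pos = [], 0, token_at
    pending = set(requesters)
    while pending:
        if pos in pending:            # token holder wants the CS: enter,
            entered.append(pos)       # then forward the token on exit
            pending.discard(pos)
        pos = (pos + 1) % n           # forward token to the neighbour
        hops += 1
    return entered, hops

order, hops = ring_token_order(5, requesters={3, 1}, token_at=0)
print(order)   # [1, 3]: the token reaches p1 first, then p3
print(hops)    # 4 hops of token traffic for these two entries
```

Note that entry order is determined by ring position relative to the token, not by when the processes decided to request – another way of seeing why ME3 does not hold.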
Ricart and Agrawala’s algorithm
 On initialization
 state := RELEASED;
 To enter the section
 state := WANTED;
Multicast request to all processes; request processing deferred here
 T := request’s timestamp;
 Wait until (number of replies received = (N – 1));
 state := HELD;
 On receipt of a request <Ti, pi> at pj (i ≠ j)
 if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
 then
  queue request from pi without replying;
 Else
 reply immediately to pi;
  end if
 To exit the critical section state := RELEASED;
 reply to any queued requests; 71
Ricart and Agrawala’s algorithm
An algorithm using multicast and logical clocks:
Ricart and Agrawala developed an algorithm to implement
 mutual exclusion between N peer processes, based upon multicast.
 Processes that require entry to a critical section multicast a request message, and can enter it only when all the other processes have replied to this message.
 The conditions under which a process replies to a request are designed to ensure that requirements ME1, ME2 and ME3 are met.
 Each process pi keeps a Lamport clock. Messages requesting entry are of the form <T, pi>, where T is the sender’s timestamp.
 Each process records its state of either RELEASED, WANTED or HELD in a variable state. 72
Ricart and Agrawala’s algorithm
 P1 and P2 request the CS concurrently. The timestamp of P1’s request is 41 and that of P2’s is 34. When P3 receives their requests, it replies to both immediately. When P2 receives P1’s request, it finds that its own request has the lower timestamp, so it does not reply, holding P1’s request in a queue. P1, however, replies to P2, so P2 – having received replies from both P1 and P3 – enters the CS. When P2 exits, it replies to P1’s queued request, and P1 then enters the CS.
 Granting entry takes 2(N–1) messages: N–1 to multicast the request and N–1 replies.
 Bandwidth consumption is therefore high.
 Client delay is again one round-trip time.
 Synchronization delay is one message transmission time. 73
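The receive-side rule driving this example is a single comparison of (timestamp, process-id) pairs. The predicate below models it (a simplified sketch; the function name is mine, and Python tuple comparison supplies the tie-breaking by process id):

```python
def defers(receiver_state, receiver_req, incoming_req):
    """Ricart–Agrawala receive rule: the receiver queues the incoming request
    (defers its reply) iff it is in the CS, or it also wants the CS and its
    own (T, pid) pair is smaller. Requests are (timestamp, pid) tuples."""
    return receiver_state == "HELD" or (
        receiver_state == "WANTED" and receiver_req < incoming_req)

# The slide's scenario: p1 requests with timestamp 41, p2 with timestamp 34,
# and p3 has no request of its own (state RELEASED).
p1_req, p2_req = (41, 1), (34, 2)

print(defers("RELEASED", None, p1_req))   # False: p3 replies to p1 at once
print(defers("RELEASED", None, p2_req))   # False: p3 replies to p2 at once
print(defers("WANTED", p2_req, p1_req))   # True: p2 defers p1 (34 < 41)
print(defers("WANTED", p1_req, p2_req))   # False: p1 replies to p2 (41 > 34)
```

So p2 collects both replies and enters first, exactly as in the slide's walkthrough; p1 waits until p2's exit releases the deferred reply.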
Maekawa’s voting algorithm
 In 1985, Maekawa observed that in order for a process to enter a critical section, it is not necessary for all of its peers to grant access: it need only obtain permission from a subset of its peers, as long as the subsets used by any two processes overlap.
 Think of processes as voting for one another to enter the CS: a candidate process must collect sufficient votes to enter.
 Processes in the intersection of two sets of voters ensure the safety property ME1 by casting their votes for only one candidate at a time.
 Maekawa associated a voting set Vi with each process pi (i = 1, 2, …, N).
 There is at least one common member of any two voting sets; to be fair, all voting sets have the same size K, and each process is contained in M voting sets.
 The optimal solution that minimizes K is K ≈ √N, with M = K. 74
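A standard way to build overlapping voting sets of size close to √N is a grid construction: arrange the processes in a √N × √N grid and let Vi be the union of pi's row and column. Any two such sets intersect, because pi's row always crosses pj's column. This is one possible construction, not Maekawa's only scheme:

```python
import math

def grid_voting_sets(n):
    """Build voting sets from a sqrt(n) x sqrt(n) grid of processes: Vi is
    the union of pi's row and column, so |Vi| = 2*sqrt(n) - 1, close to the
    optimal K ~ sqrt(N). Assumes n is a perfect square."""
    k = math.isqrt(n)
    assert k * k == n, "grid construction needs a perfect square"
    sets = []
    for i in range(n):
        r, c = divmod(i, k)
        row = {r * k + j for j in range(k)}   # pi's row in the grid
        col = {j * k + c for j in range(k)}   # pi's column in the grid
        sets.append(row | col)
    return sets

V = grid_voting_sets(9)
print(len(V[0]))                                             # 5 = 2*3 - 1
print(all(V[i] & V[j] for i in range(9) for j in range(9)))  # True
```

The second print confirms the overlap property that makes ME1 enforceable: every pair of voting sets shares at least one voter, who can only vote for one candidate at a time.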
Fault tolerance
 The main points to consider when evaluating the above algorithms with respect to fault
tolerance are:
 What happens when messages are lost?
 What happens when a process crashes?
 None of the algorithm that we have described would tolerate the loss of messages if the channels were
unreliable.
 The ring-based algorithm cannot tolerate any single process crash failure.
 Maekawa’s algorithm can tolerate some process crash failures: a crash is tolerable if the crashed process is not in any voting set that is required.
 The central server algorithm can tolerate the crash failure of a client process that neither holds
nor has requested the token.
 The Ricart and Agrawala algorithm as we have described it can be adapted to tolerate the crash
failure of such a process by taking it to grant all requests implicitly.
75
Elections
 Algorithm to choose a unique process to play a particular role is called an
election algorithm. E.g. central server for mutual exclusion, one process will be
elected as the server. Everybody must agree. If the server wishes to retire, then
another election is required to choose a replacement.
 Requirements:
 E1 (safety): A participant process pi has electedi = ⊥ (not yet defined) or electedi = P, where P is the non-crashed process at the end of the run with the largest identifier.
 E2 (liveness): All processes pi participate and eventually set electedi ≠ ⊥ – or crash.
76
Elections
▪ A ring based election algorithm:
▪ All processes arranged in a logical ring.
▪ Each process has a communication channel to the next process.
▪ All messages are sent clockwise around the ring.
▪ Assume that no failures occur, and system is asynchronous.
▪ Goal is to elect a single process coordinator which has the largest identifier.
[Figure: a ring of processes with identifiers 3, 17, 24, 15, 28; election messages are passed clockwise.]
77
A ring based election algorithm
1. Initially, every process is marked as non-participant. Any process can begin an election.
2. The starting process marks itself as a participant and places its identifier in a message to its neighbour.
3. A process that receives a message compares the identifier in it with its own. If the arrived identifier is larger, it passes on the message.
4. If the arrived identifier is smaller and the receiver is not a participant, it substitutes its own identifier in the message and forwards it. It does not forward the message if it is already a participant.
5. On forwarding a message in either case, the process marks itself as a participant.
6. If the received identifier is that of the receiver itself, then this process’s identifier must be the greatest, and it becomes the coordinator.
7. The coordinator marks itself as non-participant, sets electedi and sends an elected message to its neighbour, enclosing its identifier.
8. When a process receives an elected message, it marks itself as non-participant, sets its variable electedi and forwards the message.
9. E1 is met: all identifiers are compared, since a process must receive its own identifier back before sending an elected message.
10. E2 is also met, due to the guaranteed traversal of the ring.
11. The algorithm tolerates no failures, which makes it of limited practical use. 78
A ring based election algorithm
The election was started by process 17.
The highest process identifier encountered so far is 24.
Participant processes are shown darkened
 If only a single process starts an election, the worst-performing case is when its anti-clockwise neighbour has the highest identifier. A total of N–1 messages is then used to reach this neighbour, which will not announce its election until its identifier has completed another circuit, taking a further N messages.
 The elected message is then sent N times, making 3N–1 messages in all.
 The turnaround time is also 3N–1 sequential message transmission times.
79
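The election phase and the 3N–1 worst-case count can be checked with a short simulation (a sketch assuming distinct identifiers and no failures; the function name is mine):

```python
def ring_election(ids, starter):
    """Simulate a ring election: `ids` lists the process identifiers in
    clockwise order, `starter` is an index into it. Returns the elected
    identifier and the total message count (election + elected messages)."""
    n, msgs = len(ids), 0
    token, pos = ids[starter], (starter + 1) % n
    # Election phase: forward the larger of (token, own id) until the
    # process with the maximum id receives its own identifier back.
    while True:
        msgs += 1                       # one election message per hop
        if ids[pos] == token:
            break                       # own id came back: coordinator found
        token = max(token, ids[pos])
        pos = (pos + 1) % n
    msgs += n                           # elected message traverses the ring
    return token, msgs

# The slide's ring: identifiers 3, 17, 24, 15, 28 clockwise.
print(ring_election([3, 17, 24, 15, 28], starter=1)[0])   # 28 is elected
print(ring_election([3, 17, 24, 15, 28], starter=0)[1])   # 14 = 3*5 - 1
```

Starting at index 0 puts the highest identifier (28) at the anti-clockwise neighbour, which is exactly the worst case: 4 + 5 election messages plus 5 elected messages, i.e. 3N–1 for N = 5.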
The bully algorithm
 Allows processes to crash during an election, although it assumes that
message delivery between processes is reliable.
 Assume system is synchronous to use timeouts to detect a process failure.
 Assume each process knows which processes have higher identifiers and
that it can communicate with all such processes.
 Three types of messages:
 An election message is sent to announce an election. A process begins an election when it notices, through timeouts, that the coordinator has failed. A timeout of T = 2Ttrans + Tprocess from the time of sending suffices to conclude that no answer is coming.
 An answer message is sent in response to an election message.
 A coordinator message is sent to announce the identity of the elected process.
80
The bully algorithm
81
The bully algorithm
 A process begins an election by sending an election message to those processes that have a higher identifier, and awaits answer messages in response.
 If none arrives within time T, the process considers itself the coordinator and sends a coordinator message to all processes with lower identifiers. Otherwise, it waits a further time T’ for a coordinator message to arrive; if none arrives, it begins another election.
 If a process receives a coordinator message, it sets its variable electedi to the coordinator’s identifier.
 If a process receives an election message, it sends back an answer message and begins another election, unless it has begun one already.
 E1 may be broken if timeouts are inaccurate or if a crashed process is replaced. (Suppose p3 crashes and is replaced by another process: p2 may set p3 as coordinator while p1 sets p2 as coordinator.)
 E2 is clearly met by the assumption of reliable message delivery.
 In the best case, the process with the second-highest identifier notices the coordinator’s failure; it can immediately elect itself and send N–2 coordinator messages.
 The bully algorithm requires O(N^2) messages in the worst case – that is, when the process with the lowest identifier first detects the coordinator’s failure. In that case N–1 processes altogether begin elections, each sending messages to processes with higher identifiers. 82
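The overall outcome can be sketched as a simulation. This is a deliberate simplification: it collapses the concurrent elections into one sequential pass (whereas in a real run every answering process starts its own election), but it shows why the highest surviving identifier always wins under the reliable-delivery, synchronous assumptions:

```python
def bully_election(alive_ids, initiator):
    """Simplified bully sketch: repeatedly hand the candidacy to the highest
    answering process until no higher process answers. Returns the elected
    coordinator and a rough message count for this sequential run."""
    msgs, candidate = 0, initiator
    while True:
        higher = [p for p in alive_ids if p > candidate]
        msgs += len(higher)              # election messages to higher ids
        if not higher:
            break                        # no answer within T: candidate wins
        msgs += len(higher)              # answer messages come back
        candidate = max(higher)          # highest responder takes over
    msgs += sum(1 for p in alive_ids if p < candidate)   # coordinator msgs
    return candidate, msgs

# p4 (the old coordinator) has crashed; p1 notices and starts an election.
coord, msgs = bully_election(alive_ids=[1, 2, 3], initiator=1)
print(coord)   # 3: the highest surviving identifier becomes coordinator
```

The "bully" behaviour is visible in the `candidate = max(higher)` step: any higher-numbered process that answers takes the election over from the initiator.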
Multicast communication
 A multicast operation is an operation that sends a single message from one process to each of the members of a group of processes, in such a way that the membership of the group is transparent to the sender.
 Multicast messages provide a useful infrastructure for constructing distributed systems with the
following characteristics:
▪ Fault tolerance based on replicated services: A replicated service consists of a group of servers. Client requests are
multicast to all the members of the group, each of which performs an identical operation. Even when some of the
members fail, clients can still be served.
▪ Discovering services in spontaneous networking: It defines service discovery in the context of spontaneous
networking. Multicast messages can be used by servers and clients to locate available discovery services in order
to register their interfaces or to look up the interfaces of other services in the distributed system.
▪ Better performance through replicated data: Data are replicated to increase the performance of a service – in some
cases replicas of the data are placed in users’ computers. Each time the data changes, the new value is multicast to the processes managing the replicas.
▪ Propagation of event notifications: Multicast to a group may be used to notify processes when something happens. For example, in Facebook, when someone changes their status, all their friends receive notifications. Similarly, publish-subscribe protocols may make use of group multicast to disseminate events to subscribers. 83
IP multicast – An implementation of multicast communication
 For group communication, various group communication protocols are used. In addition, IP multicast is provided, and Java presents an API to it through the MulticastSocket class.
 IP multicast: IP multicast is built on top of the Internet Protocol (IP). Note that IP packets are addressed to computers, while ports belong to the TCP and UDP levels. IP multicast allows the sender to transmit a single IP packet to a set of computers that form a multicast group. The sender is unaware of the identities of the individual recipients and of the size of the group.
84
IP multicast – An implementation of multicast communication
 At the IP level, a computer belongs to a multicast group when one or more of its processes have sockets that belong to that group.
When a multicast message arrives at a computer, copies are forwarded to all of the local sockets that have joined the specified
multicast address and are bound to the specified port number. The following details are specific to IPv4:
 Multicast routers: IP packets can be multicast both on a local network and on the wider Internet. Local multicasts use the multicast capability of the local network, for example of an Ethernet. Internet multicasts make use of multicast routers, which forward single datagrams to routers on other networks, where they are again multicast to local members.
 Multicast address allocation: Class D addresses (that is, addresses in the range 224.0.0.0 to 239.255.255.255) are reserved for multicast traffic and managed globally by the Internet Assigned Numbers Authority (IANA). The management of this address space is reviewed annually, with current practice documented in RFC 3171. This document defines a partitioning of the address space into a number of blocks, including:
▪ Local Network Control Block (224.0.0.0 to 224.0.0.255), for multicast traffic within a given local network.
▪ Internet Control Block (224.0.1.0 to 224.0.1.255).
▪ Ad Hoc Control Block (224.0.2.0 to 224.0.255.0), for traffic that does not fit any other block.
▪ Administratively Scoped Block (239.0.0.0 to 239.255.255.255), which is used to implement a scoping mechanism for multicast traffic (to constrain propagation).
 Multicast addresses may be permanent or temporary. Permanent groups exist even when there are no members; their addresses are assigned by IANA and span the various blocks mentioned above.
85
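The block boundaries above are easy to check programmatically. The helper below classifies an IPv4 address into the listed blocks using the standard `ipaddress` module (the function name is mine, and only the blocks named on this slide are distinguished):

```python
import ipaddress

def multicast_block(addr):
    """Classify an IPv4 address into the multicast blocks listed above.
    Returns a block name, 'other multicast', or 'not multicast'."""
    a = int(ipaddress.IPv4Address(addr))
    def in_range(lo, hi):
        return int(ipaddress.IPv4Address(lo)) <= a <= int(ipaddress.IPv4Address(hi))
    if not in_range("224.0.0.0", "239.255.255.255"):
        return "not multicast"                 # outside the Class D range
    if in_range("224.0.0.0", "224.0.0.255"):
        return "Local Network Control"
    if in_range("224.0.1.0", "224.0.1.255"):
        return "Internet Control"
    if in_range("239.0.0.0", "239.255.255.255"):
        return "Administratively Scoped"
    return "other multicast"

print(multicast_block("224.0.0.1"))    # Local Network Control (all-hosts group)
print(multicast_block("239.1.2.3"))    # Administratively Scoped
print(multicast_block("10.0.0.1"))     # not multicast
```

Administratively scoped addresses (239/8) are the ones an organization can use internally without global coordination, which is why they carry the propagation-constraining scoping mechanism.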
IP multicast – An implementation of multicast communication
 Failure model for multicast datagrams: Datagrams multicast over IP multicast have the same failure characteristics as UDP datagrams – that is, they suffer from omission failures. The effect on a multicast is that messages are not guaranteed to be delivered to any particular group member in the face of even a single omission failure. This is called unreliable multicast, because it does not guarantee that a message will be delivered to any member of the group.
 Java API to IP multicast : The Java API provides a datagram interface to IP multicast through the
class MulticastSocket, which is a subclass of DatagramSocket with the additional capability of being
able to join multicast groups. The class MulticastSocket provides two alternative constructors,
allowing sockets to be created to use either a specified local port or any free local port.
 A process can join a multicast group with a given multicast address by invoking the joinGroup
method of its multicast socket. Effectively, the socket joins a multicast group at a given port and it
will receive datagrams sent by processes on other computers to that group at that port. A process
can leave a specified group by invoking the leaveGroup method of its multicast socket. 86