5. Distributed File Systems
Contents
Introduction
File Service Architecture
Napster and its Legacy
Peer-to-Peer Systems
Peer-to-Peer Middleware
Routing Overlays
Coordination and Agreement
Distributed Mutual Exclusion
Elections
Multicast Communication
Storage Models
• Manual file system – paper ledgers
• Centralized file system
• Distributed file system
• Decentralized file system
Computing Evolution
Data size: small → large
• Pipelined – instruction-level parallelism – single-core
• Concurrent – thread-level parallelism – multi-core
• Service – object-level parallelism – cluster
Hardware configurations:
• Single-core, single processor
• Single-core, multi-processor
• Multi-core, single processor
• Multi-core, multi-processor
• Cluster of processors (single or multi-core) with shared memory
• Cluster of processors with distributed memory
Traditional Storage Solutions
• RAID: Redundant Array of Inexpensive Disks
• NAS: Network-Attached Storage
• SAN: Storage Area Network
What is a DFS?
A DFS enables programs to store and access remote files and storage exactly as they do local ones.
The performance and reliability of such access should be comparable to that for files stored locally.
Recent advances in the bandwidth of switched local networks and in disk organization have led to high-performance, highly scalable file systems.
Functional requirements: open, close, read, write, access control, directory organization, etc.
Non-functional requirements: scalability, fault tolerance, security, etc.
File system modules
Best practice #1: When designing systems, think in terms of modules of functionality.
Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5
© Pearson Education 2012
File attribute record structure
File length
Creation timestamp
Read timestamp
Write timestamp
Attribute timestamp
Reference count
Owner
File type
Access control list
Different file storage services
UNIX file system operations
Distributed File System Requirements
Many of the requirements of distributed services were lessons learned from the distributed file service.
The first needs were access transparency and location transparency.
Later, performance, scalability, concurrency control, fault tolerance and security requirements emerged and were met in subsequent phases of DFS development.
A distributed file system is typically implemented using the client/server model.
Transparency
Access transparency: Client programs should be unaware of the distribution of files.
Location transparency: Client programs should see a uniform namespace. Files should be able to be relocated without changing their path names.
Symbolic links help here. Cygwin, a UNIX-like interface to Windows, uses symbolic links extensively.
Mobility transparency: Neither client programs nor system administration tables in the client nodes need to change when files are moved, whether automatically or by the system administrator.
Performance transparency: Client programs should continue to perform well while the load varies within a specified range.
Scaling transparency: Growth in storage capacity and network size should be transparent.
Other Requirements
Concurrent file updates must be controlled (e.g., by record locking).
File replication improves performance and availability.
Hardware and operating system heterogeneity should be accommodated.
Fault tolerance
Consistency: UNIX offers one-copy update semantics; this may be difficult to achieve in a DFS.
Security
Efficiency
General File Service Architecture
The responsibilities of a DFS are typically distributed among three modules:
A client module, which emulates the conventional file system interface
Two server modules (a flat file service and a directory service), which perform operations for clients on files and on directories
Most importantly, this architecture enables a stateless implementation of the server modules.
Our approach to the design of a distributed system:
Architecture,
API,
Protocols,
Implementation
File service architecture model
An architecture that offers a clear separation of the main concerns in providing access to files is obtained by structuring the file service as three components:
• A flat file service
• A directory service
• A client module
(Figure: the client module runs in the client computer; the directory service and flat file service run in the server computer.)
Flat file service
The flat file service is concerned with implementing operations on the contents of files. Unique file identifiers (UFIDs) are used to refer to files in all requests for flat file service operations. With the use of UFIDs, each file has a unique identifier throughout the distributed system; the directory service maps text names to UFIDs, and the client module presents a conventional file system interface to the flat file service.
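The split between a flat file service (operating on UFIDs) and a directory service (mapping names to UFIDs) can be sketched as follows; the class and method names are illustrative, not from any real implementation.

```python
import uuid

class FlatFileService:
    """Sketch of a flat file service: operations act on file contents,
    addressed only by UFIDs (no path names at this layer)."""

    def __init__(self):
        self._files = {}  # UFID -> bytearray of file contents

    def create(self):
        ufid = uuid.uuid4().hex       # unique file identifier
        self._files[ufid] = bytearray()
        return ufid

    def write(self, ufid, offset, data):
        f = self._files[ufid]
        f[offset:offset + len(data)] = data

    def read(self, ufid, offset, count):
        return bytes(self._files[ufid][offset:offset + count])

class DirectoryService:
    """Sketch of the directory service: maps human-readable text names
    to UFIDs, which clients then present to the flat file service."""

    def __init__(self):
        self._names = {}

    def add_name(self, name, ufid):
        self._names[name] = ufid

    def lookup(self, name):
        return self._names[name]
```

The separation means the flat file service never sees path names, which is what makes a stateless, UFID-addressed server implementation possible.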
Directory service interface
Network File System
The Network File System (NFS) was developed to allow machines to
mount a disk partition on a remote machine as if it were on a local
hard drive. This allows for fast, seamless sharing of files across a
network.
NFS architecture
(Figure: a client computer and a server computer each run a UNIX kernel containing a virtual file system (VFS) layer. Application programs make UNIX system calls; the client's VFS directs requests for local files to the local UNIX file system and requests for remote files to the NFS client, which communicates with the NFS server on the server computer via the NFS protocol. The server's VFS passes requests on to its local UNIX file system.)
NFS server operations (simplified) – 1
lookup(dirfh, name) -> fh, attr: Returns the file handle and attributes for the file name in the directory dirfh.
create(dirfh, name, attr) -> newfh, attr: Creates a new file name in directory dirfh with attributes attr and returns the new file handle and attributes.
remove(dirfh, name) -> status: Removes file name from directory dirfh.
getattr(fh) -> attr: Returns the file attributes of file fh. (Similar to the UNIX stat system call.)
setattr(fh, attr) -> attr: Sets the attributes (mode, user id, group id, size, access time and modify time) of a file. Setting the size to 0 truncates the file.
read(fh, offset, count) -> attr, data: Returns up to count bytes of data from a file starting at offset. Also returns the latest attributes of the file.
write(fh, offset, count, data) -> attr: Writes count bytes of data to a file starting at offset. Returns the attributes of the file after the write has taken place.
rename(dirfh, name, todirfh, toname) -> status: Changes the name of file name in directory dirfh to toname in directory todirfh.
link(newdirfh, newname, dirfh, name) -> status: Creates an entry newname in the directory newdirfh which refers to the file name in the directory dirfh.
NFS server operations (simplified) – 2
symlink(newdirfh, newname, string) -> status: Creates an entry newname in the directory newdirfh of type symbolic link with the value string. The server does not interpret the string but makes a symbolic link file to hold it.
readlink(fh) -> string: Returns the string that is associated with the symbolic link file identified by fh.
mkdir(dirfh, name, attr) -> newfh, attr: Creates a new directory name with attributes attr and returns the new file handle and attributes.
rmdir(dirfh, name) -> status: Removes the empty directory name from the parent directory dirfh. Fails if the directory is not empty.
readdir(dirfh, cookie, count) -> entries: Returns up to count bytes of directory entries from the directory dirfh. Each entry contains a file name, a file handle, and an opaque pointer to the next directory entry, called a cookie. The cookie is used in subsequent readdir calls to start reading from the following entry. If the value of cookie is 0, reads from the first entry in the directory.
statfs(fh) -> fsstats: Returns file system information (such as block size, number of free blocks and so on) for the file system containing file fh.
NFS Overview
Remote Procedure Calls (RPC) are used for communication between client and server.
Client implementation
Provides transparent access to the NFS file system
UNIX contains a Virtual File System (VFS) layer
Vnode: the interface for operations on an individual file
The client translates vnode operations into NFS RPCs
Server implementation
Stateless: the server must not keep anything only in memory
Implication: all modified data must be written to stable storage before returning control to the client
Servers often add NVRAM to improve performance
Mapping UNIX System Calls to NFS Operations
UNIX system call: fd = open(“/dir/foo”)
Traverse the pathname to get the file handle for foo:
▪ dirfh = lookup(rootdirfh, “dir”);
▪ fh = lookup(dirfh, “foo”);
Record the mapping from file descriptor fd to NFS file handle fh
Set the initial file offset for fd to 0
Return the file descriptor fd
UNIX system call: read(fd, buffer, bytes)
Get the current file offset for fd
Map fd to the NFS file handle fh
Call data = read(fh, offset, bytes) and copy the data into buffer
Increment the file offset by bytes
UNIX system call: close(fd)
Free the resources associated with fd
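The mapping above can be sketched in code. NFSClientSketch is a hypothetical client module, not real NFS internals; server stands in for the server's lookup/read RPCs and can be any object providing those two methods.

```python
class NFSClientSketch:
    """Illustrative client-side mapping of UNIX open/read/close onto
    NFS-style lookup/read operations. All state (fd table, offsets)
    lives on the client; the server stays stateless."""

    def __init__(self, server, rootdirfh):
        self.server = server
        self.rootdirfh = rootdirfh
        self.fd_table = {}   # fd -> [file handle, current offset]
        self.next_fd = 3

    def open(self, path):
        fh = self.rootdirfh
        for component in path.strip("/").split("/"):
            fh = self.server.lookup(fh, component)  # one lookup RPC per component
        fd = self.next_fd
        self.next_fd += 1
        self.fd_table[fd] = [fh, 0]                 # initial offset is 0
        return fd

    def read(self, fd, count):
        fh, offset = self.fd_table[fd]
        data = self.server.read(fh, offset, count)  # read RPC
        self.fd_table[fd][1] += len(data)           # advance offset locally
        return data

    def close(self, fd):
        del self.fd_table[fd]                       # no server interaction
```

Note that close involves no RPC at all: because the server keeps no open-file state, only the client's table needs cleaning up.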
NFS file handles
The file identifiers used in NFS are called file handles. In UNIX implementations of NFS, the file handle is derived from the file's i-node number by adding two extra fields: a filesystem identifier and an i-node generation number. (The i-node number of a UNIX file is a number that identifies and locates the file within the file system in which it is stored.)
Client-side Caching
Caching needed to improve performance
Reads: Check local cache before going to server
Writes: Only periodically write-back data to server
Avoid contacting server
▪ Avoid slow communication over network
▪ Server becomes scalability bottleneck with more clients
Two client caches
data blocks
attributes (metadata)
Cache Consistency
Problem: Consistency across multiple copies (server and multiple
clients)
How to keep data consistent between client and server?
▪ If file is changed on server, will client see update?
▪ Determining factor: Read policy on clients
NFS Consistency: Reads
Reads: How does the client keep current with server state?
Attribute cache: used to determine when a file changes
On file open, the client checks with the server to see if the attributes have changed, if it hasn't checked within the past T seconds (configurable, e.g. T = 3)
Attribute cache entries are discarded every N seconds (configurable, e.g. N = 60)
Data cache
▪ Discard all blocks of a file if the attributes show the file has been modified
Example: the client cache holds file A's attributes and blocks 1, 2, 3
Client opens A
Client reads block 1
Client waits 70 seconds
Client reads block 2
Block 3 is changed on the server
Client reads block 3
Client reads block 4
Client waits 70 seconds
Client reads block 1
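The attribute-cache policy above (revalidate if not checked within T seconds; drop cached blocks when the modify time changes) can be sketched as follows. The get_server_mtime callback stands in for a getattr RPC and is an assumption of this sketch, not a real NFS interface.

```python
import time

class AttrCacheEntry:
    """Sketch of NFS-style client caching for one file: cached attributes
    are revalidated with the server if older than T seconds; cached data
    blocks are discarded when the server's modify time has changed."""

    def __init__(self, get_server_mtime, T=3.0):
        self.get_server_mtime = get_server_mtime  # stands in for a getattr RPC
        self.T = T
        self.cached_mtime = None
        self.checked_at = float("-inf")
        self.blocks = {}              # block number -> cached data

    def validate(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.checked_at < self.T:
            return                        # checked recently: trust the cache
        self.checked_at = now
        mtime = self.get_server_mtime()   # ask the server for fresh attributes
        if mtime != self.cached_mtime:
            self.blocks.clear()           # file changed: drop stale blocks
            self.cached_mtime = mtime
```

This also illustrates the consistency window the slides' example exploits: within T seconds of a check, a stale block can still be served.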
NFS Consistency: Writes
Writes: How does client update server?
Files
Write-back from client cache to server every 30 seconds
Also, Flush on close()
Directories
Synchronously write to server
Example: Client X and Y have file A (blocks 1,2,3) cached
Clients X and Y open file A
Client X writes to blocks 1 and 2
Client Y reads block 1
30 seconds later...
Client Y reads block 2
40 seconds later...
Client Y reads block 1
NFS Architecture
NFS allows an arbitrary collection of clients and servers to share a common file system.
In many cases all servers and clients are on the same LAN, but this is not required.
NFS allows a machine to be both a client and a server at the same time.
Each NFS server exports one or more directories for access by remote clients.
NFS Protocol
One of the goals of NFS is to support a heterogeneous system, with clients and servers running different operating systems on different hardware. It is therefore essential that the interface between clients and servers be well defined.
NFS accomplishes this by defining two client-server protocols: one for handling mounting and another for directory and file access.
A protocol defines the requests sent by clients and the responses returned by servers.
Local and remote file systems accessible on an NFS client
(Figure: directory trees of Server 1, the client, and Server 2, with remote mounts linking the client's tree to sub-trees on the two servers.)
Note: The file system mounted at /usr/students in the client is actually the sub-tree located at /export/people in Server 1; the file system mounted at /usr/staff in the client is actually the sub-tree located at /nfs/users in Server 2.
Mounting
File Access
NFS supports most UNIX operations except open and close. This preserves "statelessness" on the server side: the server need not keep a list of open connections. (By contrast, consider a database connection: you create an object, a connection is opened, and so on.)
Implementation
Below the usual system call layer, the NFS-specific Virtual File System (VFS) layer maintains an entry, called a vnode (virtual i-node), for every open file.
A vnode indicates whether a file is local or remote.
For remote files, extra information is provided.
For local files, the file system and i-node are specified.
Let's see how vnodes are used, following mount, open and read system calls from a client application.
Vnode use
To mount a remote file system, the system administrator (or /etc/rc) calls the mount program, specifying the remote directory, the local directory on which to mount it, and other information.
If the remote directory exists and is available for mounting, the mount system call is made.
The kernel constructs a vnode for the remote directory and asks the NFS client code to create an r-node (remote i-node) in its internal tables. The vnode in the client VFS points either to a local i-node or to this r-node.
Remote File Access
When a remote file is opened by the client, the kernel locates the r-node.
It then asks the NFS client to open the file. The NFS client looks up the path in the remote file system and returns the file handle, which is stored in the VFS tables.
The caller (application) is given a file descriptor for the remote file. No table entries are made on the server side.
Subsequent reads go to the remote server; for efficiency's sake, transfers are usually done in large chunks (e.g., 8 KB).
Server Side of File Access
Peer-to-Peer Systems
Peer-to-peer is an approach to computer networking in which all computers share equivalent responsibility for processing data. Peer-to-peer networking (also known simply as peer networking) differs from client-server networking, where certain devices have responsibility for providing or "serving" data and other devices consume or otherwise act as "clients" of those servers.
The goal of peer-to-peer systems is to enable the sharing of data and resources on a very large scale by eliminating any requirement for separately managed servers and their associated infrastructure.
Peer-to-Peer Systems
Goal Of Peer-to-peer Systems
Peer-to-peer systems aim to support useful distributed services and applications using the data and computing resources available in the personal computers and workstations that are present on the Internet and other networks in ever-increasing numbers.
An alternative to the client/server model of distributed computing is the peer-to-peer model. Client/server is naturally hierarchical, with resources centralized on a limited number of servers.
In peer-to-peer networks, both resources and control are widely distributed among nodes that are theoretically equals. (A node with more information, better information, or more power may be "more equal," but that is a function of the node, not the network controllers.)
Robustness, availability of information and fault tolerance tend to come from redundancy and shared responsibility instead of from the planning, organization and investment of a controlling authority.
Peer-to-peer applications exploit resources available at the edges of the Internet, such as storage, cycles, content and human presence.
Client/Server vs. Peer-to-Peer
Characteristics
Advantages of peer-to-peer networking over client/server networking:
It is easy to install, as is the configuration of computers on this network.
All resources and contents are shared by all the peers, unlike client/server architecture, where the server provides all the contents and resources.
P2P is more reliable because the central dependency is eliminated. Failure of one peer does not affect the functioning of other peers, whereas in a client/server network, if the server goes down, the whole network is affected.
There is no need for a full-time system administrator. Every user is the administrator of their own machine and can control their shared resources.
The overall cost of building and maintaining this type of network is comparatively low.
Disadvantages of peer-to-peer architecture compared with client/server:
Security is weak: viruses, spyware, trojans and other malware can easily be transmitted over a P2P architecture.
Data recovery and backup are difficult; each computer needs its own backup system.
Many movies, music files and other copyrighted files are transferred using this type of file sharing; P2P is the technology used in torrents.
Applications
Theory
Dynamic discovery of information
Better utilization of bandwidth, processor, storage, and other resources
Each user contributes resources to network
Practical examples
Sharing browser cache over 100Mbps lines
Disk mirroring using spare capacity
Deep search beyond the web
Features
☻Peer-to-peer middleware: The third generation is characterized by the
emergence of middleware layers for the application-independent
management of distributed resources on a global scale. Several research
teams have now completed the development, evaluation and refinement
of peer-to-peer middleware platforms and demonstrated or deployed
them in a range of application services.
☻Routing Overlays: Routing overlays share many characteristics with the
IP packet routing infrastructure that constitutes the primary
communication mechanism of the Internet. It is therefore legitimate to
ask why an additional application-level routing mechanism is required in
peer-to-peer systems.
Napster and its legacy
Lessons learned from Napster
Napster demonstrated the feasibility of building a useful large-
scale service that depends almost wholly on data and computers
owned by ordinary Internet users.
To avoid swamping the computing resources of individual users (for
example, the first user to offer a chart-topping song) and their
network connections, Napster took account of network locality –
the number of hops between the client and the server – when
allocating a server to a client requesting a song.
This simple load distribution mechanism enabled the service to
scale to meet the needs of large numbers of users.
Limitations
Napster used a (replicated) unified index of all available music files. For the
application in question, the requirement for consistency between the
replicas was not strong, so this did not hamper performance, but for many
applications it would constitute a limitation.
Application dependencies: Napster took advantage of the special
characteristics of the application for which it was designed in other ways:
Music files are never updated, avoiding any need to make sure all the replicas of files
remain consistent after updates.
No guarantees are required concerning the availability of individual files – if a music
file is temporarily unavailable, it can be downloaded later. This reduces the
requirement for dependability of individual computers and their connections to the
Internet.
Peer-to-Peer middleware
Peer-to-peer middleware provides mechanisms for accessing data resources anywhere in the network. A key problem in peer-to-peer applications is to provide a way for clients to access data resources efficiently.
Similar needs in client/server technology led to solutions like NFS. However, NFS relies on pre-configuration and is not scalable enough for peer-to-peer use.
Functional & Non-Functional Requirements
Functional Requirements :
Simplify construction of services across many hosts in wide network
Add and remove resources at will
Interface to application programmers should be simple and independent of types of
distributed resources.
Non-Functional Requirements :
Global scalability: peer-to-peer applications must exploit the hardware resources of very large numbers of hosts connected to the Internet.
Load balancing: the performance of any system designed to exploit a large number of computers depends upon the balanced distribution of workload across them. This is achieved by a random placement of resources together with the use of replicas of heavily used resources.
Functional & Non-Functional Requirements
Optimization for local interactions between neighboring peers: The ‘network
distance’ between nodes that interact has a substantial impact on the latency of
individual interactions, such as client requests for access to resources. The middleware
should aim to place resources close to the nodes that access them the most.
Accommodation to highly dynamic host availability: Most peer-to-peer systems are
constructed from host computers that are free to join or leave the system at any time.
Security of data in an environment with heterogeneous trust: in global-scale systems with participants from many different organizations, trust must be built up through the use of authentication and encryption mechanisms.
Anonymity, deniability and resistance to censorship.
Routing Overlays
Routing overlays share many characteristics with the IP packet routing
infrastructure that constitutes the primary communication mechanism
of the Internet. It is therefore legitimate to ask why an additional
application-level routing mechanism is required in peer-to-peer systems.
1. A routing overlay is a distributed algorithm for a middleware layer responsible for routing
requests from any client to a host that holds the object to which the request is addressed.
2. Any node can access any object by routing each request through a sequence of nodes,
exploiting knowledge at each of them to locate the destination object.
3. Global User IDs (GUID) also known as opaque identifiers are used as names, but do not
contain location information.
4. A client wishing to invoke an operation on an object submits a request including the object’s
GUID to the routing overlay, which routes the request to a node at which a replica of the
object resides.
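The GUID-to-node mapping that a routing overlay provides can be illustrated with a toy sketch. Real overlays such as Pastry or Chord route with only partial knowledge of the network in O(log N) hops; this sketch gives every node full knowledge of a small ring, and shows only how a location-independent GUID determines which node serves a request.

```python
import hashlib
from bisect import bisect_left

ID_SPACE = 2**16  # small identifier space, for illustration only

def guid(name):
    """Location-independent GUID: an opaque hash of the object's name."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % ID_SPACE

class RoutingOverlaySketch:
    """Toy routing overlay: a request for a GUID is delivered to the
    first node clockwise from the GUID on the identifier ring."""

    def __init__(self, node_ids):
        self.ring = sorted(node_ids)

    def route(self, g):
        i = bisect_left(self.ring, g % ID_SPACE)
        return self.ring[i % len(self.ring)]   # wrap past the top of the ring
```

Because the GUID carries no location information, the same object name always hashes to the same point on the ring, no matter which node issues the request.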
IP vs. Peer-to-Peer Routing Overlay
Coordination and Agreement
This topic covers the algorithms by which a set of processes coordinate their actions or agree on one or more values.
The reason for avoiding fixed master-slave relationships is that we often require our systems to keep working correctly even if failures occur, so we need to avoid single points of failure, such as fixed masters.
The failure model is another important consideration in distributed systems. We begin by considering algorithms that tolerate no failures, progress through benign failures, and then explore how to tolerate arbitrary failures.
Coordination and agreement also relates to group communication: the ability to multicast a message to a group is a very useful communication paradigm, with applications from locating resources to coordinating the updates to replicated data. We examine multicast reliability and ordering semantics, and give algorithms to achieve the variations.
Failure assumptions and failure detectors
Failure assumptions
The fundamental network components may suffer failures, but we mask these by using a reliable communication protocol – for example, by retransmitting missing or corrupted messages. For the sake of simplicity, we also assume that no process failure implies a threat to the other processes' ability to communicate: none of the processes depends upon another to forward messages.
An unreliable failure detector may produce one of two values for a given process: Unsuspected or Suspected. Either result may or may not accurately reflect whether the process has actually failed.
A result of Unsuspected signifies that the detector has recently received evidence suggesting that the process has not failed – for example, a message was recently received from it. But of course, the process may have failed since then.
Failure assumptions and failure detectors
A result of Suspected signifies that the failure detector has some indication that
the process may have failed. For example, it may be that no message from the
process has been received for more than a nominal maximum length of silence
(even in an asynchronous system, practical upper bounds can be used as hints).
A reliable failure detector is one that is always accurate in detecting
a process’s failure. A result of Failed means that the detector has
determined that the process has crashed.
Thus, a failure detector may sometimes give different responses to
different processes, since communication conditions vary from
process to process.
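A minimal timeout-based unreliable failure detector along these lines can be sketched as follows, assuming each process sends a heartbeat every D seconds; D and max_delay are parameters of the sketch, not prescribed values.

```python
class UnsuspectedSuspectedDetector:
    """Sketch of an unreliable failure detector: each process is expected
    to send a heartbeat every D seconds. If none has arrived within
    D plus the assumed maximum transmission delay, the detector answers
    'Suspected', otherwise 'Unsuspected'. The bound is only a hint, so
    either answer may be wrong (e.g., a slow network looks like a crash)."""

    def __init__(self, D, max_delay):
        self.timeout = D + max_delay
        self.last_heard = {}           # pid -> time of last heartbeat

    def heartbeat(self, pid, now):
        self.last_heard[pid] = now

    def query(self, pid, now):
        last = self.last_heard.get(pid, float("-inf"))
        return "Unsuspected" if now - last <= self.timeout else "Suspected"
```

Note how this matches the definitions above: "Unsuspected" only records recent evidence of liveness, and a "Suspected" process may in fact be alive but slow.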
Distributed mutual exclusion
Performance of Mutual Exclusion algorithms
The performance of mutual exclusion algorithms is evaluated according to the following criteria:
Bandwidth consumption, which is proportional to the number of messages sent in each entry and exit operation.
The client delay incurred by a process at each entry and exit operation.
The throughput of the system: the rate at which the collection of processes as a whole can access the critical section. We measure this using the synchronization delay between one process exiting the critical section and the next process entering it; the shorter the delay, the greater the throughput.
The central server algorithm
The simplest way to achieve mutual exclusion is to employ a server that grants permission to enter the critical section.
A process sends a request message to the server and awaits a reply from it.
The reply constitutes a token signifying permission to enter the critical section.
If no other process holds the token at the time of the request, the server replies immediately with the token.
If the token is currently held by another process, the server does not reply but queues the request.
On exiting the critical section, the client sends a message back to the server, returning the token.
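The server side of the central-server algorithm can be sketched as a token holder plus a FIFO queue; message passing is elided, so request and release are plain method calls in this sketch.

```python
from collections import deque

class TokenServer:
    """Sketch of the central-server mutual exclusion algorithm: the server
    holds a single token; a request is granted immediately if the token is
    free, otherwise it is queued and granted when the holder releases."""

    def __init__(self):
        self.holder = None
        self.queue = deque()
        self.granted = []          # log of grants, for illustration only

    def request(self, pid):
        if self.holder is None:
            self.holder = pid              # reply immediately with the token
            self.granted.append(pid)
        else:
            self.queue.append(pid)         # no reply yet: queue the request

    def release(self, pid):
        assert self.holder == pid, "only the holder may release"
        self.holder = self.queue.popleft() if self.queue else None
        if self.holder is not None:
            self.granted.append(self.holder)   # pass token to next in queue
```

Safety holds because at most one process is ever recorded as holder; the FIFO queue gives liveness but, as the next slide notes, not happened-before ordering (ME3).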
The central server algorithm
• ME1 (safety) and ME2 (liveness) are satisfied, but not ME3 (ordering).
• Bandwidth: entering takes two messages (a request followed by a grant), delayed by the round-trip time; exiting takes one release message, and does not delay the exiting process.
• Throughput is measured by the synchronization delay: the round trip of a complete cycle.
Ring-based Algorithm
The simplest way to arrange mutual exclusion between N processes without requiring an additional process is to arrange them in a logical ring.
Each process p_i has a communication channel to the next process in the ring, p_((i+1) mod N).
The unique token is a message passed from process to process in a single direction around the ring (e.g., clockwise).
If a process does not require entry to the critical section when it receives the token, it immediately forwards the token to its neighbour.
A process that requires the token waits until it receives it, and then retains it.
To exit the critical section, the process sends the token on to its neighbour.
Ring-based Algorithm
(Figure: processes p1, p2, …, pN arranged in a ring, passing the token.)
• ME1 (safety) and ME2 (liveness) are satisfied, but not ME3 (ordering).
• Bandwidth: the token continuously consumes bandwidth except when a process is inside the critical section. Exit requires only one message.
• Delay: the entry delay experienced by a process ranges from zero messages (the token has just arrived) up to N message transmissions (the token has just departed).

Maekawa's voting algorithm
There is at least one common member of any two voting sets; to be fair, all voting sets are the same size K, and each process belongs to M voting sets.
The optimal solution that minimizes K is K ≈ sqrt(N), with M = K.
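One well-known way to build voting sets with the required intersection property is a grid construction: arrange N = k × k processes in a k × k grid and let each process's voting set be its row plus its column, giving |V_i| = 2k − 1, close to the optimal K ≈ sqrt(N). A sketch, assuming N is a perfect square:

```python
import math

def grid_voting_sets(n):
    """Grid construction for Maekawa-style voting sets: process p sits at
    (row, col) in a k x k grid; its voting set is its whole row plus its
    whole column. Any two voting sets intersect, because the row of one
    always crosses the column of the other."""
    k = int(math.isqrt(n))
    assert k * k == n, "sketch assumes n is a perfect square"
    sets = []
    for p in range(n):
        r, c = divmod(p, k)
        row = {r * k + j for j in range(k)}   # all processes in p's row
        col = {i * k + c for i in range(k)}   # all processes in p's column
        sets.append(row | col)
    return sets
```

The intersection guarantee is what gives safety: two processes cannot both collect votes from their entire voting sets at once, since some process belongs to both sets.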
Fault tolerance
The main points to consider when evaluating the above algorithms with respect to fault tolerance are:
What happens when messages are lost?
What happens when a process crashes?
None of the algorithms that we have described would tolerate the loss of messages if the channels were unreliable.
The ring-based algorithm cannot tolerate any single process crash failure.
Maekawa's algorithm can tolerate some process crash failures: a crashed process causes no harm if it is not in any required voting set.
The central server algorithm can tolerate the crash failure of a client process that neither holds nor has requested the token.
The Ricart and Agrawala algorithm, as we have described it, can be adapted to tolerate the crash failure of such a process by having it implicitly grant all requests.
Elections
An algorithm that chooses a unique process to play a particular role is called an election algorithm. For example, with a central server for mutual exclusion, one process must be elected as the server, and everybody must agree. If the server wishes to retire, another election is required to choose a replacement.
Requirements:
E1 (safety): a participant process p_i has elected_i = ⊥ (undefined) or elected_i = P, where P is chosen as the non-crashed process at the end of the run with the largest identifier.
E2 (liveness): all processes p_i participate and eventually either set elected_i ≠ ⊥ or crash.
Elections
▪ A ring based election algorithm:
▪ All processes arranged in a logical ring.
▪ Each process has a communication channel to the next process.
▪ All messages are sent clockwise around the ring.
▪ Assume that no failures occur, and system is asynchronous.
▪ Goal is to elect a single process coordinator which has the largest identifier.
(Figure: a ring of processes with identifiers 3, 17, 24, 15 and 28; an election message carrying 24, the highest identifier encountered so far, is in transit.)
A ring based election algorithm
1. Initially, every process is marked as a non-participant. Any process can begin an election.
2. The starting process marks itself as a participant and places its identifier in a message to its neighbour.
3. A process that receives the message compares the identifier in it with its own. If the arrived identifier is larger, it passes the message on.
4. If the arrived identifier is smaller and the receiver is not a participant, it substitutes its own identifier in the message and forwards it. It does not forward the message if it is already a participant.
5. In either forwarding case, the process marks itself as a participant.
6. If the received identifier is that of the receiver itself, then this process's identifier must be the greatest, and it becomes the coordinator.
7. The coordinator marks itself as a non-participant, sets elected_i, and sends an elected message to its neighbour enclosing its identifier.
8. When a process receives an elected message, it marks itself as a non-participant, sets its variable elected_i, and forwards the message.
9. E1 is met: all identifiers are compared, since a process must receive its own identifier back before sending an elected message.
10. E2 is also met, due to the guaranteed traversals of the ring.
11. Tolerating no failures makes the ring algorithm of limited practical use.
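The election phase (steps 1–6 above) can be simulated directly. This sketch returns the elected identifier and counts election messages only, omitting the elected-message round and the participant bookkeeping that suppresses duplicate forwarding when several processes start elections.

```python
def ring_election(ids, starter):
    """Simulate a single ring-based election. ids[i] is the identifier of
    the process at ring position i; messages travel clockwise (increasing
    position, wrapping around). Returns (elected id, election messages)."""
    n = len(ids)
    messages = 0
    pos = starter
    current = ids[starter]          # starter puts its own id in the message
    while True:
        pos = (pos + 1) % n         # forward to the clockwise neighbour
        messages += 1
        if ids[pos] == current:
            return current, messages   # own id came back: coordinator found
        if ids[pos] > current:
            current = ids[pos]         # substitute the larger identifier
```

Running it on the figure's ring (3, 17, 24, 15, 28) with process 17 as starter elects 28, matching the rule that the largest identifier on the ring wins.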
A ring based election algorithm
The election was started by process 17.
The highest process identifier encountered so far is 24.
Participant processes are shown darkened in the figure.
If only a single process starts an election, the worst-performing case is when its anti-clockwise neighbour has the highest identifier: N-1 messages are used to reach this neighbour, a further N messages are required before that neighbour's identifier returns to it, and the elected message is then sent N times, making 3N-1 messages in all.
The turnaround time is also 3N-1 sequential message transmission times.
The bully algorithm
The bully algorithm allows processes to crash during an election, although it assumes that message delivery between processes is reliable.
It assumes the system is synchronous, using timeouts to detect process failure.
It also assumes that each process knows which processes have higher identifiers and that it can communicate with all such processes.
Three types of messages:
Election: sent to announce an election. A process begins an election when it notices, through timeouts, that the coordinator has failed. The timeout is T = 2T_trans + T_process from the time of sending.
Answer: sent in response to an election message.
Coordinator: sent to announce the identity of the elected process.
The bully algorithm
The bully algorithm
A process begins an election by sending an election message to those processes that have a higher identifier and awaiting answer messages in response.
If none arrives within time T, the process considers itself the coordinator and sends a coordinator message to all processes with lower identifiers. Otherwise, it waits a further time T' for a coordinator message to arrive; if none does, it begins another election.
If a process receives a coordinator message, it sets its variable elected_i to the coordinator's identifier.
If a process receives an election message, it sends back an answer message and begins another election, unless it has begun one already.
E1 may be broken if a timeout is inaccurate or if a crashed process is replaced. (Suppose p3 crashes and is replaced by another process: p2 may set p3 as coordinator while p1 sets p2 as coordinator.)
E2 is clearly met, by the assumption of reliable transmission.
In the best case, the process with the second-highest identifier notices the coordinator's failure; it can immediately elect itself and send N-2 coordinator messages.
The bully algorithm requires O(N^2) messages in the worst case, that is, when the process with the lowest identifier first detects the coordinator's failure: N-1 processes altogether then begin elections, each sending messages to the processes with higher identifiers.
Multicast communication
A multicast operation is an operation that sends a single message from one process to each of the members of a group of processes. In this way, the membership of the group is transparent to the sender.
Multicast messages provide a useful infrastructure for constructing distributed systems with the
following characteristics:
▪ Fault tolerance based on replicated services: A replicated service consists of a group of servers. Client requests are
multicast to all the members of the group, each of which performs an identical operation. Even when some of the
members fail, clients can still be served.
▪ Discovering services in spontaneous networking: It defines service discovery in the context of spontaneous
networking. Multicast messages can be used by servers and clients to locate available discovery services in order
to register their interfaces or to look up the interfaces of other services in the distributed system.
▪ Better performance through replicated data: Data are replicated to increase the performance of a service – in some
cases replicas of the data are placed in users’ computers. Each time the data changes, the new value are multicast
to the processes managing the replicas.
▪ Propagation of event notifications: Multicast to a group may be used to notify processes when something happens. For example, in Facebook, when someone changes their status, all their friends receive notifications. Similarly, publish-subscribe protocols may make use of group multicast to disseminate events to subscribers.
IP multicast – An implementation of multicast communication
▪ Local Network Control Block (224.0.0.0 to 224.0.0.255), for multicast traffic within a given local network.
▪ Internet Control Block (224.0.1.0 to 224.0.1.255).
▪ Ad Hoc Control Block (224.0.2.0 to 224.0.255.0), for traffic that does not fit any other block.
▪ Administratively Scoped Block (239.0.0.0 to 239.255.255.255), which is used to implement a scoping mechanism for multicast traffic (to constrain propagation).
Multicast addresses may be permanent or temporary. Permanent groups exist even when there are no members: their addresses are assigned by IANA and span the various blocks mentioned above.
IP multicast – An implementation of multicast communication
Failure model for multicast datagrams: Datagrams multicast over IP multicast have the same failure characteristics as UDP datagrams – that is, they suffer from omission failures. The effect on a multicast is that messages are not guaranteed to be delivered to any particular group member in the face of even a single omission failure. This can be called unreliable multicast, because it does not guarantee that a message will be delivered to any member of a group.
Java API to IP multicast: The Java API provides a datagram interface to IP multicast through the class MulticastSocket, which is a subclass of DatagramSocket with the additional capability of being able to join multicast groups. The class MulticastSocket provides two alternative constructors, allowing sockets to be created to use either a specified local port or any free local port.
A process can join a multicast group with a given multicast address by invoking the joinGroup method of its multicast socket. Effectively, the socket joins the multicast group at a given port, and it will receive datagrams sent by processes on other computers to that group at that port. A process can leave a specified group by invoking the leaveGroup method of its multicast socket.
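The same group join that MulticastSocket.joinGroup performs can be sketched with the Python standard library. The group address below is an arbitrary example from the administratively scoped block, and the port is likewise illustrative.

```python
import socket
import struct

def make_multicast_receiver(group="239.1.2.3", port=9999):
    """Sketch of joining an IP multicast group with a plain UDP socket.
    The kernel-level join (IP_ADD_MEMBERSHIP) is what Java's joinGroup
    performs under the hood."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))       # receive datagrams sent to this port
    # IP_ADD_MEMBERSHIP takes the group address and the local interface
    # address, packed as two 4-byte addresses ("4s4s").
    mreq = struct.pack("4s4s",
                       socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock
```

After joining, the socket receives any datagram other processes send to that group and port; leaving the group is the symmetric IP_DROP_MEMBERSHIP option (leaveGroup in the Java API).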