PDS Merged
By
Tarun Biswas
Assistant Professor
1. Chapter 1
• What is a Distributed System: Distributed computing is decentralised and parallel computing, using two or more computational units communicating over a network to accomplish a common objective by sharing their resources. The types of hardware, infrastructure, interfaces, programming languages, operating systems and other resources may be dynamic in nature.
A collection of computers that do not share common memory or a common physical clock, that communicate by message passing over a communication network, and where each computer has its own memory and runs its own operating system. Typically the computers are semi-autonomous and are loosely coupled while they cooperate to address a problem collectively.
Example: Assume there is a set of n tasks denoted as T = {t1, t2, t3, ..., tn} and a set of m processors denoted as P = {p1, p2, p3, ..., pm}. Here, all the tasks need to be completed by maintaining cooperativeness among the different processors.
Features: i) No common physical clock ii) No shared memory iii) Geographical separation iv) Autonomy and heterogeneity
Kinds of DS:
• Distributed Computing Systems: i) Cluster Computing Systems ii) Grid Computing Systems - a high degree of heterogeneity: no assumptions are made concerning hardware, operating systems, networks, administrative domains, security policies, etc.
• Distributed Pervasive Systems: As its name suggests, a distributed pervasive system is part of our surroundings (and as such, is generally inherently distributed). An important feature is the general lack of human administrative control. At best, devices can be configured by their owners, but otherwise they need to automatically discover their environment and "nestle in" as best as possible. i) Home Systems ii) Sensor Networks
Motivation:
• Resource sharing.
• Scalability.
Goals of DS:
• Distribution Transparency: Access, Location, Migration, Relocation, Replication, Concurrency and Failure.
• Openness
• Scalability: scalable with respect to its size; a geographically scalable system is one in which the users and resources may lie far apart; administratively scalable, meaning that it can still be easy to manage even if it spans many independent administrative organizations.
• Enhanced Reliability: availability, i.e., the resource should be accessible at all times; integrity, i.e., the value/state of the resource should be correct, in the face of concurrent access from multiple processors, as per the semantics expected by the application; fault-tolerance, i.e., the ability to recover from system failures, where such failures may be defined to occur in one of many failure models.
• Pitfalls
Parallel computing
• A multiprocessor system is a parallel system in which the multiple processors have direct access to shared
memory which forms a common address space. Such processors usually do not have a common clock.
– A multiprocessor system usually corresponds to a uniform memory access (UMA) architecture in which
the access latency, i.e., waiting time, to complete an access to any memory location from any processor
is the same.
• A multicomputer parallel system is a parallel system in which the multiple processors do not have direct
access to shared memory. The memory of the multiple processors may or may not form a common address
space. Such computers usually do not have a common clock.
– A multicomputer system that has a common address space usually corresponds to a non-uniform memory access (NUMA) architecture in which the latency to access various shared memory locations from the different processors varies.
• Array processors belong to a class of parallel computers that are physically co-located, are very tightly
coupled, and have a common system clock (but may not share memory and communicate by passing data
using messages).
• The corresponding computational model is characterized by multiple processors and associated mechanisms of cooperation.
• The algorithm has to exploit in an efficient way the parallelism that can be made explicit, with the goal of making the execution faster.
• Processors work on their own parts of the whole problem, but cooperate for solving shared portions of the problem. The final goal is:
· to balance the load among processors
· to reduce overheads due to parallelization
· to reduce idle times
· to diminish as much as possible the communications/synchronizations needed for processor cooperation
• The degree of coupling among a set of modules, whether hardware or software, is measured in terms of
the interdependency and binding and/or homogeneity among the modules. When the degree of coupling
is high (low), the modules are said to be tightly (loosely) coupled. Ex. SIMD and MISD.
– Network operating system: the operating system running on loosely coupled processors (i.e., heterogeneous and/or geographically distant processors), which are themselves running loosely coupled software (i.e., software that is heterogeneous).
– The operating system running on loosely coupled processors, which are running tightly coupled software (i.e., the middleware software on the processors is homogeneous), is classified as a distributed operating system.
– The operating system running on tightly coupled processors, which are themselves running tightly
coupled software, is classified as a multiprocessor operating system.
1.1 Comparison Metrics
• Concurrency of a program- This is a broader term that means roughly the same as parallelism of a
program, but is used in the context of distributed programs. The parallelism/concurrency in a paral-
lel/distributed program can be measured by the ratio of the number of local (non-communication and
non-shared memory access) operations to the total number of operations, including the communication
or shared memory access operations.
• Granularity of a program- The ratio of the amount of computation to the amount of communication within
the parallel/distributed program is termed as granularity.
2. Load Balance: load balancing is a process where workloads are distributed in such a way that system performance improves in terms of optimized resource use, maximized throughput, minimized response time, and avoided overload. Load balance is computed as follows:

LB = \sqrt{\frac{\sum_{i=1}^{m} (\mu - Ex(p_i))^2}{m}} \qquad (2)

where \mu = \frac{\sum_{i=1}^{m} Ex(p_i)}{m} is the average execution time.
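As an illustration, here is a minimal Python sketch of equation (2); the list ex of per-processor execution times Ex(p_i) is a hypothetical input.

import math

def load_balance(ex):
    # Load balance (eq. 2): standard deviation of per-processor
    # execution times around the mean; lower means better balance.
    m = len(ex)
    mu = sum(ex) / m                                   # average execution time
    return math.sqrt(sum((mu - x) ** 2 for x in ex) / m)

print(load_balance([10.0, 12.0, 11.0, 9.0]))           # ~1.118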
3. Processor Utilization: It defines the ratio between the average amount of time in which the processors are busy and the overall system schedule time. It is calculated as follows:

UT = \frac{1}{Mks} \times \frac{\sum_{i=1}^{m} Ex(p_i)}{m} \times 100 \qquad (3)
4. Speedup Factor: The speedup factor can be defined as the ratio between the sequential execution time and the makespan (parallel schedule length), as given below:

SP = \frac{\sum_{j=1}^{n} ET(t_j, p_k)}{Mks} \qquad (4)

Ef = \frac{SP}{\text{total number of processors}} \qquad (5)
Another way to find the speedup on p processors: let ξ be the fraction of the problem that is sequential, so a (1 − ξ) fraction of the problem is parallel. If T_1 is the time on one processor, the best parallel time T_p is

T_p = T_1 \times \left( \xi + \frac{1-\xi}{p} \right) \qquad (6)

SP = \frac{1}{\xi + \frac{1-\xi}{p}} \qquad (7)
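A small sketch of equation (7), assuming xi is the sequential fraction; it shows how the speedup saturates at 1/xi no matter how many processors are added.

def amdahl_speedup(xi, p):
    # Speedup from eq. (7): xi = sequential fraction, p = processors.
    return 1.0 / (xi + (1.0 - xi) / p)

for p in (2, 8, 1024):
    print(p, round(amdahl_speedup(0.1, p), 2))   # bounded above by 1/0.1 = 10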
[Figure: (a) a uniform memory access (UMA) multiprocessor system; (b) a non-uniform memory access (NUMA) multiprocessor.]
2. Pipeline
2.1. Models
A distributed system consists of a set of processors that are connected by a communication network. The
communication network provides the facility of information exchange among processors. The communication
delay is finite but unpredictable. The processors do not share a common global memory and communicate solely
by passing messages over the communication network. There is no physical global clock in the system to which
processes have instantaneous access. The communication medium may deliver messages out of order, messages
may be lost, garbled, or duplicated due to timeout and retransmission, processors may fail, and communication
links may go down. The system can be modeled as a directed graph in which vertices represent the processes
and edges represent unidirectional communication channels.
2.2. Single Node Perspective
• If a and b are events in the same process, where a occurs before b, then a → b.
• If a is the event of sending a message by a process p_i and b is the event of receiving the same message by p_j (i ≠ j), then a → b.
• ∀a ∈ S, a ↛ a (the relation is irreflexive).
A system of logical clocks consists of a time domain T and a logical clock C. The logical clock C is a function that maps an event e in a distributed system to an element in the time domain T, denoted as C(e) and called the timestamp of e, defined as C : H → T such that for two events e_i and e_j, e_i → e_j ⇒ C(e_i) < C(e_j). If, in addition, C(e_i) < C(e_j) ⇒ e_i → e_j, the system of clocks is said to be strongly consistent.
R1: Before executing an event (send, receive, or internal), process p_i executes the following: C_i := C_i + d (d > 0). d can have a different value each time, and this value may be application-dependent. However, typically d is kept at 1 because this is able to identify the time of each event uniquely at a process.
R2: When a process p_i receives a message with timestamp C_msg, it executes the following actions:
1. C_i := max(C_i, C_msg);
2. execute R1;
3. deliver the message.
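The rules R1 and R2 can be sketched directly; the class below is a minimal illustration with d = 1 and hypothetical send/receive helpers, not a full message-passing implementation.

class LamportClock:
    def __init__(self):
        self.c = 0

    def tick(self):                # R1: C_i := C_i + d, with d = 1
        self.c += 1
        return self.c

    def send(self):                # timestamp to piggyback on an outgoing message
        return self.tick()

    def receive(self, c_msg):      # R2: C_i := max(C_i, C_msg), then execute R1
        self.c = max(self.c, c_msg)
        return self.tick()

p1, p2 = LamportClock(), LamportClock()
t = p1.send()                      # p1's send event gets timestamp 1
print(p2.receive(t))               # p2 delivers the message at timestamp 2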
2.4. Leader Election
• Bully: for a fully connected network, by Garcia-Molina (1982).
– P sends an ELECTION message to all processes with higher numbers.
– If no one responds, P wins the election and becomes coordinator.
– If one of the higher-ups answers, it takes over. P's job is done.
Principles and characteristics of
distributed systems and
environments
Definition of a distributed system
A distributed system is a collection of independent computers that appears to its users as a single coherent system.
Distributed computing is decentralised and parallel
computing, using two or more computers communicating
over a network to accomplish a common objective or
task. The types of hardware, programming languages,
operating systems and other resources may vary
drastically.
Leslie Lamport: A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
Definition cont.
Two aspects:
- Hardware: the machines are autonomous.
- Software: the users think they are dealing with a single system.
Important characteristics:
- Communication is hidden from users.
- Applications can interact in a uniform and consistent way.
- It is relatively easy to expand or scale (a DS will normally be continuously available).
A distributed system should be functionally equivalent to the systems of which it is composed.
Distributed system as Middleware service
Performance
Congestion of communication links
Only decentralized algorithms should be used (no
machine has complete information; decisions based on
local information; failure of one machine does not ruin
the algorithm; no implicit assumption of global clock)
Scaling techniques
Leader election
Mutual exclusion
Time synchronization
Global state
Replica management
RM-ODP
Multiprocessors
- Memory must be kept coherent. An overloaded bus → add cache memory between CPU and bus (hit rate).
- A bus is suitable for up to about 256 CPUs.
- Crossbar switch (crosspoint switch): n CPUs and n memories require n² crosspoint switches.
- Omega network: 2x2 switches, each with two inputs and two outputs; fast switching, but not cheap.
- NUMA (NonUniform Memory Access): fast access to local memory, slow access to others'; better average access times than omega networks.
Application layering
Distinction of 3 levels
The user-interface level
The processing level
The data level
Multitiered architectures
Peer-to-peer distribution
Model of Distributed Computing
Q: What is a distributed system?
- Bunch of nodes/processes
- Sending messages over a network
- To solve a common goal (algorithm)
Modelling a Node
A single node has a bunch of neighbours
- Can send & receive messages
- Can do local computations
Formally:
- States {X0, X1, X2, X1’}
- Transition function {X0 -> X1, X1 -> X2, X2 -> X1}
- Initial States {X0}
Formally, node i is modelled as a state machine with a set of states Qi. In an initial state, inbuf is empty.
Transition Functions
The state of a node, except outbuf, is called the accessible state of a
node.
Transition function f
• Takes the accessible state and gives a new state
• Removes at most one message from inbuf in the new state
• Adds zero, one, or more new msgs to the outbuf of the new state
• Possibly modifies local state
Transition Functions Formally
The state of a node is a triple <s, I, O>:
- s is the local state
- I is inbuf
- O is outbuf
- f is the state transition function: f(s_i, I_i, O_i) -> (s_{i+1}, I_{i+1}, O_{i+1})
- f removes at most one message m from inbuf: I_{i+1} = I_i \ {m}
- It computes a function f'(s_i, m) -> (s_{i+1}, {m1, ..., mn})
- It adds {m1, ..., mn} to outbuf: O_{i+1} = O_i ∪ {m1, ..., mn}
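A minimal sketch of this formal model: the node below holds the triple <s, I, O>, and its transition removes at most one message from inbuf and appends new messages to outbuf. The compute method is a hypothetical application-specific f'.

from collections import deque

class Node:
    # A node as a state machine <s, I, O> (illustrative sketch only).
    def __init__(self, s0):
        self.s = s0                # local state s
        self.inbuf = deque()       # I
        self.outbuf = []           # O

    def transition(self):
        # f: remove at most one message from inbuf, apply f'(s, m),
        # and add zero or more new messages to outbuf.
        m = self.inbuf.popleft() if self.inbuf else None
        self.s, new_msgs = self.compute(self.s, m)
        self.outbuf.extend(new_msgs)

    def compute(self, s, m):       # hypothetical f': bump state, emit an ack
        return (s + 1 if m is not None else s, [("ack", s)])

n = Node(0)
n.inbuf.append("hello")
n.transition()
print(n.s, n.outbuf)               # 1 [('ack', 0)]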
Execution
An execution is an infinite sequence of
- config0, event1, config1, event2, config2 …
- config0 is an initial configuration
If event_k is comp(i):
- config_{k-1} changes to config_k by applying p_i's transition function to i's state in config_{k-1}.
If event_k is del(i, j, m):
- config_{k-1} changes to config_k by moving m from i's outbuf to j's inbuf.
Example Execution
[Figure: event1 = del(2,3,m7) moves m7 out of P2's outbuf (outbuf changes from {m7} to {}); event2 = comp(3) is a local computation that changes x from 4 to 1.]
The send and the receive events signify the flow of information between
processes and establish causal dependency from the sender process to the
receiver process.
A relation →msg, which captures the causal dependency due to message exchange, is defined as follows. For every message m that is exchanged between two processes, we have
send(m) →msg rec(m).
Relation →msg defines causal dependencies between the pairs of
corresponding send and receive events.
Concurrent events
For any two events e_i and e_j, if e_i ↛ e_j and e_j ↛ e_i, then events e_i and e_j are said to be concurrent (denoted as e_i ∥ e_j).
In the execution of Figure 2.1, e_1^3 ∥ e_3^3 and e_2^4 ∥ e_3^1.
The relation ∥ is not transitive; that is, (e_i ∥ e_j) ∧ (e_j ∥ e_k) ⇏ e_i ∥ e_k.
For example, in Figure 2.1, e_3^3 ∥ e_2^4 and e_2^4 ∥ e_1^5; however, e_3^3 ∦ e_1^5.
For any two events e_i and e_j in a distributed execution, e_i → e_j, or e_j → e_i, or e_i ∥ e_j.
Notations
LS_i^x denotes the state of process p_i after the occurrence of event e_i^x and before the event e_i^{x+1}.
LS_i^0 denotes the initial state of process p_i.
LS_i^x is the result of the execution of all the events executed by process p_i till e_i^x.
Let send(m) ≤ LS_i^x denote the fact that ∃y : 1 ≤ y ≤ x :: e_i^y = send(m).
Let rec(m) ≰ LS_i^x denote the fact that ∀y : 1 ≤ y ≤ x :: e_i^y ≠ rec(m).
A Channel State
The state of a channel depends upon the states of the processes it connects. Let SC_{ij}^{x,y} denote the state of a channel C_ij. The state of a channel is defined as follows:
SC_{ij}^{x,y} = {m_ij | send(m_ij) ≤ LS_i^x ∧ rec(m_ij) ≰ LS_j^y}
Thus, channel state SC_{ij}^{x,y} denotes all messages that p_i sent up to event e_i^x and which process p_j had not received until event e_j^y.
Global State
The global state of a distributed system is a collection of the local states of
the processes and the channels.
Notationally, global state GS is defined as
GS = { ∪_i LS_i^{x_i} , ∪_{j,k} SC_{jk}^{y_j, z_k} }
For a global state to be meaningful, the states of all the components of the
distributed system must be recorded at the same instant.
This will be possible if the local clocks at processes were perfectly
synchronized or if there were a global system clock that can be
instantaneously read by the processes. (However, both are impossible.)
[Figure 2.2: a space-time diagram of processes p1–p4, including events e_4^1 and e_4^2 on p4.]
In Figure 2.2:
A global state GS1 = {LS_1^1, LS_2^3, LS_3^3, LS_4^2} is inconsistent because the state of p2 has recorded the receipt of message m12, but the state of p1 has not recorded its send.
A global state GS2 consisting of local states {LS_1^2, LS_2^4, LS_3^4, LS_4^2} is consistent; all the channels are empty except C21, which contains message m21.
[Figure 2.3: a space-time diagram of processes p1–p4 with two cuts, C1 and C2.]
In a consistent cut, every message received in the PAST of the cut was sent
in the PAST of that cut. (In Figure 2.3, cut C2 is a consistent cut.)
All messages that cross the cut from the PAST to the FUTURE are in transit
in the corresponding consistent global state.
A cut is inconsistent if a message crosses the cut from the FUTURE to the
PAST. (In Figure 2.3, cut C1 is an inconsistent cut.)
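The PAST rule can be stated as a predicate. The sketch below is an assumed representation: events are opaque labels and matches maps each receive event to its matching send event.

def is_consistent_cut(cut_events, matches):
    # A cut is consistent iff every receive in the PAST of the cut
    # has its matching send in the PAST as well (no FUTURE -> PAST message).
    return all(send in cut_events
               for recv, send in matches.items()
               if recv in cut_events)

matches = {"r1": "s1"}                            # s1 -> r1 crosses processes
print(is_consistent_cut({"s1", "r1"}, matches))   # True: consistent
print(is_consistent_cut({"r1"}, matches))         # False: receive without its send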
[Figure 2.4: the past cone PAST(e_j) and the future cone FUTURE(e_j) of an event e_j.]
Let Past_i(e_j) be the set of all those events of Past(e_j) that are on process p_i.
Past_i(e_j) is a totally ordered set, ordered by the relation →_i, whose maximal element is denoted by max(Past_i(e_j)).
max(Past_i(e_j)) is the latest event at process p_i that affected event e_j (Figure 2.4).
Let Max_Past(e_j) = ∪_{(∀i)} {max(Past_i(e_j))}.
Max_Past(e_j) consists of the latest event at every process that affected event e_j and is referred to as the surface of the past cone of e_j.
Past(e_j) represents all events on the past light cone that affect e_j.
Future Cone of an Event
The future of an event ej , denoted by Future(ej ), contains all events ei that
are causally affected by ej (see Figure 2.4).
In a computation (H, →), Future(ej ) is defined as:
Future(ej ) = {ei |∀ei ∈ H, ej → ei }.
Define Future_i(e_j) as the set of those events of Future(e_j) that are on process p_i.
Define min(Future_i(e_j)) as the first event on process p_i that is affected by e_j.
Define Min_Future(e_j) as ∪_{(∀i)} {min(Future_i(e_j))}, which consists of the first event at every process that is causally affected by event e_j.
Min_Future(e_j) is referred to as the surface of the future cone of e_j.
All events at a process pi that occurred after max(Pasti (ej )) but before
min(Futurei (ej )) are concurrent with ej .
Therefore, all and only those events of computation H that belong to the set
“H − Past(ej ) − Future(ej )” are concurrent with event ej .
3. Fault Models
• Resource managers can in general be modelled as processes. If the system is designed according to an object-oriented methodology, resources are encapsulated in objects.
• Peer-to-peer
☞ Peer-to-Peer tries to solve some of the above issues. It distributes shared resources widely.
Related variations of the client-server model:
- multiple servers and caches
- mobile code and mobile agents
- low-cost computers at the users' side
- mobile devices
• Proxy servers are typically used as caches for web resources. They maintain a cache of recently visited web pages or other resources. When a request is issued by a client, the proxy server is first checked to see if the requested object (information item) is available there.
[Figure: mobile code — step 2: the client interacts with the downloaded applet (client, applet, server).]
☞ Mobile agent: a running program that travels from one computer to another carrying out a task on someone's behalf. Attention: a potential security risk (like mobile code)!
Network computers
• The network computer can be simpler, with limited capacity; it does not even need a local hard disk (if one exists, it is used to cache data or code).
• Users can log in from any computer.
• No user effort for software management/administration.
• Synchronous distributed systems: drift rates between local clocks have a known bound.
• Asynchronous distributed systems:
- No bound on process execution time (nothing can be assumed about speed, load, reliability of computers).
- No bound on message transmission delays (nothing can be assumed about speed, load, reliability of interconnections).
- No bounds on drift rates between local clocks.
☞ Asynchronous systems are widely and successfully used in practice. In practice, timeouts are used with asynchronous systems for failure detection. However, additional measures have to be applied in order to avoid duplicated messages, duplicated execution of operations, etc.
What kind of faults can occur and what are their effects? Intended processing steps or communications are omitted and/or unintended ones are executed. Results may not come at all, or may come but carry wrong values.
• Architectural models define the way responsibilities are distributed among components and how they are placed in the system. We have studied three architectural models:
1. Client-server model
2. Peer-to-peer
3. Several variations of the two
Timing Faults
☞ Timing faults can occur in synchronous distributed systems, where time limits are set on process execution, communications, and clock drifts. A timing fault occurs if any of these time limits is exceeded.
• Interaction models deal with how time is handled throughout the system. Two interaction models have been introduced:
1. Synchronous distributed systems
2. Asynchronous distributed systems
• The fault model specifies what kind of faults can occur and what their effects are. Fault models:
1. Omission faults
2. Arbitrary faults
3. Timing faults
[Figure: middleware layering — applications use RMI/RPC, which is built on Request&Reply, which in turn runs on the operating system and network protocol.]
The solution:
- Asking for a service is solved by the client issuing a simple method invocation or procedure call; because the server can be on a remote machine, this is a remote invocation (call).
- RMI (RPC) is transparent: the calling object (procedure) is not aware that the called one is executing on a different machine, and vice versa.
The server:
------------------
receive(request) from client_reference;
execute requested operation;
send(reply) to client_reference;
------------------
Implementation of RMI
[Figure: RMI implementation — the client invokes a method on the proxy for B using a local reference; the remote reference module translates local references to remote references; the communication module marshals the arguments into a request sent over the network; on the server, the skeleton for B unmarshals the arguments, invokes the method { - - - }, marshals the results, and sends the reply back.]
- The proxy is the local representative of the remote object ⇒ the remote invocation from A to B is initially handled like a local one from A to the proxy for B.
- At invocation, the corresponding proxy method marshals the arguments and builds the message to be sent, as a request, to the server. After reception of the reply, the proxy unmarshals the received message and sends the results, in an answer, to the invoker.
- A part of the skeleton is also called the dispatcher. The dispatcher receives a request from the communication module, identifies the invoked method and directs the request to the corresponding method of the skeleton.
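To make the proxy/skeleton roles concrete, here is a minimal Python sketch: the proxy marshals a call into a request message, and the skeleton's dispatcher unmarshals it and invokes the method. The JSON marshalling and the direct function call standing in for the network are simplifying assumptions.

import json

class Proxy:
    # Local representative of the remote object: marshals the invocation.
    def __init__(self, send):
        self._send = send                      # stands in for the communication module

    def __getattr__(self, method):
        def invoke(*args):
            request = json.dumps({"method": method, "args": args})
            reply = self._send(request)        # request/reply over the "network"
            return json.loads(reply)["result"] # unmarshal the answer
        return invoke

class Skeleton:
    # Server side: the dispatcher directs the request to the right method.
    def __init__(self, obj):
        self.obj = obj

    def dispatch(self, request):
        msg = json.loads(request)              # unmarshal the arguments
        result = getattr(self.obj, msg["method"])(*msg["args"])
        return json.dumps({"result": result})  # marshal results into the reply

class Adder:                                   # hypothetical remote object B
    def add(self, a, b):
        return a + b

remote = Proxy(Skeleton(Adder()).dispatch)
print(remote.add(2, 3))                        # looks exactly like a local call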
Problem: what if the request was not truly lost (but, for example, the server is too slow) and the server receives it more than once?
Danger!
• Some operations can be executed more than once without any problem; they are called idempotent operations ⇒ no danger in executing the duplicate request.
• There are operations which cannot be executed repeatedly without changing the effect (e.g. transferring an amount of money between two accounts) ⇒ history can be used to avoid re-execution.
• We have to avoid that the server executes certain operations more than once.
• Messages have to be identified by an identifier and copies of the same message have to be filtered out:
- If the duplicate arrives and the server has not yet sent the reply ⇒ simply send the reply.
- If the duplicate arrives after the reply has been sent ⇒ the reply may have been lost or it didn't arrive in time (see next slide).
History: the history is a structure which stores a record of reply messages that have been transmitted, together with the message identifier and the client which it has been sent to.
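A minimal sketch of the history idea, under the assumption that each request carries a unique message identifier: duplicates are answered from the stored reply instead of being re-executed.

class Server:
    def __init__(self):
        self.history = {}                     # message id -> reply already sent

    def handle(self, msg_id, operation):
        if msg_id in self.history:
            return self.history[msg_id]       # duplicate: resend reply, do not re-execute
        reply = operation()                   # execute the (non-idempotent) operation
        self.history[msg_id] = reply
        return reply

server = Server()
balance = [100]
def transfer():                               # non-idempotent operation
    balance[0] -= 10
    return balance[0]

print(server.handle("req-1", transfer))       # 90
print(server.handle("req-1", transfer))       # 90 again: executed only once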
[Figure: a server may crash after executing the requested operation but before replying (case b), or before even executing it (case c); in both cases the client receives no reply.]
Big problem! The client cannot distinguish between cases b and c! However, they are very different and should be handled in a different way!
At least once semantics: when the client got an answer, the RMI has been carried out at least one time, but possibly more.
Alternative 2: at most once semantics
• The client's communication module gives up and immediately reports the failure to the client (e.g. by raising an exception).
- If the client got an answer, the RMI has been executed exactly once.
- If the client got a failure message, the RMI has been carried out at most one time, but possibly not at all.
What to do if the client noticed that the server is down (it didn't answer a certain large number of repeated requests)? Problems:
• wasting of CPU time
• locked resources (files, peripherals, etc.)
• if the client reboots and repeats the RMI, confusion can be created.
The solution is based on identification and killing of the orphans.
☞ If server crashes with loss of memory (case b on slide 44) are considered, only at least once and at most once semantics are achievable in the best case.
In practical applications, servers can survive crashes without loss of memory. In such cases history can be used and duplicates can be filtered out after restart of the server:
• the client repeats sending requests without the danger of operations being executed more than one time (this is different from alternative 2 on slide 46):
- If no answer is received after a certain number of tries, the client is notified and knows that the method has been executed at most one time or not at all.
- If an answer is received, it is forwarded to the client, who knows that the method has been executed exactly one time.
☞ And no hope of achieving exactly once semantics if servers crash?!
• With group communication, a message can be sent to multiple receivers in one operation, called multicast.
14.1 Overview
Many tasks in distributed systems require one of the processes to act as the coordinator. Election algorithms
are techniques for a distributed system of N processes to elect a coordinator (leader). An example of this is
the Berkeley algorithm for clock synchronization, in which the coordinator has to initiate the synchronization
and tell the processes their offsets. A coordinator can be chosen amongst all processes through leader election.
The bully algorithm is a simple algorithm, in which we enumerate all the processes running in the system
and pick the one with the highest ID as the coordinator. In this algorithm, each process has a unique ID
and every process knows the corresponding ID and IP address of every other process. A process initiates an
election if it just recovered from failure or if the coordinator failed. Any process in the system can initiate
this algorithm for leader election. Thus, we can have concurrent ongoing elections. There are three types of
messages for this algorithm: election, OK and I won. The algorithm is as follows: a process sends an election message to all processes with higher IDs and waits for OK responses; if none arrive, it declares itself coordinator with an I won message; if a higher process answers OK, that process takes over the election and the initiator drops out.
An example of the Bully algorithm is given in Figure 14.1. Communication is assumed to be reliable during leader election. If the communication is unreliable, it may happen that the elected coordinator goes down after being elected, or a higher-ID node comes up after the election process. In the former case, any node might start an election process after gauging that the coordinator isn't responding. In the latter case, the higher-ID process asks its neighbors who the coordinator is. It can then either accept the current coordinator as its own coordinator and continue, or it can start a new election (in which case it will probably be elected as the new coordinator). This algorithm runs in O(n^2) time in the worst case, when the lowest-ID process initiates the election. The name bully is given to the algorithm because the higher-ID processes are bullying the lower-ID processes to drop out of the election.
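Under the (strong) assumption of reliable, synchronous responses, one round of the bully election reduces to the sketch below: the recursion of higher processes taking over bottoms out at the highest alive ID.

def bully_election(initiator, alive):
    # Bully sketch: initiator challenges all higher IDs; any alive higher
    # process takes over, and the highest alive ID ends up coordinator.
    higher = [p for p in alive if p > initiator]
    if not higher:
        return initiator                      # no one higher answered: I won
    return max(higher)                        # recursion bottoms out here

print(bully_election(4, alive={0, 1, 2, 4, 5, 6}))   # 6 becomes coordinator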
Question: What happens if 7 has not crashed? Who will it send message to?
Answer: Suppose 7 had not crashed in the example; it would have sent a response when 4 started the election. 4 would have dropped out, the recursion would have continued, and 7 would finally have been elected as leader.
Question: Can 7 never initiate the election?
Answer : If 7 is already a leader there is no reason for it to initiate an election.
Question: When does a smaller ID process know it should start an election?
Answer : This is not particularly specified by this algorithm. Ideally this is done when it has not heard from
the coordinator in a while (timeout period).
Question: In the above example what happens if 7 is recovered?
Answer : Any process that is recovered will initiate an election. It will see 6 is the coordinator. In this case
7 will initiate an election and will win.
Question: In the example, how will 7 see 6 is the coordinator (How does a process know who the coordinator
is)?
Answer : Discovering who is the coordinator is not part of the algorithm. This should be implemented
separately (storing it somewhere, broadcasting the message).
Question: What happens when we have a highly dynamic system where processes regularly leave and join
(P2P system)?
Answer : The bully algorithm is not adequate for all kinds of scenarios. If you have a dynamic system, you
might want to take into account the more stable processes (or other metrics) and give them higher ids to
have them win elections.
The ring algorithm is similar to the bully algorithm in the sense that we assume the processes are already
ranked through some metric from 1 to n. However, here a process i only needs to know the IP addresses of
its two neighbors (i+1 and i-1). We want to select the node with the highest id. The algorithm works as
follows:
• Any node can start circulating the election message. Say process i does so. We can choose to go clockwise or counter-clockwise on the ring. Say we choose clockwise, where i+1 occurs after i.
• Process i then sends an election message to process i+1.
• Anytime a process j ≠ i receives an election message, it piggybacks its own ID (thus declaring that it is not down) before forwarding the election message to its successor (j+1).
• Once the message circulates through the ring and comes back to the initiator i, process i knows the list of all nodes that are alive. It simply scans the list and chooses the highest ID.
• It lets all other nodes know about the new coordinator. A sketch of this procedure appears below.
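The sketch below simulates one circulation of the election message, assuming dead nodes are simply skipped (i.e., each node can reach enough successors):

def ring_election(ring, initiator, alive):
    # Ring sketch: circulate once, piggybacking live IDs; highest wins.
    n = len(ring)
    start = ring.index(initiator)
    collected = []
    for k in range(n):
        node = ring[(start + k) % n]
        if node in alive:                 # a down neighbour is skipped
            collected.append(node)        # piggyback own ID
    return max(collected)                 # initiator scans the list

ring = [0, 1, 2, 3, 4, 5, 6, 7]
print(ring_election(ring, initiator=3, alive={0, 1, 2, 3, 5, 6}))  # 6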
An example of Ring algorithm is given in Figure 14.2. If the neighbor of a process is down, it sequentially
polls each successor (neighbor of neighbor) until it finds a live node. For example, in the figure, when 7 is
down, 6 passes the election message to 0. Another thing to note is that this requires us to enforce a logical
ring topology on the underlying application, i.e. we need to construct a ring topology on top of the whole
system just for leader election.
Question: Does every process need to know about the network topology?
Answer : Every process know the IDs and IP addresses of other processes (assumption). There is no real
topology, from the table we can find the neighbours.
Question: If we already know the IDs of all processes, why is there a need to go around the ring?
Answer: There can be different kinds of failures, e.g., a process may crash or the network may crash while partitioning the ring. We do not know how many processes have disconnected from the ring. We need to actually query and check what the issue is.
Question: How does 6 know that it has to send message to 0 when 7 is down?
Answer : Here we can assume that every node not only has information of its neighbors, but neighbors’
neighbors as well. In general, in a system where we expect a max of k nodes to go down during the time
leader election takes to execute, each node needs to know at least k successive neighbors in either direction.
This is still less than what each node needs to know in Bully algorithm.
The bully algorithm takes O(n^2) messages in the worst case (this occurs when the node with the lowest ID initiates the election) and O(n − 2) messages in the best case (this occurs when the alive node with the highest ID initiates the election).
The ring algorithm always takes 2(n−1) messages to execute. The first (n−1) are used during the election query and the second (n−1) to announce the results of the election. It is easy to extend the ring algorithm to other metrics like load, etc.
Question: How do you know a node is not responding?
Answer : If it has actually crashed then TCP will fail while setting up socket connection. Otherwise, it
can be a slow machine which is taking time. It is a classical problem in distributed systems to distinguish
between a slow process and a failed process which is a non-trivial problem. Timeout is not an ideal solution
but can be used in practice.
Every time we wish to access a shared data structure or critical section in a distributed system, we need to
guard it with a lock. A lock is acquired before the data structure is accessed, and once the transaction has
completed, the lock is released. Consider the example below:
In this example, there are two clients sending a buy request to the Online Store Server. The store implements
a thread-pool model. Initially the item count is 3. The correct item count should be 1 after two buy
operations. If locks are not implemented, there may be a chance of a race condition and the item count can be 2. This is because the decrement is not an atomic operation. Each thread needs to read, update, and write the item value. The second thread might read the value while the first thread is updating it (it will read 3), update it to 2, and save it, which is incorrect. This is an example of a trivial race condition.
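The race and its fix can be shown with Python threads; the explicit read-then-write split below mimics the non-atomic decrement (variable names are illustrative).

import threading

item_count = 3
lock = threading.Lock()

def buy():
    global item_count
    with lock:                        # guard the read-update-write sequence
        current = item_count          # read
        item_count = current - 1      # update and write back

threads = [threading.Thread(target=buy) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(item_count)                     # 1; without the lock, both threads
                                      # could read 3 and leave the count at 2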
In this case, locking and unlocking coordination are done by a master process. All processes are numbered
1 to n. We run leader election to pick the coordinator. Now, if any process in the system wants to acquire
a lock, it has to first send a lock acquire request to the coordinator. Once it sends this request, it blocks
execution and awaits reply until it acquires the lock. The coordinator maintains a queue for each data
structure of lock requests. Upon receiving such a request, if the queue is empty, it grants the lock and sends
the message, otherwise it adds the request to the queue. The requester process upon receiving the lock
executes the transaction, and then sends a release message to the coordinator. The coordinator upon receipt
of such a message removes the next request from the corresponding queue and grants that process the lock.
This algorithm is fair and simple.
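A sketch of the coordinator's bookkeeping, assuming one FIFO queue per data structure; message passing is abstracted into direct method calls.

from collections import deque

class Coordinator:
    def __init__(self):
        self.holder = {}                      # resource -> current lock holder
        self.queue = {}                       # resource -> queue of waiting processes

    def acquire(self, pid, res):
        q = self.queue.setdefault(res, deque())
        if res not in self.holder:
            self.holder[res] = pid            # queue empty: grant the lock
            return True
        q.append(pid)                         # otherwise the requester waits
        return False

    def release(self, pid, res):
        q = self.queue.get(res, deque())
        if q:
            self.holder[res] = q.popleft()    # grant to the next in queue
        else:
            del self.holder[res]

c = Coordinator()
print(c.acquire(1, "item_count"))             # True: granted
print(c.acquire(2, "item_count"))             # False: process 2 queued
c.release(1, "item_count")
print(c.holder["item_count"])                 # 2: lock passed on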
An example of the algorithm is given in Figure 14.4. There are two major issues with this algorithm, related
to failures. When coordinator process goes down while one of the processes is waiting on a response to a
lock request, it leads to inconsistency. The new coordinator that is elected (or reboots) might not know that
the earlier process is still waiting for a response. This issue can be tackled by maintaining persistent data
on disk whenever a queue of the coordinator is altered. Even if the process crashes, we can read the file and
persist the state of the locks on storage and recover the process.
The harder problem occurs when one of the client process crashes while it is holding the lock (during one
of its transactions). In such a case, coordinator is just waiting for the lock to be released while the other
process has gone down. We cannot use timeout in this case, because usually transactions take arbitrary
amount of time to go through. All other processes that are waiting on that lock are also blocked forever.
Even if the coordinator somehow knew that the client process has crashed, it may not always be advisable
to take the lock forcibly back because the client process may eventually reboot and think it has the lock and
continue its transaction. This causes inconsistency. This is a thorny problem which does not have any neat
solution. This limits the practicality of such a centralized algorithm.
Decentralized algorithms use voting to figure out which lock requests to grant. In this scenario, each pro-
cess has an extra thread called the coordinator thread which deals with all the incoming locking requests.
Essentially, every process keeps track of who has the lock, and for a process to acquire a new lock, it
has to be granted an OK or go ahead vote from the strict majority of the processes. Here, majority means
more than half the total number of nodes (live or not) in the system. Thus, if any process wishes to acquire
a lock, it requests it from all other processes and if the majority of them tell it to acquire the lock, it goes
ahead and does so. The majority guarantees that a lock is not granted twice. Upon the receipt of the vote,
the other processes are also told that a lock has been acquired and thus, the processes hold up any other
lock request. Once a process is done with the transaction, it broadcasts to every other process that it has
released the lock.
This solves the problem of coordinator failure because if some nodes go down, we can deal with it as long as the majority agrees on whether the lock is in use or not. Client crashes are still a problem here.
This algorithm, developed by Ricart and Agrawala, needs 2(n − 1) messages and is based on Lamport’s clock
and total ordering of events to decide on granting locks. After the clocks are synchronized, the process that
asked for the lock first gets it. The initiator sends request messages to all n − 1 processes stamped with its
ID and the timestamp of its request. It then waits for replies from all other processes.
Any other process, upon receiving such a request, either sends a reply if it does not want the lock for itself; or is already in the transaction phase (in which case it doesn't send any reply and the initiator has to wait); or itself wants to acquire the same lock, in which case it compares its own request timestamp with that of the incoming request. The one with the lower timestamp gets the lock first. The decision rule is sketched below.
This approach is fully decentralized but there are n points of failure, which is worse than the centralized
one.
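The per-request decision can be sketched as a pure function; the RELEASED/WANTED/HELD state names follow the usual presentation of the algorithm, and breaking ties by process ID is an assumption here.

def should_reply(my_state, my_ts, my_id, req_ts, req_id):
    # Ricart-Agrawala sketch: reply immediately, or defer?
    if my_state == "RELEASED":
        return True                            # we don't want the lock
    if my_state == "HELD":
        return False                           # defer until we release
    # my_state == "WANTED": lower (timestamp, id) wins the lock first
    return (req_ts, req_id) < (my_ts, my_id)

print(should_reply("WANTED", my_ts=7, my_id=2, req_ts=5, req_id=3))  # True
print(should_reply("HELD",   my_ts=7, my_id=2, req_ts=5, req_id=3))  # False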
In the token ring algorithm, the actual topology is not a ring, but for locking purposes there is a logical ring and processes only talk to neighboring processes. The process holding the token holds the lock at that time. The token is circulated through the logical ring. There is no method to request the token; a process must wait until the token reaches it. Once the process has the token it can enter the critical section.
This was designed as part of the Token Ring networking protocol. In physical networking, only one node can transmit at a time. If multiple nodes transmit at once there is a chance of collision. Ethernet handled this by detecting collisions: a node transmits, and if there is a collision it backs off and will eventually succeed. In Token Ring this was handled using locks: only one machine in the network holds the token at any instant, and only that machine transmits at that time.
One problem in this algorithm is the loss of the token. Regenerating the token is non-trivial, as you cannot simply use a timeout strategy.
Question: In a Token Ring, when should you hold the token?
Answer: If a process wants to send data on the network it will wait until it receives the token. It will hold the token until it completes the transmission and then pass the token to the next process. If it does not require the token it simply passes it on.
Question: Is the token generated by the process?
Answer: A token is a special message that circulates in the network. It is generated when the system starts. A process can hold the message or pass it to the next.
Question: In Token Ring, if the time limit is finished and the process is still in the critical section, what happens?
Answer: In general network transmission this can limit the amount of data you can transmit. But if this is used for locking, the message can be held for an arbitrary amount of time (based on the critical section). This may complicate token recovery because we can't distinguish whether a token is lost or a process is in a long critical section.
This was developed by Google to provide a service that can manage lots of locks for different subgroups of applications. Each Chubby cell has a group of 5 machines supporting 10,000 servers (managing the locks of the applications running on them). It was designed for coarse-grain locking with high reliability. One of the 5 machines is maintained outside the data center for recovery.
Chubby uses a distributed lock to elect a leader: the process with the lock is elected as leader. The 5 processes in the cell run an internal leader election to elect a primary; the others are lock workers, used for replication. The applications in the system ask for locks and release locks using RPC calls. All lock requests go to the primary. The lock state is kept persistent on disk by all the machines. Chubby uses a file abstraction for locks: to lock and unlock, a file is locked and unlocked respectively, and state information can also be kept in the file. It supports reader-writer locks. If the primary fails, a new leader election is triggered.
Introduction
This chapter discusses three ways to implement logical time - scalar time,
vector time, and matrix time.
Causality among events in a distributed system is a powerful concept in
reasoning, analyzing, and drawing inferences about a computation.
The knowledge of the causal precedence relation among the events of
processes helps solve a variety of problems in distributed systems, such as
distributed algorithms design, tracking of dependent events, knowledge about
the progress of a computation, and concurrency measures.
Definition
A system of logical clocks consists of a time domain T and a logical clock C .
Elements of T form a partially ordered set over a relation <.
Relation < is called the happened before or causal precedence. Intuitively,
this relation is analogous to the earlier than relation provided by the physical
time.
The logical clock C is a function that maps an event e in a distributed
system to an element in the time domain T , denoted as C(e) and called the
timestamp of e, and is defined as follows:
C : H → T
such that the following property is satisfied:
for two events ei and ej , ei → ej =⇒ C(ei ) < C(ej ).
Scalar Time
R2: Each message piggybacks the clock value of its sender at sending time.
When a process pi receives a message with timestamp Cmsg , it executes the
following actions:
◮ Ci := max(Ci , Cmsg )
◮ Execute R1.
◮ Deliver the message.
Figure 3.1 shows evolution of scalar time.
Scalar Time
[Figure 3.1: evolution of scalar time across processes p1, p2, and p3; event b occurs on p3.]
Basic Properties
Consistency Property
Scalar clocks satisfy the monotonicity and hence the consistency property:
for two events ei and ej , ei → ej =⇒ C(ei ) < C(ej ).
Total Ordering
Scalar clocks can be used to totally order events in a distributed system.
The main problem in totally ordering events is that two or more events at
different processes may have identical timestamp.
For example in Figure 3.1, the third event of process P1 and the second event
of process P2 have identical scalar timestamp.
Such ties can be broken on the basis of process identifiers: events are totally ordered by the tuple (timestamp, process id).
Properties. . .
Event counting
If the increment value d is always 1, the scalar time has the following
interesting property: if event e has a timestamp h, then h-1 represents the
minimum logical duration, counted in units of events, required before
producing the event e;
We call it the height of the event e.
In other words, h-1 events have been produced sequentially before the event e
regardless of the processes that produced these events.
For example, in Figure 3.1, five events precede event b on the longest causal
path ending at b.
No Strong Consistency
The system of scalar clocks is not strongly consistent; that is, for two events e_i and e_j, C(e_i) < C(e_j) ⇏ e_i → e_j.
For example, in Figure 3.1, the third event of process P1 has a smaller scalar timestamp than the third event of process P2; however, the former did not happen before the latter.
The reason that scalar clocks are not strongly consistent is that the logical local clock and the logical global clock of a process are squashed into one, resulting in the loss of causal dependency information among events at different processes.
For example, in Figure 3.1, when process P2 receives the first message from process P1, it updates its clock to 3, forgetting that the timestamp of the latest event at P1 on which it depends is 2.
Vector Time
Process pi uses the following two rules R1 and R2 to update its clock:
R1: Before executing an event, process pi updates its local logical time as
follows:
vti [i] := vti [i] + d (d > 0)
R2: Each message m is piggybacked with the vector clock vt of the sender process at sending time. On the receipt of such a message (m, vt), process p_i executes the following sequence of actions:
◮ Update its global logical time as follows: 1 ≤ k ≤ n : vt_i[k] := max(vt_i[k], vt[k])
◮ Execute R1.
◮ Deliver the message m.
Vector Time
The timestamp of an event is the value of the vector clock of its process
when the event is executed.
Figure 3.2 shows an example of vector clocks progress with the increment
value d=1.
Initially, a vector clock is [0, 0, 0, ...., 0].
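Rules R1 and R2 translate directly into a small sketch (d = 1, n processes, illustrative names):

class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n              # initially [0, 0, ..., 0]
        self.i = i                    # index of this process

    def tick(self):                   # R1: vt_i[i] := vt_i[i] + d, with d = 1
        self.v[self.i] += 1

    def send(self):                   # piggyback a copy of vt on the message
        self.tick()
        return list(self.v)

    def receive(self, vt):            # R2: component-wise max, then R1
        self.v = [max(a, b) for a, b in zip(self.v, vt)]
        self.tick()

p1, p2 = VectorClock(3, 0), VectorClock(3, 1)
m = p1.send()                         # p1 is now [1, 0, 0]
p2.receive(m)
print(p2.v)                           # [1, 1, 0]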
Vector Time
[Figure 3.2: an example of vector clocks progressing on processes p1, p2, and p3 with increment value d = 1.]
Vector Time
Comparing Vector Timestamps
vh = vk ⇔ ∀x : vh[x] = vk[x]
vh ≤ vk ⇔ ∀x : vh[x] ≤ vk[x]
vh < vk ⇔ vh ≤ vk and ∃x : vh[x] < vk[x]
vh ∥ vk ⇔ ¬(vh < vk) ∧ ¬(vk < vh)
If the process at which an event occurred is known, the test to compare two timestamps can be simplified as follows: if events x and y respectively occurred at processes p_i and p_j and are assigned timestamps vh and vk, respectively, then
x → y ⇔ vh[i] ≤ vk[i]
x ∥ y ⇔ vh[i] > vk[i] ∧ vh[j] < vk[j]
Vector Time
Properties of Vector Time
Isomorphism
If events in a distributed system are timestamped using a system of vector clocks, we have the following property. If two events x and y have timestamps vh and vk, respectively, then
x → y ⇔ vh < vk
x ∥ y ⇔ vh ∥ vk.
Vector Time
Strong Consistency
The system of vector clocks is strongly consistent; thus, by examining the
vector timestamp of two events, we can determine if the events are causally
related.
However, Charron-Bost showed that the dimension of vector clocks cannot be
less than n, the total number of processes in the distributed computation, for
this property to hold.
Event Counting
If d=1 (in rule R1), then the i th component of vector clock at process pi ,
vti [i], denotes the number of events that have occurred at pi until that
instant.
So, if an event e has timestamp vh, vh[j] denotes the number of events executed by process p_j that causally precede e. Clearly, Σ_j vh[j] − 1 represents the total number of events that causally precede e in the distributed computation.
The Singhal–Kshemkalyani differential technique: when p_i sends a message to p_j, it piggybacks only those entries of its vector clock that have changed since the last message sent to p_j, as a set of (process, latest value) pairs. Thus this technique cuts down the message size, communication bandwidth and buffer (to store messages) requirements.
In the worst of case, every element of the vector clock has been updated at
pi since the last message to process pj , and the next message from pi to pj
will need to carry the entire vector timestamp of size n.
However, on the average the size of the timestamp on a message will be less
than n.
This method is illustrated in Figure 3.3. For instance, the second message
from p3 to p2 (which contains a timestamp {(3, 2)}) informs p2 that the third
component of the vector clock has been modified and the new value is 2.
This is because the process p3 (indicated by the third component of the
vector) has advanced its clock value from 1 to 2 since the last message sent
to p2 .
This technique substantially reduces the cost of maintaining vector clocks in
large systems, especially if the process interactions exhibit temporal or spatial
localities.
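A sketch of the differential idea, using 0-based indices (Figure 3.3 uses 1-based (process, value) pairs):

def diff_entries(vt, last_sent):
    # Send only the (index, value) pairs changed since the last
    # message to this destination.
    return {k: v for k, (v, old) in enumerate(zip(vt, last_sent)) if v != old}

def apply_diff(vt, diff):
    for k, v in diff.items():
        vt[k] = max(vt[k], v)         # receiver merges the changed entries

vt, last = [1, 4, 2, 0], [1, 3, 1, 0]
print(diff_entries(vt, last))         # {1: 4, 2: 2}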
[Figure 3.3: vector clock progress with the Singhal–Kshemkalyani technique on processes p1–p4; messages carry only the changed entries, e.g. {(1,1)}, {(3,1)}, {(3,2)}, {(3,4),(4,1)}, and {(4,1)}.]
Matrix Time
Process pi uses the following rules R1 and R2 to update its clock:
R1 : Before executing an event, process pi updates its local logical time as
follows:
mti [i, i] := mti [i, i] + d (d > 0)
R2: Each message m is piggybacked with matrix time mt. When pi receives
such a message (m,mt) from a process pj , pi executes the following sequence
of actions:
◮ Update its global logical time as follows:
(a) 1 ≤ k ≤ n : mt_i[i, k] := max(mt_i[i, k], mt[j, k])
(That is, update its row mt_i[i, ∗] with the p_j's row in the received timestamp, mt.)
(b) 1 ≤ k, l ≤ n : mt_i[k, l] := max(mt_i[k, l], mt[k, l])
◮ Execute R1.
◮ Deliver message m.
Matrix Time
[Figure: evolution of matrix time — events e_k^1, e_k^2 on p_k, e_j^1, e_j^2 on p_j, and event e on p_i, connected by messages m1–m4; mt_e denotes the matrix timestamp of e, with entries such as mt_e[i,k], mt_e[k,j], and mt_e[j,j].]
Matrix Time
Basic Properties
Vector mti [i, .] contains all the properties of vector clocks.
In addition, matrix clocks have the following property:
mink (mti [k, l]) ≥ t ⇒ process pi knows that every other process pk knows
that pl ’s local time has progressed till t.
◮ If this is true, it is clear that process pi knows that all other processes know
that pl will never send information with a local time ≤ t.
◮ In many applications, this implies that processes will no longer require from pl
certain information and can use this fact to discard obsolete information.
If d is always 1 in rule R1, then mt_i[k, l] denotes the number of events that have occurred at p_l and are known by p_k, as far as p_i's knowledge is concerned.
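The discard property above is easy to check mechanically; the sketch below assumes mt is an n x n list of lists with mt[k][l] = what p_k knows of p_l's time.

def can_discard(mt, l, t):
    # If min_k mt[k][l] >= t, every process knows p_l's local time has
    # passed t, so information from p_l with timestamp <= t is obsolete.
    return min(row[l] for row in mt) >= t

mt = [[5, 3, 4],
      [4, 6, 4],
      [4, 3, 7]]
print(can_discard(mt, l=0, t=4))      # True: everyone knows p_0 passed 4
print(can_discard(mt, l=1, t=4))      # False: p_0 and p_2 only know 3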
Virtual Time
If a conflict is discovered, the offending processes are rolled back to the time
just before the conflict and executed forward along the revised path.
Detection of conflicts and rollbacks are transparent to users.
The implementation of Virtual Time using the Time Warp mechanism makes the following optimistic assumption: synchronization conflicts, and thus rollbacks, generally occur rarely.
Next, we discuss in detail Virtual Time and how the Time Warp mechanism is used to implement it.
A problem arises when a message arrives at a process late, that is, the virtual receive time of the message is less than the local virtual time at the receiver process when the message arrives.
Virtual time systems are subject to two semantic rules similar to Lamport’s
clock conditions:
◮ Rule 1: Virtual send time of each message < virtual receive time of that
message.
◮ Rule 2: Virtual time of each event in a process < Virtual time of next event in
that process.
The above two rules imply that a process sends all messages in increasing
order of virtual send time and a process receives (and processes) all messages
in the increasing order of virtual receive time.
If event A has an earlier virtual time than event B, we need to execute A before B only if there is a causal chain from A to B. If there is no such causal chain, better performance can be achieved by scheduling A concurrently with B or scheduling A after B.
If A and B have exactly the same virtual time coordinate, then there is no
restriction on the order of their scheduling.
If A and B are distinct events, they will have different virtual space
coordinates (since they occur at different processes) and neither will be a
cause for the other.
To sum it up, events with virtual time < ‘t’ complete before the starting of
events at time ‘t’ and events with virtual time > ‘t’ will start only after
events at time ‘t’ are complete.
When a message is sent, the virtual send time is copied from the sender’s
virtual clock while the name of the receiver and virtual receive time are
assigned based on application specific context.
All arriving messages at a process are stored in an input queue in the
increasing order of timestamps (receive times).
Processes will receive late messages due to factors such as different
computation rates of processes and network delays.
The semantics of virtual time demands that incoming messages be received
by each process strictly in the timestamp order.
Over the length of a computation, each process may roll back several times while generally progressing forward, with rollback completely transparent to other processes in the system.
Rollback in a distributed system is complicated: A process that wants to
rollback might have sent many messages to other processes, which in turn
might have sent many messages to other processes, and so on, leading to
deep side effects.
For rollback, messages must be effectively “unsent” and their side effects
should be undone. This is achieved efficiently by using antimessages.
Search the state queue for the last saved state with a timestamp that is less than the timestamp of the message received, and restore it. Make the timestamp of the received message the value of the local virtual clock and discard from the state queue all states saved after this time. Then resume execution forward from this point.
Now all the messages that were sent between the current state and the earlier state must be "unsent". This is taken care of by executing a simple rule:
"To unsend a message, simply transmit its antimessage."
This results in antimessages following the positive ones to the destination. A negative message causes a rollback at its destination if its virtual receive time is less than the receiver's virtual time.
Depending on the timing, there are several possibilities at the receiver’s end:
First, the original (positive) message has arrived but not yet been processed
at the receiver.
In this case, the negative message causes no rollback, however, it annihilates
with the positive message leaving the receiver with no record of that message.
Second, the original positive message has already been partially or completely
processed by the receiver.
In this case, the negative message causes the receiver to roll back to a virtual
time when the positive message was received.
It will also annihilate the positive message, leaving the receiver with no record that the message existed. When the receiver executes again, the execution will assume that these messages never existed.
A rolled back process may send antimessages to other processes.
A negative message can also arrive at the destination before the positive one.
In this case, it is enqueued and will be annihilated when positive message
arrives.
If it is a negative message's turn to be executed at a process's input queue, the receiver may take any action, e.g. a no-op.
Any action taken will eventually be rolled back when the corresponding
positive message arrives.
An optimization would be to skip the antimessage from the input queue and
treat it as a no-op, and when the corresponding positive message arrives, it
will annihilate the negative message, and inhibit any rollback.
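Annihilation in the input queue can be sketched as follows, representing a message as a (virtual receive time, id, sign) tuple with sign +1 for a positive message and -1 for its antimessage (the rollback machinery itself is omitted):

def enqueue(input_queue, msg):
    # A message and its antimessage cancel each other on meeting.
    t, mid, sign = msg
    anti = (t, mid, -sign)
    if anti in input_queue:
        input_queue.remove(anti)      # annihilate: no record remains
    else:
        input_queue.append(msg)       # keep, ordered by receive time
        input_queue.sort()

q = []
enqueue(q, (10, "m1", +1))            # positive message arrives first
enqueue(q, (10, "m1", -1))            # antimessage annihilates it
print(q)                              # []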
Motivation
In centralized systems, there is only a single clock. A process gets the time by
simply issuing a system call to the kernel.
In distributed systems, there is no global clock or common memory. Each
processor has its own internal clock and its own notion of time.
These clocks can easily drift seconds per day, accumulating significant errors
over time.
Also, because different clocks tick at different rates, they may not remain
always synchronized although they might be synchronized when they start.
This clearly poses serious problems to applications that depend on a
synchronized notion of time.
Motivation
Clock Inaccuracies
Physical clocks are synchronized to an accurate real-time standard like UTC
(Universal Coordinated Time).
However, due to the clock inaccuracy discussed above, a timer (clock) is said to be working within its specification if

1 − ρ ≤ dC/dt ≤ 1 + ρ \qquad (1)

where the constant ρ is the maximum skew rate specified by the manufacturer.
Figure 3.5 illustrates the behavior of fast, slow, and perfect clocks with
respect to UTC.
[Figure 3.5: The behavior of fast (dC/dt > 1), slow (dC/dt < 1), and perfect (dC/dt = 1) clocks with respect to UTC; axes are clock time C versus UTC t.]
[Figure: a message exchange between B (which records timestamps T1, T2) and A (which records timestamps T3, T4).]
Let a = T1 − T3 and b = T2 − T4.
If the network delay difference from A to B and from B to A, called the differential delay, is small, the clock offset θ and roundtrip delay δ of B relative to A at time T4 are approximately given by the following:

θ = (a + b)/2, \qquad δ = a − b \qquad (2)
Each NTP message includes the latest three timestamps T1 , T2 and T3 ,
while T4 is determined upon arrival.
Thus, both peers A and B can independently calculate delay and offset using
a single bidirectional message stream as shown in Figure 3.7.
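Equation (2) in code, with hypothetical timestamp values:

def offset_delay(t1, t2, t3, t4):
    # Clock offset theta and roundtrip delay delta of B relative to A
    # (eq. 2): a = T1 - T3, b = T2 - T4.
    a, b = t1 - t3, t2 - t4
    return (a + b) / 2, a - b

theta, delta = offset_delay(t1=102.0, t2=103.0, t3=100.0, t4=103.0)
print(theta, delta)                   # offset 1.0, roundtrip delay 2.0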