
Lecture Notes in Parallel and Distributed Systems

By

Tarun Biswas

Assistant Professor

Department of Computer Science & Engineering

National Institute of Technology, Sikkim - 737139


Sikkim, India

1. Chapter 1

• What is a Distributed System: Distributed computing is decentralised and parallel computing, using two
or more computational units communicating over a network to accomplish a common objective by sharing
their resources. The types of hardware, infrastructure, interfaces, programming languages, operating systems
and other resources may be dynamic in nature.

Figure 1: A sample distributed system

A collection of computers that do not share common memory or a common physical clock, that communicate
by message passing over a communication network, and where each computer has its own memory and runs
its own operating system. Typically the computers are semi-autonomous and are loosely coupled while they
cooperate to address a problem collectively.
Example: Assume there is a set of n tasks denoted as T = {t1, t2, t3, . . . tn} and a set of m processors
denoted as P = {p1, p2, p3, . . . pm}. All the tasks need to be completed by maintaining cooperativeness
among the different processors.
Features: i) no common physical clock; ii) no shared memory; iii) geographical separation; iv) autonomy
and heterogeneity.

Kinds of DS:

• Distributed Computing Systems: i) Cluster Computing Systems ii) Grid Computing Systems – high degree
of heterogeneity: no assumptions are made concerning hardware, operating systems, networks, administrative
domains, security policies, etc.

• Distributed Information Systems: i) Transaction Processing Systems

• Distributed Pervasive Systems: as its name suggests, a distributed pervasive system is part of our
surroundings (and as such, is generally inherently distributed). An important feature is the general lack
of human administrative control. At best, devices can be configured by their owners, but otherwise they
need to automatically discover their environment and "nestle in" as best as possible. i) Home Systems
ii) Sensor Networks

Motivation:


• Inherently distributed computations.

• Resource sharing.

• Access to geographically remote data and resources.

• Enhanced reliability- availability, integrity, fault-tolerance.

• Increased performance/cost ratio.

• Scalability.

• Modularity and incremental expandability.

Goals of DS:

• Making Resources Accessible

• Distribution Transparency: access, location, migration, relocation, replication, concurrency and failure.

• Openness

• Scalability: a system can be scalable with respect to its size; a geographically scalable system is one in
which the users and resources may lie far apart; an administratively scalable system is one that can still
be easy to manage even if it spans many independent administrative organizations.

• Enhanced Reliability: availability, i.e., the resource should be accessible at all times; integrity, i.e., the
value/state of the resource should be correct, in the face of concurrent access from multiple processors,
as per the semantics expected by the application; fault-tolerance, i.e., the ability to recover from system
failures, where such failures may be defined to occur in one of many failure models.

• Pitfalls

Parallel computing

• A multiprocessor system is a parallel system in which the multiple processors have direct access to shared
memory which forms a common address space. Such processors usually do not have a common clock.

Figure 2: A sample parallel system


– A multiprocessor system usually corresponds to a uniform memory access (UMA) architecture in which
the access latency, i.e., waiting time, to complete an access to any memory location from any processor
is the same.

• A multicomputer parallel system is a parallel system in which the multiple processors do not have direct
access to shared memory. The memory of the multiple processors may or may not form a common address
space. Such computers usually do not have a common clock.
– A multicomputer system that has a common address space usually corresponds to a non-uniform memory
access (NUMA) architecture in which the latency to access various shared memory locations from the
different processors varies.

• Array processors belong to a class of parallel computers that are physically co-located, are very tightly
coupled, and have a common system clock (but may not share memory and communicate by passing data
using messages).

• The goal is to solve a problem with an algorithm whose instructions are executed in parallel.

• The corresponding computational model is characterized by multiple processors and associated mechanisms
of cooperation.

• The algorithm has to exploit in an efficient way the parallelism that can be made explicit, with the goal of
making the execution faster.

• Subdivide (decompose) the problem among the various processors/workers.

• Processors work on their own parts of the whole problem, but cooperate for solving shared portions of the
problem. The final goal is:
· to balance the load among processors
· to reduce overheads due to parallelization
· to reduce idle times
· to diminish as much as possible the communications/synchronizations needed for processor cooperation

Coupling, parallelism, concurrency, and granularity

• The degree of coupling among a set of modules, whether hardware or software, is measured in terms of
the interdependency and binding and/or homogeneity among the modules. When the degree of coupling
is high (low), the modules are said to be tightly (loosely) coupled. Ex. SIMD and MISD.
– Network operating system: the operating system running on loosely coupled processors (i.e., heterogeneous
and/or geographically distant processors), which are themselves running loosely coupled software
(i.e., software that is heterogeneous).
– The operating system running on loosely coupled processors, which are running tightly coupled software
(i.e., the middleware software on the processors is homogeneous), is classified as a distributed operating
system.
– The operating system running on tightly coupled processors, which are themselves running tightly
coupled software, is classified as a multiprocessor operating system.


• Concurrency of a program: this is a broader term that means roughly the same as parallelism of a
program, but is used in the context of distributed programs. The parallelism/concurrency in a
parallel/distributed program can be measured by the ratio of the number of local (non-communication and
non-shared memory access) operations to the total number of operations, including the communication
or shared memory access operations.

• Granularity of a program- The ratio of the amount of computation to the amount of communication within
the parallel/distributed program is termed as granularity.

1.1. Comparison Metrics


1. Makespan: each task ti ∈ T needs to be assigned and should complete its execution within the shortest
possible time. Makespan is the completion time of all the tasks. It can be mathematically defined as

Mks = max(FT(t1), FT(t2), FT(t3), . . . FT(tn))    (1)

2. Load Balance: load balancing is a process where workloads are distributed in such a way as to improve
system performance in terms of optimized resource use, maximized throughput, minimized response
time, and avoidance of overload. Load balance is computed as follows:

LB = sqrt( ( Σ_{i=1}^{m} (µ − Ex(pi))² ) / m )    (2)

where µ = ( Σ_{i=1}^{m} Ex(pi) ) / m is the average execution time.
3. Processor Utilization: it is the ratio between the average amount of time for which the processors
are busy and the overall system schedule time. It is calculated as follows:

UT = ( ( Σ_{i=1}^{m} Ex(pi) ) / m ) / Mks × 100    (3)

4. Speedup Factor: the speedup factor can be defined as the ratio between the sequential execution time
and the makespan (parallel schedule length), as given below:

SP = ( Σ_{j=1}^{n} ET(tj, pk) ) / Mks    (4)

5. Efficiency: it is the ratio between SP and the number of processors:

Ef = SP / (total number of processors)    (5)

6. Cost: it is the product Mks × (total number of processors).
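As an illustration of metrics (1)-(6), the following short Python sketch (not part of the original notes; the input format is an assumption) computes them from per-task finish times, per-processor busy times, and the sequential execution time:

import math

def metrics(finish_times, busy_times, seq_time):
    # finish_times: FT(t_i) for every task; busy_times: Ex(p_i) for every processor;
    # seq_time: total sequential execution time, i.e. the sum of ET(t_j, p_k).
    m = len(busy_times)
    mks = max(finish_times)                                        # Eq. (1): makespan
    mu = sum(busy_times) / m                                       # average execution time
    lb = math.sqrt(sum((mu - ex) ** 2 for ex in busy_times) / m)   # Eq. (2): load balance
    ut = (sum(busy_times) / m) / mks * 100                         # Eq. (3): utilization (%)
    sp = seq_time / mks                                            # Eq. (4): speedup
    ef = sp / m                                                    # Eq. (5): efficiency
    cost = mks * m                                                 # Eq. (6): cost
    return mks, lb, ut, sp, ef, cost

# Example: 3 processors, 6 tasks
print(metrics([4, 6, 7, 9, 10, 12], [10, 11, 9], 30))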


Another way to find the speedup with p processors is the following: let ξ be the fraction of the problem that is
sequential, so a fraction (1 − ξ) of the problem is parallel. The best parallel time Tp on p processors is then

Tp = Ts × (ξ + (1 − ξ)/p)    (6)

where Ts is the sequential execution time. The speedup with p processors is:

SP = 1 / (ξ + (1 − ξ)/p)    (7)
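A one-line Python illustration of Eq. (7) (assuming only that the sequential fraction ξ is known):

def amdahl_speedup(xi, p):
    # xi: sequential fraction of the problem; p: number of processors (Eq. (7))
    return 1.0 / (xi + (1.0 - xi) / p)

for p in (2, 4, 16, 1024):
    print(p, round(amdahl_speedup(0.1, p), 2))   # the speedup approaches 1/xi = 10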

(a) A multiprocessor system is a parallel system in which the multiple processors have direct access to shared
memory which forms a common address space.
(b) A multicomputer parallel system is a parallel system in which the multiple processors do not have direct
access to shared memory. The memory of the multiple processors may or may not form a common address
space. Such computers usually do not have a common clock.
Figure: (a) uniform memory access (UMA) multiprocessor system; (b) non-uniform memory access (NUMA) multiprocessor.

2. Pipeline

Figure 3: Pictorial overview of execution of an instruction

2.1. Models
A distributed system consists of a set of processors that are connected by a communication network. The
communication network provides the facility of information exchange among processors. The communication
delay is finite but unpredictable. The processors do not share a common global memory and communicate solely
by passing messages over the communication network. There is no physical global clock in the system to which
processes have instantaneous access. The communication medium may deliver messages out of order, messages
may be lost, garbled, or duplicated due to timeout and retransmission, processors may fail, and communication
links may go down. The system can be modeled as a directed graph in which vertices represent the processes
and edges represent unidirectional communication channels.

• Asynchronous system: no bound on the time to deliver a message or to perform a local computation. Example: the Internet.


2.2. Single Node Perspective


1. Wait for a message.
2. When a message is received, do some local computation and send some messages.
3. Go to 1.
4. Configuration: a snapshot of the states of all the nodes, Con = {q1, q2, q3, . . . qm}, where qi is the state of pi.
In an initial configuration every Inbuf is empty (Inbuf = ∅).
5. The system evolves through events: computation events Comi and delivery events Del(i,j,m). Comi means
applying the transition function f to the state of node i; Del(i,j,m) moves a message m from the Outbuf of pi
to the Inbuf of pj.
6. Execution (EX): an execution of a DS is a sequence of alternating configurations and events,
EX = {Con0, Evnt1, Con1, Evnt2, . . . }, where Con0 is an initial configuration.
7. If Evntk is Comi, then Con(k−1) changes to Conk by applying the transition function f to the state of pi
in Con(k−1).
8. If Evntk is Del(i,j,m), then Con(k−1) changes to Conk by moving message m from the Outbuf of pi to the
Inbuf of pj.
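The following minimal Python sketch (an illustration, not part of the notes; the class and function names are my own) simulates this model for two nodes: comp events apply a simple transition function and del events move messages between buffers.

class Node:
    # State of one node: a local value plus inbuf and outbuf.
    def __init__(self, name):
        self.name, self.x, self.inbuf, self.outbuf = name, 0, [], []

    def comp(self):
        # Computation event: consume at most one message, update the local state,
        # and place new messages in the outbuf.
        if self.inbuf:
            m = self.inbuf.pop(0)
            self.x += m
            self.outbuf.append(self.x)

def deliver(sender, receiver):
    # Delivery event Del(i, j, m): move one message from i's Outbuf to j's Inbuf.
    if sender.outbuf:
        receiver.inbuf.append(sender.outbuf.pop(0))

p1, p2 = Node("p1"), Node("p2")
p1.outbuf.append(5)                                  # initial configuration Con0
execution = [lambda: deliver(p1, p2), p2.comp,       # Evnt1, Evnt2, ...
             lambda: deliver(p2, p1), p1.comp]
for event in execution:
    event()
print(p1.x, p2.x)                                    # resulting configuration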

2.3. Happen Before Relationship


Let S be a set of events (sending msg, receiving msg and local computation) in DCS.

• If a and b are events in the same process, and a occurs before b, then a → b.

• If a is the event of sending a message by a process pi and b is the event of receiving the same message
by pj (i ≠ j), then a → b.

• If a → b and b → c then a → c (transitivity).

• ∀a ∈ S, a ↛ a (irreflexivity).

• a ∥ b if a ↛ b and b ↛ a (a and b are concurrent events).

A system of logical clocks consists of a time domain T and a logical clock C. The logical clock C is a function
that maps an event e in a distributed system to an element of the time domain T, denoted C(e) and called the
timestamp of e; it is defined as C : H → T such that for any two events ei and ej, ei → ej ⇒ C(ei) < C(ej)
(the clock consistency condition). If, in addition, C(ei) < C(ej) ⇒ ei → ej, the system of clocks is said to be
strongly consistent.
R1: Before executing an event (send, receive, or internal), process pi executes Ci = Ci + d, d > 0.
d can have a different value each time, and this value may be application-dependent; however, typically d is kept
at 1 because this is able to identify the time of each event uniquely at a process.
R2: When a process pi receives a message with timestamp Cmsg, it executes the following actions:
1. Ci = max(Ci, Cmsg);
2. execute R1;
3. deliver the message.
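A minimal Lamport clock in Python implementing R1 and R2 with d = 1 (an illustration; the class and method names are my own choices):

class LamportClock:
    def __init__(self):
        self.c = 0

    def tick(self):                 # R1: before any event, Ci = Ci + d with d = 1
        self.c += 1
        return self.c

    def send(self):                 # a send event; the message carries the new timestamp
        return self.tick()

    def receive(self, c_msg):       # R2: Ci = max(Ci, Cmsg), then execute R1 and deliver
        self.c = max(self.c, c_msg)
        return self.tick()

p1, p2 = LamportClock(), LamportClock()
t = p1.send()                       # p1 sends a message timestamped t = 1
p2.tick()                           # an internal event at p2
print(p2.receive(t))                # p2 delivers the message; its clock advances past t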


2.4. Leader Election


Leader election requires that all the processes agree on a common distinguished process, also termed the
leader.

• Bully algorithm, for a fully connected network, by Garcia-Molina (1982):
– P sends an ELECTION message to all processes with higher numbers.
– If no one responds, P wins the election and becomes coordinator.
– If one of the higher-ups answers, it takes over; P's job is done.
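A much simplified, single-process simulation of the bully election in Python (a sketch only; a real implementation exchanges ELECTION/OK/COORDINATOR messages over the network and uses timeouts to detect non-responding processes):

def bully_election(initiator, alive):
    # alive: ids of processes currently up; initiator: id that starts the election.
    higher = [p for p in alive if p > initiator]   # send ELECTION to all higher ids
    if not higher:
        return initiator                           # nobody responds: the initiator wins
    return bully_election(min(higher), alive)      # a higher process takes over

alive = {1, 2, 4, 5}                  # processes 6 and 7 have crashed
print(bully_election(2, alive))       # 5 becomes the coordinator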

Principles and characteristics of
distributed systems and
environments
Definition of a distributed system
• A distributed system is a collection of independent computers that appears to its users as a single
coherent system.
• Distributed computing is decentralised and parallel computing, using two or more computers communicating
over a network to accomplish a common objective or task. The types of hardware, programming languages,
operating systems and other resources may vary drastically.
• Leslie Lamport: "A distributed system is one in which the failure of a computer you didn't even know
existed can render your computer unusable."
Definition cont.
• 2 aspects:
– Hardware: the machines are autonomous
– Software: the users think they are dealing with a single system
• Important characteristics:
– Communication is hidden from users
– Applications can interact in a uniform and consistent way
– Relatively easy to expand or scale (a DS will normally be continuously available)
• A distributed system should be functionally equivalent to the systems of which it is composed.
Distributed system as Middleware service
• If the system as a whole looks and acts like a classical single-processor timesharing system, it qualifies
as a distributed system.
Goals
A DS should easily connect users to resources; it should hide the fact that resources are distributed across
a network; it should be open and scalable.
4 goals:
• Connecting users and resources
• Transparency
• Openness
• Scalability
Connecting users and resources
• Share resources (hw components, data)
• Groupware
• Security aspects
Transparency
Transparent ~ the system presents itself to users and applications as if it were only a single computer system.
Different forms of transparency in a distributed system are defined in the ANSA Reference Manual [1989]
and the ISO Reference Model for Open Distributed Processing (RM-ODP) [1992/7].
Openness
• A system that offers services according to standard rules that describe the syntax and semantics
• Generally specified through interfaces, often described in an Interface Definition Language (IDL)
• Interoperability – characterizes the extent to which two implementations of a system can co-exist and
work together
• Portability – characterizes to what extent an application developed for a distributed system A can be
executed, without modification, on a distributed system B that implements the same interfaces as A
• Small, replaceable and adaptable parts – provide definitions of not only the highest-level interfaces, those
used by applications, but also definitions for interfaces to internal parts
Scalability
Measured along at least 3 different dimensions:
• Size – easily add more users and resources to the system
• Geographical scalability – users and resources may lie far apart
• Administrative scalability – the system spans many independent administrative organizations
Scalability in one or more dimensions often means loss of performance – it should not be significant.
Scalability cont.
Examples of scalability limitations:
• Performance
• Congestion of communication links
• Only decentralized algorithms should be used (no machine has complete information; decisions are based
on local information; failure of one machine does not ruin the algorithm; no implicit assumption of a
global clock)
Scaling techniques
3 techniques for scaling:
• Hiding communication latencies – asynchronous communication
• Distribution – domains
• Replication – caching -> consistency problems!
Why Distributed Systems?
• Geographically distributed environment
• Speed up
– Parallel vs. Distributed Systems (synchronous SIMD or MIMD; asynchronous processes)
• Resource sharing
– HW and SW (databases)
• Fault tolerance
– Detection, recovery
Logical distribution of functional capabilities: multiple processes, interprocess communication, disjoint
address spaces, collective goal.
Issues in Distributed Systems
• Knowledge of a process
– Identity, identity of its immediate neighbors, communication channels to neighbors
• Network topology
– Completely or sparsely connected, message routing, uni/bi-directional channels
• Degree of synchronization
– Clock drift, propagation delay
• Failures
– Type of failure and duration of failure
• Scalability
– Time and space complexity (O(log N) ~ excellent, O(N) ~ poor)
Common Subproblems

Leader election
Mutual exclusion
Time synchronization
Global state
Replica management
RM-ODP
Reference Model – Open Distributed Processing
• ITU-T X.9xx standards, X.901 – Overview
• Viewpoints (and their languages):
– Enterprise
– Information
– Computational
– Engineering
– Technology
Hardware concepts
A DS consists of multiple CPUs – there are several different ways of HW organization:
• Multiprocessors – have shared memory
• Multicomputers – without shared memory
Hardware concepts cont.
• Architecture of the interconnection network:
– Bus
– Switched
• Distributed computer systems (significant for multicomputers):
– Homogeneous – single interconnection network
– Heterogeneous – different networks (LANs connected through FDDI, ATM)
Multiprocessors
• Coherent memory
• Overloaded bus –> cache memory between CPU and bus (hit rate)
• A bus is suitable for up to about 256 CPUs
Multiprocessors cont.
• Crossbar switch – crosspoint switches
– n CPUs and n memories = n² crosspoint switches
• Omega network
– 2x2 switches, each with two inputs and two outputs
– Fast switching, not cheap
• NUMA (NonUniform Memory Access)
– Fast access to local memory, slow access to others' memory
– Better average access times than omega networks
Multiprocessors cont.
[Figure: a crossbar switch and an omega network.]
Homogeneous Multicomputer Systems
• Relatively easy compared to multiprocessors
• CPU-to-CPU and CPU-to-memory communication
• System Area Networks (SANs) – high-performance interconnection networks
– Bus-based – Fast Ethernet, 25-100 nodes
– Switch-based – messages are routed (mesh, hypercube)
• Vary widely, from Massively Parallel Processors (MPPs) to Clusters of Workstations (COWs)
Heterogeneous Multicomputer Systems
• Most distributed systems are built on them.
• Due to scale, inherent heterogeneity, and most of all, lack of a global system view, sophisticated software
is needed to build apps for heterogeneous multicomputers.
• Provide transparency for apps.
Examples: DAS (http://www.cs.vu.nl/~bal/das.html), the I-WAY project.
Software concepts
• Determines what a DS actually looks like
• Resource manager for the underlying hardware
• Operating systems for distributed computers:
– Tightly-coupled – distributed operating system (multiprocessors, homogeneous multicomputer systems)
– Loosely-coupled – network operating system (heterogeneous multicomputer systems)
– Middleware – enhancements to the services of network operating systems – distribution transparency
Software concepts cont.
Uniprocessor Operating Systems
• Implement a virtual machine
• Apps are protected from each other
• The OS should be in full control of how hw resources are used and shared
– Kernel mode
– User mode
• Architecture:
– Monolithic kernel
– Microkernel
Multiprocessor Operating Systems
• Support for multiple CPUs having access to shared memory
• IPC is done through memory – must protect against simultaneous access
• Synchronization primitives:
– Semaphore – atomic operations down and up
– Monitor – private variables and public procedures (operations)
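As an illustration of these primitives, a small Python sketch (not from the slides) in which a semaphore protects a counter in shared memory accessed by several threads:

import threading

counter = 0
sem = threading.Semaphore(1)              # binary semaphore: acquire = down, release = up

def worker(n):
    global counter
    for _ in range(n):
        sem.acquire()                     # down: enter the critical section
        counter += 1                      # exclusive access to the shared variable
        sem.release()                     # up: leave the critical section

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                            # always 40000 thanks to mutual exclusion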
Multicomputer Operating Systems
• Different structure and complexity than multiprocessor operating systems
• Communication is done by message passing
• Software implementation of shared memory – not always present
• Message passing primitives may vary widely in different OSs
Multicomputer Operating Systems cont.
• Reliability of message delivery?
Distributed Shared Memory Systems
• Page-based distributed shared memory
– The address space is divided into pages (4 kB or 8 kB)
– When a CPU references an address not present locally, a trap occurs and the OS fetches the page
– Essentially normal paging
Distributed Shared Memory Systems cont.
Figure: (a) pages in DSM; (b) CPU 1 references page 10; (c) situation if page 10 is read-only and
replication is used.
Network Operating Systems
• Do not assume that the underlying hardware is homogeneous and that it should be managed as if it were
a single system
• Facilities to allow users to make use of the services available on a specific machine
• Services are commonly accessed using commands:
rlogin machine
rcp machine1:file1 machine2:file2
Network Operating Systems cont.
• Network operating systems are clearly more primitive than distributed operating systems. The main
distinction between the two types of operating systems is that distributed operating systems make a
serious attempt to realize full transparency, that is, to provide a single-system view.
Middleware
• DOS and NOS do not satisfy the definition of a DS.
• Modern distributed systems are constructed by means of an additional layer called middleware.
• Middleware is an additional layer of software between applications and the network operating system.
Middleware cont.
• The goal is to hide the heterogeneity of the underlying platforms from applications.
Middleware cont.
• Middleware models (paradigms):
– Everything as a file (simple) – Plan 9, UNIX
– Distributed File Systems – reasonably scalable, which contributes to their popularity
– Remote Procedure Calls – applications are unaware of network communication
– Distributed Objects – applications are unaware of network communication
– Distributed documents – WWW
Middleware cont.
• Middleware services:
– High-level communication facilities – access transparency
– Naming – allows entities to be shared and accessed
– Persistence – through a distributed file system or database
– Distributed transactions – atomic multiple read and write operations; a write operation succeeds or fails
– Security – cannot rely on the underlying local operating system, must be implemented in the middleware;
security has turned out to be one of the hardest services to implement
Middleware and Openness
• Modern distributed systems are built as middleware
• Applications are independent of the operating system
• Independence is replaced by a strong dependency on specific middleware
• Middleware protocols and interfaces need to be the same in different implementations
Comparison between Systems
Client – Server Model
• Server – a process implementing a specific service
• Client – a process that requests a service from a server by sending a request and waiting for the response
Client – Server Model cont.
• Application layering – distinction of 3 levels:
– The user-interface level
– The processing level
– The data level
Multitiered architectures
• Three-tier architecture – the tiers are physically distributed
Modern architectures:
• Vertical distribution – multitiered architectures
• Horizontal distribution – the distribution is split into equivalent parts (clients or servers)
• Peer-to-peer distribution
Model of Distributed Computing
Q: What is a distributed system?
- Bunch of nodes/processes
- Sending messages over a network
- To solve a common goal (algorithm)

Modelling a Node
A single node has a bunch of neighbours
- Can send & receive messages
- Can do local computations

Model a node by a State Transition System (STS)
- Like a finite state machine, except
  o Need not be finite
  o No input

State Transition System (informal)


A state transition system consists of
- A set of states
- A rule for which state to go to from each state (transition function / binary relation)
- The set of starting states (initial states)
State Transition System (Example)
Example algorithm:
  x: 0
  while (x < 2) do
    x = x + 1;
  endwhile

Formally (states Xk correspond to the values of x):
- States {X0, X1, X2}
- Transition function {X0 -> X1, X1 -> X2}
- Initial States {X0}
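A direct Python rendering of this example STS (illustrative only; it follows the reading of the states given above):

states      = {"X0", "X1", "X2"}          # one state per value of x
transitions = {"X0": "X1", "X1": "X2"}    # X2 has no outgoing transition: the loop ends
initial     = "X0"

state = initial
while state in transitions:               # run the STS until no transition applies
    state = transitions[state]
print(state)                              # X2, i.e. x == 2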

Modelling a Node
State Machine of Node i
- Set of states Qi

Each state consists of
- 1 inbuffer (set)
- 1 outbuffer (set)
- Other data relevant to the algorithm

Initial States
- inbuf is empty

State of one Node: example states (figure)

Transition Functions
The state of a node, except outbuf, is called the accessible state of the node.
Transition function f
• Takes the accessible state and gives a new state
• Removes at most one message from inbuf in the new state
• Adds zero, one, or more new msgs to the outbuf of the new state
• Possibly modifies the local state

Transition Functions Formally
The state of a node is a triple <s, I, O>
- s is the local state
- I is inbuf
- O is outbuf
- f is the state transition function
- f(s_i, I_i, O_i) -> (s_{i+1}, I_{i+1}, O_{i+1})
- Removes at most one message m from inbuf
  o I_{i+1} = I_i \ {m}
- Computes a function f'(s_i, m) -> (s_{i+1}, {m1, …, mn})
- Adds {m1, …, mn} to outbuf
  o O_{i+1} = O_i ∪ {m1, …, mn}

Single Node Perspective
This is how a computer in a distributed system works:
- Wait for a message
- When a message is received, do some local computation, send some messages
- Go to 1
Is this a correct model?
- Determinism
- Atomicity

Single Node to a Distributed System
A configuration is a snapshot of the states of all nodes
- C = (q0, q1, ..., qn−1) where qi is the state of pi
An initial configuration is a configuration where every qi is an initial state

The system evolves through events
- Computation event at node i, comp(i)
- Delivery event of msg m from i to j, del(i, j, m)

Computation Event comp(i)
- Apply the transition function f on node i's state

Delivery Event del(i, j, m)
- Move message m from the outbuf of pi to the inbuf of pj

Execution
An execution is an infinite sequence
- config0, event1, config1, event2, config2 …
- config0 is an initial configuration

If eventk is comp(i)
- configk−1 changes to configk by applying pi's transition function on i's state in configk−1

If eventk is del(i, j, m)
- configk−1 changes to configk by moving m from i's outbuf to j's inbuf

Example Execution

[Figure: three successive configurations config0, config1, config2 of processes P1, P2, P3, each showing its
local value x, its inbuf and its outbuf; event1 = del(2, 3, m7) moves message m7 from P2's outbuf to P3's
inbuf, and event2 = comp(3) applies P3's transition function, consuming m7.]
Chapter 2: A Model of Distributed Computations

Ajay Kshemkalyani and Mukesh Singhal

Distributed Computing: Principles, Algorithms, and Systems

Cambridge University Press


A Distributed Program

A distributed program is composed of a set of n asynchronous processes, p1, p2, ..., pi, ..., pn.
The processes do not share a global memory and communicate solely by
passing messages.
The processes do not share a global clock that is instantaneously accessible
to these processes.
Process execution and message transfer are asynchronous.
Without loss of generality, we assume that each process is running on a
different processor.
Let Cij denote the channel from process pi to process pj and let mij denote a
message sent by pi to pj .
The message transmission delay is finite and unpredictable.


A Model of Distributed Executions

The execution of a process consists of a sequential execution of its actions.


The actions are atomic and the actions of a process are modeled as three
types of events, namely, internal events, message send events, and message
receive events.
Let eix denote the xth event at process pi .
For a message m, let send(m) and rec(m) denote its send and receive events,
respectively.
The occurrence of events changes the states of respective processes and
channels.
An internal event changes the state of the process at which it occurs.
A send event changes the state of the process that sends the message and
the state of the channel on which the message is sent.
A receive event changes the state of the process that receives the message
and the state of the channel on which the message is received.


A Model of Distributed Executions

The events at a process are linearly ordered by their order of occurrence.


The execution of process pi produces a sequence of events ei1 , ei2 , ..., eix ,
eix+1 , ... and is denoted by Hi where
Hi = (hi , →i )
hi is the set of events produced by pi and
binary relation →i defines a linear order on these events.
Relation →i expresses causal dependencies among the events of pi .


A Model of Distributed Executions

The send and the receive events signify the flow of information between
processes and establish causal dependency from the sender process to the
receiver process.
A relation →msg that captures the causal dependency due to message
exchange, is defined as follows. For every message m that is exchanged
between two processes, we have
send(m) →msg rec(m).
Relation →msg defines causal dependencies between the pairs of
corresponding send and receive events.


A Model of Distributed Executions

The evolution of a distributed execution is depicted by a space-time diagram.


A horizontal line represents the progress of the process; a dot indicates an
event; a slant arrow indicates a message transfer.
Since we assume that an event execution is atomic (hence, indivisible and
instantaneous), it is justified to denote it as a dot on a process line.
In the Figure 2.1, for process p1 , the second event is a message send event,
the third event is an internal event, and the fourth event is a message receive
event.


A Model of Distributed Executions


Figure 2.1: The space-time diagram of a distributed execution.


A Model of Distributed Executions

Causal Precedence Relation


The execution of a distributed application results in a set of distributed
events produced by the processes.
Let H=∪i hi denote the set of events executed in a distributed computation.
Define a binary relation → on the set H as follows, expressing causal dependencies between events in the
distributed execution:

∀e_i^x, ∀e_j^y ∈ H:  e_i^x → e_j^y  ⇔
    e_i^x →_i e_j^y, i.e., (i = j) ∧ (x < y),
    or  e_i^x →_msg e_j^y,
    or  ∃e_k^z ∈ H : e_i^x → e_k^z ∧ e_k^z → e_j^y

The causal precedence relation induces an irreflexive partial order on the


events of a distributed computation that is denoted as H=(H, →).


A Model of Distributed Executions

. . . Causal Precedence Relation


Note that the relation → is nothing but Lamport’s “happens before” relation.
For any two events ei and ej , if ei → ej , then event ej is directly or
transitively dependent on event ei . (Graphically, it means that there exists a
path consisting of message arrows and process-line segments (along
increasing time) in the space-time diagram that starts at ei and ends at ej .)
For example, in Figure 2.1, e11 → e33 and e33 → e26 .
The relation → denotes flow of information in a distributed computation and
ei → ej dictates that all the information available at ei is potentially
accessible at ej .
For example, in Figure 2.1, event e26 has the knowledge of all other events
shown in the figure.


A Model of Distributed Executions

. . . Causal Precedence Relation


For any two events ei and ej , ei ↛ ej denotes the fact that event ej does not
directly or transitively depend on event ei . That is, event ei does not
causally affect event ej .
In this case, event ej is not aware of the execution of ei or of any event
executed after ei on the same process.
For example, in Figure 2.1, e13 ↛ e33 and e24 ↛ e31 .
Note the following two rules:
For any two events ei and ej , ei ↛ ej ⇏ ej ↛ ei .
For any two events ei and ej , ei → ej ⇒ ej ↛ ei .


A Model of Distributed Executions

Concurrent events
For any two events ei and ej , if ei ↛ ej and ej ↛ ei ,
then events ei and ej are said to be concurrent (denoted as ei ∥ ej ).
In the execution of Figure 2.1, e13 ∥ e33 and e24 ∥ e31 .
The relation ∥ is not transitive; that is, (ei ∥ ej ) ∧ (ej ∥ ek ) ⇏ ei ∥ ek .
For example, in Figure 2.1, e33 ∥ e24 and e24 ∥ e15 , however, e33 ∦ e15 .
For any two events ei and ej in a distributed execution,
ei → ej or ej → ei , or ei ∥ ej .


A Model of Distributed Executions


Logical vs. Physical Concurrency
In a distributed computation, two events are logically concurrent if and only
if they do not causally affect each other.
Physical concurrency, on the other hand, has a connotation that the events
occur at the same instant in physical time.
Two or more events may be logically concurrent even though they do not
occur at the same instant in physical time.
However, if processor speed and message delays would have been different,
the execution of these events could have very well coincided in physical time.
Whether a set of logically concurrent events coincide in the physical time or
not, does not change the outcome of the computation.
Therefore, even though a set of logically concurrent events may not have
occurred at the same instant in physical time, we can assume that these
events occurred at the same instant in physical time.


Models of Communication Networks

There are several models of the service provided by communication networks,


namely, FIFO, Non-FIFO, and causal ordering.
In the FIFO model, each channel acts as a first-in first-out message queue
and thus, message ordering is preserved by a channel.
In the non-FIFO model, a channel acts like a set in which the sender process
adds messages and the receiver process removes messages from it in a
random order.


Models of Communication Networks

The “causal ordering” model is based on Lamport’s “happens before”


relation.
A system that supports the causal ordering model satisfies the following
property:
CO: For any two messages mij and mkj , if send(mij ) −→
send(mkj ), then rec(mij ) −→ rec(mkj ).
This property ensures that causally related messages destined to the same
destination are delivered in an order that is consistent with their causality
relation.
Causally ordered delivery of messages implies FIFO message delivery. (Note
that CO ⊂ FIFO ⊂ Non-FIFO.)
Causal ordering model considerably simplifies the design of distributed
algorithms because it provides a built-in synchronization.
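A small Python check of the CO property (illustrative; the trace representation is an assumption): for messages delivered to the same destination, every causally ordered pair of sends must be received in that same order.

def causally_ordered(recv_pos, hb):
    # recv_pos: message name -> position in the delivery order at the destination.
    # hb: set of pairs (a, b) meaning send(a) happened-before send(b).
    return all(recv_pos[a] < recv_pos[b]
               for (a, b) in hb if a in recv_pos and b in recv_pos)

hb = {("m1", "m2")}                                   # send(m1) -> send(m2)
print(causally_ordered({"m1": 2, "m2": 1}, hb))       # False: CO violated
print(causally_ordered({"m1": 1, "m2": 2}, hb))       # True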


Global State of a Distributed System

“A collection of the local states of its components, namely,


the processes and the communication channels.”
The state of a process is defined by the contents of processor registers,
stacks, local memory, etc. and depends on the local context of the
distributed application.
The state of channel is given by the set of messages in transit in the channel.
The occurrence of events changes the states of respective processes and
channels.
An internal event changes the state of the process at which it occurs.
A send event changes the state of the process that sends the message and
the state of the channel on which the message is sent.
A receive event changes the state of the process that receives the message
and the state of the channel on which the message is received.


. . . Global State of a Distributed System

Notations
LSix denotes the state of process pi after the occurrence of event eix and
before the event eix+1 .
LSi0 denotes the initial state of process pi .
LSix is a result of the execution of all the events executed by process pi till eix .
Let send(m) ≤ LSix denote the fact that ∃y : 1 ≤ y ≤ x :: eiy = send(m).
Let rec(m) ≰ LSix denote the fact that ∀y : 1 ≤ y ≤ x :: eiy ≠ rec(m).


. . . Global State of a Distributed System

A Channel State
The state of a channel depends upon the states of the processes it connects.
Let SCij^{x,y} denote the state of a channel Cij.
The state of a channel is defined as follows:

SCij^{x,y} = { mij | send(mij) ≤ eix ∧ rec(mij) ≰ ejy }

Thus, channel state SCij^{x,y} denotes all messages that pi sent up to event eix and
which process pj had not received until event ejy.


. . . Global State of a Distributed System

Global State
The global state of a distributed system is a collection of the local states of
the processes and the channels.
Notationally, global state GS is defined as

GS = { ∪i LSi^{xi}, ∪j,k SCjk^{yj,zk} }

For a global state to be meaningful, the states of all the components of the
distributed system must be recorded at the same instant.
This will be possible if the local clocks at processes were perfectly
synchronized or if there were a global system clock that can be
instantaneously read by the processes. (However, both are impossible.)


. . . Global State of a Distributed System


A Consistent Global State
Even if the state of all the components is not recorded at the same instant,
such a state will be meaningful provided every message that is recorded as
received is also recorded as sent.
Basic idea is that a state should not violate causality – an effect should not
be present without its cause. A message cannot be received if it was not sent.
Such states are called consistent global states and are meaningful global
states.
Inconsistent global states are not meaningful in the sense that a distributed
system can never be in an inconsistent state.
A global state GS = { ∪i LSi^{xi}, ∪j,k SCjk^{yj,zk} } is a consistent global state iff

∀mij : send(mij) ≰ LSi^{xi}  ⇔  ( mij ∉ SCij^{xi,yj}  ∧  rec(mij) ≰ LSj^{yj} )

That is, channel state SCij^{xi,yj} and process state LSj^{yj} must not include any
message that process pi sent after executing event eixi.
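The consistency condition can be checked mechanically. A minimal Python sketch (illustrative; the representation of local and channel states is an assumption): a recorded global state is consistent iff every message recorded as received or as in transit is also recorded as sent.

def consistent(process_states, channel_states):
    # process_states: pid -> {"sent": set, "received": set} (the recorded LS_i)
    # channel_states: (i, j) -> set of messages recorded in transit on C_ij (SC_ij)
    sent = set().union(*(st["sent"] for st in process_states.values()))
    for in_transit in channel_states.values():
        if not in_transit <= sent:          # in transit but never sent: effect without cause
            return False
    for st in process_states.values():
        if not st["received"] <= sent:      # received but never sent
            return False
    return True

# GS1 of Figure 2.2 is inconsistent: p2 recorded the receipt of m12, p1 not its send.
gs1 = {1: {"sent": set(), "received": set()},
       2: {"sent": set(), "received": {"m12"}}}
print(consistent(gs1, {(1, 2): set()}))     # False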


. . . Global State of a Distributed System


An Example
Consider the distributed execution of Figure 2.2.
Figure 2.2: The space-time diagram of a distributed execution.



. . . Global State of a Distributed System

In Figure 2.2:
A global state GS1 = {LS11 , LS23 , LS33 , LS42 } is inconsistent
because the state of p2 has recorded the receipt of message m12 , however,
the state of p1 has not recorded its send.
A global state GS2 consisting of local states {LS12 , LS24 , LS34 , LS42 }
is consistent; all the channels are empty except C21 that
contains message m21 .


Cuts of a Distributed Computation

“In the space-time diagram of a distributed computation, a cut is a


zigzag line joining one arbitrary point on each process line.”
A cut slices the space-time diagram, and thus the set of events in the
distributed computation, into a PAST and a FUTURE.
The PAST contains all the events to the left of the cut and the FUTURE
contains all the events to the right of the cut.
For a cut C , let PAST(C ) and FUTURE(C ) denote the set of events in the
PAST and FUTURE of C , respectively.
Every cut corresponds to a global state and every global state can be
graphically represented as a cut in the computation’s space-time diagram.
Cuts in a space-time diagram provide a powerful graphical aid in representing
and reasoning about global states of a computation.


. . . Cuts of a Distributed Computation


Figure 2.3: Illustration of cuts in a distributed execution.



. . . Cuts of a Distributed Computation

In a consistent cut, every message received in the PAST of the cut was sent
in the PAST of that cut. (In Figure 2.3, cut C2 is a consistent cut.)
All messages that cross the cut from the PAST to the FUTURE are in transit
in the corresponding consistent global state.
A cut is inconsistent if a message crosses the cut from the FUTURE to the
PAST. (In Figure 2.3, cut C1 is an inconsistent cut.)


Past and Future Cones of an Event

Past Cone of an Event


An event ej could have been affected only by all events ei such that ei → ej .
In this situation, all the information available at ei could be made accessible
at ej .
All such events ei belong to the past of ej .
Let Past(ej ) denote all events in the past of ej in a computation (H, →). Then,
Past(ej ) = {ei |∀ei ∈ H, ei → ej }.

Figure 2.4 (next slide) shows the past of an event ej .


. . . Past and Future Cones of an Event


Figure 2.4: Illustration of past and future cones.



. . . Past and Future Cones of an Event

Let Pasti (ej ) be the set of all those events of Past(ej ) that are on process pi .
Pasti (ej ) is a totally ordered set, ordered by the relation →i , whose maximal
element is denoted by max(Pasti (ej )).
max(Pasti (ej )) is the latest event at process pi that affected event ej (Figure
2.4).


. . . Past and Future Cones of an Event

Let Max Past(ej ) = ∪(∀i) {max(Pasti (ej ))}.
Max Past(ej ) consists of the latest event at every process that affected event
ej and is referred to as the surface of the past cone of ej .
Past(ej ) represents all events on the past light cone that affect ej .
Future Cone of an Event
The future of an event ej , denoted by Future(ej ), contains all events ei that
are causally affected by ej (see Figure 2.4).
In a computation (H, →), Future(ej ) is defined as:
Future(ej ) = {ei |∀ei ∈ H, ej → ei }.


. . . Past and Future Cones of an Event

Define Futurei (ej ) as the set of those events of Future(ej ) that are on process
pi .
define min(Futurei (ej )) as the first event on process pi that is affected by ej .
Define Min Future(ej ) as ∪(∀i) {min(Futurei (ej ))}, which consists of the first
event at every process that is causally affected by event ej .
Min Future(ej ) is referred to as the surface of the future cone of ej .
All events at a process pi that occurred after max(Pasti (ej )) but before
min(Futurei (ej )) are concurrent with ej .
Therefore, all and only those events of computation H that belong to the set
“H − Past(ej ) − Future(ej )” are concurrent with event ej .
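A small Python sketch (illustrative; the event names and edge set are assumptions) that computes Past(e), Future(e) and the set of events concurrent with e from the happens-before edges of a computation:

def closure(start, edges, forward=True):
    # Transitive closure from 'start' over edges (a, b) meaning a -> b.
    reached, frontier = set(), {start}
    while frontier:
        if forward:
            nxt = {b for (a, b) in edges if a in frontier}
        else:
            nxt = {a for (a, b) in edges if b in frontier}
        frontier = nxt - reached
        reached |= frontier
    return reached

events = {"a1", "a2", "b1", "b2", "c1"}
hb = {("a1", "a2"), ("a1", "b1"), ("b1", "b2")}      # happens-before edges

e = "b1"
past = closure(e, hb, forward=False)                 # Past(e)  = {a1}
future = closure(e, hb, forward=True)                # Future(e) = {b2}
concurrent = events - past - future - {e}            # H - Past(e) - Future(e) = {a2, c1}
print(past, future, concurrent)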


Models of Process Communications


There are two basic models of process communications – synchronous and
asynchronous.
The synchronous communication model is a blocking type where on a
message send, the sender process blocks until the message has been received
by the receiver process.
The sender process resumes execution only after it learns that the receiver
process has accepted the message.
Thus, the sender and the receiver processes must synchronize to exchange a
message. On the other hand,
asynchronous communication model is a non-blocking type where the sender
and the receiver do not synchronize to exchange a message.
After having sent a message, the sender process does not wait for the
message to be delivered to the receiver process.
The message is buffered by the system and is delivered to the receiver
process when it is ready to accept the message.
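To make the distinction concrete, a small Python sketch (not from the slides) contrasting a blocking (synchronous) send, which waits until the receiver has accepted the message, with a non-blocking (asynchronous) send, which only buffers it:

import queue, threading

channel, acks = queue.Queue(), queue.Queue()

def receiver():
    msg = channel.get()              # the receiver accepts the message
    acks.put("ack:" + msg)           # and acknowledges it

def synchronous_send(msg):
    channel.put(msg)
    return acks.get()                # block until the receiver has accepted the message

def asynchronous_send(msg):
    channel.put(msg)                 # just buffer the message and return immediately

threading.Thread(target=receiver).start()
print(synchronous_send("m1"))        # returns only after m1 has been accepted
asynchronous_send("m2")              # returns at once; m2 stays buffered in the channel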


. . . Models of Process Communications

Neither of the communication models is superior to the other.


Asynchronous communication provides higher parallelism because the sender
process can execute while the message is in transit to the receiver.
However, a buffer overflow may occur if a process sends a large number of
messages in a burst to another process.
Thus, an implementation of asynchronous communication requires more
complex buffer management.
In addition, due to higher degree of parallelism and non-determinism, it is
much more difficult to design, verify, and implement distributed algorithms
for asynchronous communications.
Synchronous communication is simpler to handle and implement.
However, due to frequent blocking, it is likely to have poor performance and
is likely to be more prone to deadlocks.

MODELS OF DISTRIBUTED SYSTEMS

1. Architectural Models
2. Interaction Models
3. Fault Models

Basic Elements

• Resources in a distributed system are shared between users. They are normally encapsulated within one
of the computers and can be accessed from other computers by communication.
• Each resource is managed by a program, the resource manager; it offers a communication interface
enabling the resource to be accessed by its users.
• Resource managers can be in general modelled as processes. If the system is designed according to an
object-oriented methodology, resources are encapsulated in objects.

Architectural Models

How are responsibilities distributed between system components and how are these components placed?
• Client-server model
• Peer-to-peer
Variations of the above two:
• Proxy server
• Mobile code
• Mobile agents
• Network computers
• Thin clients
• Mobile devices

Client - Server

☞ The system is structured as a set of processes, called servers, that offer services to the users, called
clients.
• The client-server model is usually based on a simple request/reply protocol, implemented with
send/receive primitives or using remote procedure calls (RPC) or remote method invocation (RMI):
- the client sends a request (invocation) message to the server asking for some service;
- the server does the work and returns a result (e.g. the data requested) or an error code if the work
could not be performed.
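A minimal request/reply client-server sketch in Python sockets (illustrative only; the port number and message format are assumptions, and error handling is omitted):

import socket, threading

srv = socket.socket()
srv.bind(("localhost", 5000))                        # an arbitrary local port
srv.listen(1)

def server():
    conn, _ = srv.accept()
    request = conn.recv(1024).decode()               # receive the request message
    conn.sendall(("echo: " + request).encode())      # do the work and return the result
    conn.close(); srv.close()

threading.Thread(target=server).start()

cli = socket.create_connection(("localhost", 5000))  # the client sends a request ...
cli.sendall(b"some service")
print(cli.recv(1024).decode())                       # ... and waits for the reply
cli.close()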


Client - Server (cont'd)

[Figure: clients sending requests to servers and receiving results; legend: request, result, process (object),
computer (node).]
• A server can itself request services from other servers; thus, in this new relation, the server itself acts
like a client.

Peer-to-Peer

☞ All processes (objects) play similar roles.
• Processes (objects) interact without particular distinction between clients and servers.
• The pattern of communication depends on the particular application.
• A large number of data objects are shared; any individual computer holds only a small part of the
application database.
• Processing and communication loads for access to objects are distributed across many computers and
access links.
• This is the most general and flexible model.
[Figure: four interconnected peers.]

Peer-to-Peer (cont'd)

☞ Some problems with client-server:
• Centralisation of service ⇒ poor scaling
- Limitations: capacity of the server, bandwidth of the network connecting the server
☞ Peer-to-peer tries to solve some of the above:
• It distributes shared resources widely and shares computing and communication loads.
☞ Problems with peer-to-peer:
• High complexity due to the need to
- cleverly place individual objects
- retrieve the objects
- maintain a potentially large number of replicas.

Variations of the Basic Models

☞ Client-server and peer-to-peer can be considered as basic models.
• Several variations have been proposed, considering factors such as:
- multiple servers and caches
- mobile code and mobile agents
- low-cost computers at the users' side
- mobile devices


Proxy Server

☞ A proxy server provides copies (replications) of resources which are managed by other servers.
[Figure: a client accesses a proxy server, which forwards requests to the remote server.]
• Proxy servers are typically used as caches for web resources. They maintain a cache of recently visited
web pages or other resources. When a request is issued by a client, the proxy server is first checked to
see if the requested object (information item) is available there.
• Proxy servers can be located at each client, or can be shared by several clients.
• The purpose is to increase performance and availability, by avoiding frequent accesses to remote servers.

Mobile Code

☞ Mobile code: code that is sent from one computer to another and run at the destination.
Advantage: remote invocations are replaced by local ones.
Typical example: Java applets.
[Figure: Step 1 - the client loads the applet code from the server; Step 2 - the client interacts with the
applet locally.]
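A toy caching-proxy sketch in Python (illustrative; fetch_from_origin stands in for an access to the remote server): requests are answered from the local cache when possible, avoiding remote accesses.

def fetch_from_origin(url):
    # Stand-in for an expensive access to the remote server.
    print("remote fetch:", url)
    return "<content of %s>" % url

class ProxyCache:
    def __init__(self):
        self.cache = {}                              # recently fetched objects

    def get(self, url):
        if url not in self.cache:                    # check the proxy first ...
            self.cache[url] = fetch_from_origin(url) # ... go remote only on a miss
        return self.cache[url]

proxy = ProxyCache()
proxy.get("http://example.org/a")                    # remote fetch
proxy.get("http://example.org/a")                    # served from the cache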

Mobile Agents

☞ Mobile agent: a running program that travels from one computer to another carrying out a task on
someone's behalf.
• A mobile agent is a complete program, code + data, that can work (relatively) independently.
• The mobile agent can invoke local resources/data.
Typical tasks:
• Collect information
• Install/maintain software on computers
• Compare prices from various vendors by visiting their sites.
Attention: potential security risk (like mobile code)!

Network Computers

[Figure: network computers connected through a network to servers.]
☞ Network computers do not store the operating system or application code locally. All code is loaded
from the servers and run locally on the network computer.
Advantages:
• The network computer can be simpler, with limited capacity; it does not even need a local hard disk
(if there is one it is used to cache data or code).
• Users can log in from any computer.
• No user effort for software management/administration.


Thin Clients

☞ The thin client is a further step, beyond the network computer:
• Thin clients do not download code (operating system or application) from the server to run it locally.
All code is run on the server, in parallel for several clients.
• The thin client only runs the user interface!
Advantages:
• All those of network computers, but the computer at the user side is even simpler (cheaper).
☞ Strong servers are needed!

Mobile Devices

☞ Mobile devices are hardware computing components that move (together with their software) between
physical locations.
• This is opposed to software agents, which are software components that migrate.
• Both clients and servers can be mobile (clients more frequently).
☞ Particular problems/issues:
• Mobility transparency: clients should not be aware if the server moves (e.g., the server keeps its Internet
address even if it moves between networks).
• Problems due to variable connectivity and bandwidth.
• The device has to explore its environment:
- Spontaneous interoperation: associations between devices (e.g. clients and servers) are dynamically
created and destroyed.
- Context awareness: available services are dependent on the physical environment in which the device
is situated.

Distributed Systems Fö 2/3- 15

Interaction Models

How do we handle time? Are there time limits on process execution, message delivery, and clock drifts?

• Synchronous distributed systems
• Asynchronous distributed systems

Distributed Systems Fö 2/3- 16

Synchronous Distributed Systems

Main features:

• Lower and upper bounds on execution time of processes can be set.
• Transmitted messages are received within a known bounded time.
• Drift rates between local clocks have a known bound.

Important consequences:

1. In a synchronous distributed system there is a notion of global physical time (with a known relative precision depending on the drift rate).
2. Only synchronous distributed systems have a predictable behaviour in terms of timing. Only such systems can be used for hard real-time applications.
3. In a synchronous distributed system it is possible and safe to use timeouts in order to detect failures of a process or communication link.

☞ It is difficult and costly to implement synchronous distributed systems.



Distributed Systems Fö 2/3- 17

Asynchronous Distributed Systems

☞ Many distributed systems (including those on the Internet) are asynchronous.

• No bound on process execution time (nothing can be assumed about speed, load, reliability of computers).
• No bound on message transmission delays (nothing can be assumed about speed, load, reliability of interconnections).
• No bounds on drift rates between local clocks.

Important consequences:

1. In an asynchronous distributed system there is no global physical time. Reasoning can be only in terms of logical time (see lecture on time and state).
2. Asynchronous distributed systems are unpredictable in terms of timing.
3. No timeouts can be used.

Distributed Systems Fö 2/3- 18

Asynchronous Distributed Systems (cont’d)

☞ Asynchronous systems are widely and successfully used in practice.

In practice timeouts are used with asynchronous systems for failure detection. However, additional measures have to be applied in order to avoid duplicated messages, duplicated execution of operations, etc.


Distributed Systems Fö 2/3- 19

Fault Models

What kind of faults can occur and what are their effects?

• Omission faults
• Arbitrary faults
• Timing faults

☞ Faults can occur both in processes and communication channels. The reason can be both software and hardware faults.

☞ Fault models are needed in order to build systems with predictable behaviour in case of faults (systems which are fault tolerant).

☞ Of course, such a system will function according to the predictions, only as long as the real faults behave as defined by the “fault model”. If not .......

☞ These issues will be discussed in some of the following chapters and in particular in the chapter on “Recovery and Fault Tolerance”.

Distributed Systems Fö 2/3- 20

Omission Faults

☞ A processor or communication channel fails to perform actions it is supposed to do. This means that the particular action is not performed!

• We do not have an omission fault if:
  - An action is delayed (regardless how long) but finally executed.
  - An action is executed with an erroneous result.

☞ With synchronous systems, omission faults can be detected by timeouts.

• If we are sure that messages arrive, a timeout will indicate that the sending process has crashed. Such a system has a fail-stop behaviour.



Distributed Systems Fö 2/3- 21

Arbitrary (Byzantine) Faults

☞ This is the most general and worst possible fault semantics. Intended processing steps or communications are omitted or/and unintended ones are executed. Results may not come at all or may come but carry wrong values.

Timing Faults

☞ Timing faults can occur in synchronous distributed systems, where time limits are set on process execution, communications, and clock drifts. A timing fault occurs if any of these time limits is exceeded.

Distributed Systems Fö 2/3- 22

Summary

• Models can be used to provide an abstract and simplified description of certain relevant aspects of distributed systems.

• Architectural models define the way responsibilities are distributed among components and how they are placed in the system. We have studied three architectural models:
  1. Client-server model
  2. Peer-to-peer
  3. Several variations of the two

• Interaction models deal with how time is handled throughout the system. Two interaction models have been introduced:
  1. Synchronous distributed systems
  2. Asynchronous distributed systems

• The fault model specifies what kind of faults can occur and what their effects are. Fault models:
  1. Omission faults
  2. Arbitrary faults
  3. Timing faults


Distributed Systems Fö 2/3- 23

COMMUNICATION IN DISTRIBUTED SYSTEMS

1. Communication System: Layered Implementation
2. Network Protocol
3. Request and Reply Primitives
4. RMI and RPC
5. RMI and RPC Semantics and Failures
6. Group Communication

Distributed Systems Fö 2/3- 24

Communication Models and their Layered Implementation

[Figure: layered view. Applications & Services on top of Middleware (RMI, RPC; Request&Reply), on top of Operating System & Network Protocol, on top of Hardware: Computer & Network.]

• This chapter concentrates on communication between distributed objects by means of two models: remote method invocation (RMI) and remote procedure call (RPC).

• RMI, as well as RPC, are based on request and reply primitives.

• Request and reply are implemented based on the network protocol (e.g. TCP or UDP in case of the Internet).



Distributed Systems Fö 2/3- 25

Network Protocol

☞ Middleware and distributed applications have to be implemented on top of a network protocol. Such a protocol is implemented as several layers.

In case of the Internet:

[Figure: protocol stack. Applications & Services / Middleware / TCP or UDP / IP / lower level layers.]

• TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are both transport protocols implemented on top of the Internet protocol (IP).

Distributed Systems Fö 2/3- 26

Network Protocol (cont’d)

☞ TCP is a reliable protocol.

• TCP guarantees the delivery to the receiving process of all data delivered by the sending process, in the same order.

• TCP implements additional mechanisms on top of IP to meet reliability guarantees:
  - Sequencing: A sequence number is attached to each transmitted segment (packet). At the receiver side, no segment is delivered until all lower-numbered segments have been delivered.
  - Flow control: The sender takes care not to overwhelm the receiver (or intermediate nodes). This is based on periodic acknowledgements received by the sender from the receiver.
  - Retransmission and duplicate handling: If a segment is not acknowledged within a specified timeout, the sender retransmits it. Based on the sequence number, the receiver is able to detect and reject duplicates.
  - Buffering: Buffering is used to balance the flow between sender and receiver. If the receiving buffer is full, incoming segments are dropped. They will not be acknowledged and the sender will retransmit them.
  - Checksum: Each segment carries a checksum. If the received segment doesn’t match the checksum, it is dropped (and will be retransmitted).


Distributed Systems Fö 2/3- 27

Network Protocol (cont’d)

☞ UDP is a protocol that does not guarantee reliable transmission.

• UDP offers no guarantee of delivery. According to the IP, packets may be dropped because of congestion or network error. UDP adds no additional reliability mechanism to this.

• UDP provides a means of transmitting messages with minimal additional costs or transmission delays above those due to IP transmission. Its use is restricted to applications and services that do not require reliable delivery of messages.

• If reliable delivery is requested with UDP, reliability mechanisms have to be implemented at the application level.

Distributed Systems Fö 2/3- 28

Request and Reply Primitives

☞ Communication between processes and objects in a distributed system is performed by message passing.

• In a typical scenario (e.g. client-server model) such a communication is through request and reply messages.
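To make the last point on slide 27 concrete, here is a minimal sketch (my own illustration, not from the slides) of application-level reliability on top of UDP: the client retransmits the request until some reply arrives or it gives up. The address, timeout, and retry count are arbitrary example values.

    # Sketch: application-level retransmission on top of UDP (illustrative only).
    import socket

    SERVER = ("localhost", 9001)      # arbitrary address for the example
    TIMEOUT = 0.5                     # seconds to wait before retransmitting
    MAX_TRIES = 5

    def send_reliably(payload: bytes) -> bytes:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(TIMEOUT)
        for attempt in range(MAX_TRIES):
            sock.sendto(payload, SERVER)          # may be lost: UDP gives no guarantee
            try:
                reply, _ = sock.recvfrom(2048)    # treat any reply as the acknowledgement
                return reply
            except socket.timeout:
                continue                          # retransmit on timeout
        raise RuntimeError("no reply after %d attempts" % MAX_TRIES)

As the later slides on RMI semantics discuss, retransmission alone is not enough: the server must also filter out the duplicate requests this scheme can create.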



Distributed Systems Fö 2/3- 29

Request-Reply Communication in a Client-Server Model

The system is structured as a group of processes (objects), called servers, that deliver services to clients.

[Figure: the client sends a Request to the server and receives a Reply over the network.]

The client:
------------------
send (request) to server_reference;
receive(reply);
------------------

The server:
------------------
receive(request) from client_reference;
execute requested operation
send (reply) to client_reference;
------------------

Distributed Systems Fö 2/3- 30

Remote Method Invocation (RMI) and Remote Procedure Call (RPC)

[Figure: layered view. Applications & Services / Middleware (RMI, RPC; Request&Reply) / Operating System & Network Protocol / Hardware: Computer & Network.]

The goal: make, for the programmer, distributed computing look like centralized computing.

The solution:
- Asking for a service is solved by the client issuing a simple method invocation or procedure call; because the server can be on a remote machine this is a remote invocation (call).
- RMI (RPC) is transparent: the calling object (procedure) is not aware that the called one is executing on a different machine, and vice versa.
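As a concrete counterpart to the request/reply pseudocode on slide 29, the following is a minimal sketch (my own illustration, not part of the original slides) of a client and a server exchanging one request and one reply over TCP sockets in Python. The port number and the trivial "uppercase" service are arbitrary choices for the example.

    # Minimal request-reply exchange over TCP (illustrative only).
    import socket, threading

    HOST, PORT = "localhost", 9000                 # arbitrary values for the example
    srv = socket.create_server((HOST, PORT))       # bind and listen before the client starts

    def server():
        conn, _ = srv.accept()
        with conn:
            request = conn.recv(1024).decode()     # receive(request)
            reply = request.upper()                # execute requested operation
            conn.sendall(reply.encode())           # send(reply) to client_reference

    t = threading.Thread(target=server)
    t.start()

    with socket.create_connection((HOST, PORT)) as sock:   # the client
        sock.sendall(b"hello server")                      # send(request) to server_reference
        print("reply:", sock.recv(1024).decode())          # receive(reply)

    t.join()
    srv.close()

RMI and RPC hide exactly this socket-level exchange behind an ordinary-looking method or procedure call.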


Distributed Systems Fö 2/3- 31

Remote Method Invocation

The client writes:
------------------
server_id.service(values_to_server, result_arguments);
------------------

The server contains the method:

public service(in type1 arg_from_client; out type2 arg_to_client)
{ - - - };

• The programmer is unaware of the request and reply messages which are sent over the network during execution of the RMI.

Distributed Systems Fö 2/3- 32

Implementation of RMI

[Figure: on the client side, Object A invokes the proxy for B, which marshals the arguments; the remote reference module translates between local and remote references and the communication module sends the request over the network. On the server side, the communication module receives the request, the remote reference module maps the remote reference to a local one, and the skeleton for B unmarshals the arguments, invokes Object B, marshals the results and sends the reply back. The proxy unmarshals the reply and returns the answer to Object A.]



Distributed Systems Fö 2/3- 33

Implementation of RMI (cont’d)

Who are the players?

• Object A asks for a service
• Object B delivers the service

Who more?

• The proxy for object B
  - If an object A holds a remote reference to a (remote) object B, there exists a proxy object for B on the machine which hosts A. The proxy is created when the remote object reference is used for the first time. For each method in B there exists a corresponding method in the proxy.
  - The proxy is the local representative of the remote object ⇒ the remote invocation from A to B is initially handled like a local one from A to the proxy for B.
  - At invocation, the corresponding proxy method marshals the arguments and builds the message to be sent, as a request, to the server. After reception of the reply, the proxy unmarshals the received message and sends the results, in an answer, to the invoker.

Distributed Systems Fö 2/3- 34

Implementation of RMI (cont’d)

• The skeleton for object B
  - On the server side, there exists a skeleton object corresponding to a class, if an object of that class can be accessed by RMI. For each method in B there exists a corresponding method in the skeleton.
  - The skeleton receives the request message, unmarshals it and invokes the corresponding method in the remote object; it waits for the result and marshals it into the message to be sent with the reply.
  - A part of the skeleton is also called dispatcher. The dispatcher receives a request from the communication module, identifies the invoked method and directs the request to the corresponding method of the skeleton.


Distributed Systems Fö 2/3- 35

Implementation of RMI (cont’d)

• Communication module
  - The communication modules on the client and server are responsible for carrying out the exchange of messages which implement the request/reply protocol needed to execute the remote invocation.
  - The particular messages exchanged and the way errors are handled depend on the RMI semantics which is implemented (see slide 40).

• Remote reference module
  - The remote reference module translates between local and remote object references. The correspondence between them is recorded in a remote object table.
  - Remote object references are initially obtained by a client from a so-called binder that is part of the global name service (it is not part of the remote reference module). Here servers register their remote objects and clients look up services.

Distributed Systems Fö 2/3- 36

Implementation of RMI (cont’d)

☞ Question 1: What if the two computers use different representations for data (integers, chars, floating point)?

• The most elegant and flexible solution is to have a standard representation used for all values sent through the network; the proxy and skeleton convert to/from this representation during marshalling/unmarshalling.

☞ Question 2: Who generates the classes for proxy and skeleton?

• In advanced middleware systems (e.g. CORBA) the classes for proxies and skeletons can be generated automatically. Given the specification of the server interface and the standard representations, an interface compiler can generate the classes for proxies and skeletons.



Distributed Systems Fö 2/3- 37

Implementation of RMI (cont’d)

☞ Object A and Object B belong to the application.

☞ Remote reference module and communication module belong to the middleware.

☞ The proxy for B and the skeleton for B represent the so called RMI software. They are situated at the border between middleware and application and usually can be generated automatically with help of available tools that are delivered together with the middleware software.

Distributed Systems Fö 2/3- 38

The History of an RMI

1. The calling sequence in the client object activates the method in the proxy corresponding to the invoked method in B.
2. The method in the proxy packs the arguments into a message (marshalling) and forwards it to the communication module.
3. Based on the remote reference obtained from the remote reference module, the communication module initiates the request/reply protocol over the network.
4. The communication module on the server’s machine receives the request. Based on the local reference received from the remote reference module the corresponding method in the skeleton for B is activated.
5. The skeleton method extracts the arguments from the received message (unmarshalling) and activates the corresponding method in the server object B.
6. After receiving the results from B, the method in the skeleton packs them into the message to be sent back (marshalling) and forwards this message to the communication module.
7. The communication module sends the reply, through the network, to the client’s machine.
8. The communication module receives the reply and forwards it to the corresponding method in the proxy.
9. The proxy method extracts the results from the received message (unmarshalling) and forwards them to the client.
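To make the proxy and skeleton roles above concrete, here is a heavily simplified sketch (my own illustration, not the code generated by any real middleware): marshalling is plain JSON, the communication modules and the network are collapsed into a direct call, and the names Proxy, Skeleton, ServerObjectB, and add are invented for the example.

    # Toy proxy/skeleton pair; marshalling is JSON, the "network" is a direct call.
    import json

    class ServerObjectB:                       # the remote object (server side)
        def add(self, a, b):
            return a + b

    class Skeleton:                            # unmarshals requests, invokes B, marshals results
        def __init__(self, obj):
            self.obj = obj
        def handle(self, request_text):
            req = json.loads(request_text)                            # unmarshal
            result = getattr(self.obj, req["method"])(*req["args"])   # dispatch to B
            return json.dumps({"result": result})                     # marshal the reply

    class Proxy:                               # client-side stand-in for B
        def __init__(self, skeleton):
            self.skeleton = skeleton           # stands in for the communication modules
        def add(self, a, b):
            request = json.dumps({"method": "add", "args": [a, b]})   # marshal
            reply = self.skeleton.handle(request)                     # request/reply "over the network"
            return json.loads(reply)["result"]                        # unmarshal

    b_proxy = Proxy(Skeleton(ServerObjectB()))
    print(b_proxy.add(2, 3))                   # the caller sees an ordinary method call

In a real system the proxy and skeleton classes would be generated from the interface specification, and the direct call would be replaced by the request/reply protocol described in the history above.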


Distributed Systems Fö 2/3- 39

Remote Procedure Call

[Figure: the client calls the client stub, which marshals the arguments; the request passes through the remote reference and communication modules, over the network, to the server side, where the server stub unmarshals the arguments and calls the server procedure; the results are marshalled into the reply and unmarshalled by the client stub before the call returns.]

Distributed Systems Fö 2/3- 40

RMI Semantics and Failures

• If everything works OK, RMI behaves exactly like a local invocation. What if certain failures occur?

We consider the following classes of failures which have to be handled by an RMI protocol:
1. Lost request message
2. Lost reply message
3. Server crash
4. Client crash

☞ We will consider an omission failure model. This means:
- Messages are either lost or received correctly.
- Client or server processes either crash or execute correctly. After a crash the server can possibly restart with or without loss of memory.



Distributed Systems Fö 2/3- 41

Lost Request Messages

• The communication module starts a timer when sending the request; if the timer expires before a reply or acknowledgment comes back, the communication module sends the message again.

Problem: what if the request was not truly lost (but, for example, the server is too slow) and the server receives it more than once?

• We have to avoid that the server executes certain operations more than once.
• Messages have to be identified by an identifier and copies of the same message have to be filtered out:
  - If the duplicate arrives and the server has not yet sent the reply ⇒ simply send the reply.
  - If the duplicate arrives after the reply has been sent ⇒ the reply may have been lost or it didn’t arrive in time (see next slide).

Distributed Systems Fö 2/3- 42

Lost Reply Message

The client cannot really distinguish the loss of a request from that of a reply; it simply resends the request because no answer has been received in the right time.

☞ If the reply really got lost, when the duplicate request arrives at the server it already has executed the operation once!

☞ In order to resend the reply the server may need to reexecute the operation in order to get the result. Danger!

• Some operations can be executed more than once without any problem; they are called idempotent operations ⇒ no danger with executing the duplicate request.

• There are operations which cannot be executed repeatedly without changing the effect (e.g. transferring an amount of money between two accounts) ⇒ history can be used to avoid re-execution.

History: the history is a structure which stores a record of reply messages that have been transmitted, together with the message identifier and the client which it has been sent to.
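A minimal sketch (my own, assuming each request carries a unique identifier) of how a server can combine duplicate filtering with a history of replies, so that a resent request is answered from the history instead of re-executing a non-idempotent operation:

    # Sketch: server-side duplicate filtering using a reply history (illustrative only).
    class Server:
        def __init__(self):
            self.history = {}                    # request_id -> reply already sent

        def handle(self, request_id, operation, *args):
            if request_id in self.history:       # duplicate request: resend stored reply,
                return self.history[request_id]  # do NOT re-execute the operation
            reply = operation(*args)             # execute exactly once
            self.history[request_id] = reply     # record the reply for possible retransmission
            return reply

    server = Server()
    transfer = lambda amount: "transferred %d" % amount    # a non-idempotent operation, for the example
    print(server.handle("req-1", transfer, 100))   # executed
    print(server.handle("req-1", transfer, 100))   # duplicate: reply taken from history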


Distributed Systems Fö 2/3- 43

Conclusion with Lost Messages

☞ Based on the previous discussion ⇒ correct, exactly once semantics (see slide 46) can be implemented in the case of lost (request or reply) messages. If all the measures are taken (duplicate filtering and history):
- When, finally, a reply arrives at the client, the call has been executed correctly (exactly one time).
- If no answer arrives at the client (e.g. because of a broken line), an operation has been executed at most one time.

☞ However, the situation is different if we assume that the server can crash.

Distributed Systems Fö 2/3- 44

Server Crash

a) The normal sequence: the server receives the request, executes the operation, and sends the reply.

b) The server crashes after executing the operation but before sending the reply (as a result of the crash, the server doesn’t remember that it has executed the operation): the server receives the request and executes the operation, but no reply is sent.

c) The server crashes before executing the operation: the server receives the request, but no reply is sent.



Distributed Systems Fö 2/3- 45

Server Crash (cont’d)

Big problem! The client cannot distinguish between cases b and c! However they are very different and should be handled in a different way!

What to do if the client noticed that the server is down (it didn’t answer to a certain large number of repeated requests)?

Distributed Systems Fö 2/3- 46

Server Crash (cont’d)

Alternative 1: at least once semantics
• The client’s communication module sends repeated requests and waits until the server reboots or it is rebound to a new machine; when it finally receives a reply, it forwards it to the client.
- When the client got an answer, the RMI has been carried out at least one time, but possibly more.

Alternative 2: at most once semantics
• The client’s communication module gives up and immediately reports the failure to the client (e.g. by raising an exception).
- If the client got an answer, the RMI has been executed exactly once.
- If the client got a failure message, the RMI has been carried out at most one time, but possibly not at all.

Alternative 3: exactly once semantics
• This is what we would like to have (and what we could achieve for lost messages): the RMI has been carried out exactly one time. However this cannot be guaranteed, in general, for the situation of server crashes.


Distributed Systems Fö 2/3- 47

Client Crash

The client sends a request to a server and crashes before the server replies. The computation which is active in the server becomes an orphan - a computation nobody is waiting for.

Problems:
• wasting of CPU time
• locked resources (files, peripherals, etc.)
• if the client reboots and repeats the RMI, confusion can be created.

The solution is based on identification and killing of the orphans.

Distributed Systems Fö 2/3- 48

Conclusion with RMI Semantics and Failures

☞ If the problem of errors is ignored, maybe semantics is achieved for RMI: the client, in general, doesn’t know if the remote method has been executed once, several times or not at all.

☞ If server crashes can be excluded, exactly once semantics is possible to achieve, by using retries, filtering out duplicates, and using history.

☞ If server crashes with loss of memory (case b on slide 44) are considered, only at least once and at most once semantics are achievable in the best case.

In practical applications, servers can survive crashes without loss of memory. In such cases history can be used and duplicates can be filtered out after restart of the server:
• the client repeats sending requests without the danger of operations being executed more than one time (this is different from alternative 2 on slide 46):
  - If no answer is received after a certain number of tries, the client is notified and knows that the method has been executed at most one time or not at all.
  - If an answer is received, it is forwarded to the client, who knows that the method has been executed exactly one time.



Distributed Systems Fö 2/3- 49

Conclusion with RMI Semantics and Failures (cont’d)

☞ RMI semantics is different in different systems. Sometimes several semantics are implemented, among which the user is allowed to select.

☞ And no hope about achieving exactly once semantics if servers crash?! In practice, systems can come close to this goal. Such are transaction-based systems with sophisticated protocols for error recovery.

☞ More discussion in the chapter on fault tolerance.

Distributed Systems Fö 2/3- 50

Group Communication

☞ The assumption with client-server communication and RMI (RPC) is that two parties are involved: the client and the server.

☞ Sometimes, however, communication involves multiple processes, not only two. A solution is to perform separate message passing operations or RMIs to each receiver.

• With group communication a message can be sent to multiple receivers in one operation, called multicast.

Why do we need it?
• Special applications: interest-groups, mail-lists, etc.
• Fault tolerance based on replication: a request is sent to several servers which all execute the same operation (if one fails, the client still will be served).
• Locating a service or object in a distributed system: the client sends a message to all machines but only the one (or those) which holds the server/object responds.
• Replicated data (for reliability or performance): whenever the data changes, the new value has to be multicast to all processes managing replicas.


Distributed Systems Fö 2/3- 51

Group Communication (cont’d)

Essential features:

• Atomicity (all-or-nothing property): when a message is multicast to a group, it will either arrive correctly at all members of the group or at none of them.

• Ordering
  - FIFO ordering: The messages from any one client to a particular server are delivered in the order sent.
  - Totally-ordered multicast: when several messages are transmitted to a group the messages reach all the members of the group in the same order.

[Figure: messages m1 and m2 are multicast to all members of a group P1 ... P7; with total ordering, either each process receives the messages in the order m1, m2 or each receives them in the order m2, m1.]

Distributed Systems Fö 2/3- 52

Summary

• Middleware implements high level communication under the form of Remote Method Invocation (RMI) or Remote Procedure Call (RPC). They are based on request/reply protocols which are implemented using message passing on top of a network protocol (like the Internet).

• Client-server is a very frequently used communication pattern based on a request/reply protocol; it can be implemented using send/receive message passing primitives.

• RMI and RPC are elegant mechanisms to implement client-server systems. Remote access is solved like a local one.

• Basic components to implement RMI are: the proxy object, the skeleton object, the communication module and the remote reference module.

• An essential aspect is RMI semantics in the presence of failures. The goal is to provide exactly once semantics. This cannot be achieved, in general, in the presence of server crashes.

• Client-server communication, in particular RMI, involves exactly two parties. With group communication a message can be sent to multiple receivers.

• Essential features of a group communication facility are: atomicity and ordering.



CMPSCI 677 Operating Systems Spring 2022

Lecture 14: March 21


Lecturer: Prashant Shenoy Scribe: Y. Vayunandhan Reddy

14.1 Overview

This section covers the following topics:

Leader Election: Bully Algorithm, Ring Algorithm, Elections in Wireless Networks


Distributed Synchronization: Centralized, Decentralized, Distributed algorithms
Chubby Lock Service

14.2 Leader Election

Many tasks in distributed systems require one of the processes to act as the coordinator. Election algorithms
are techniques for a distributed system of N processes to elect a coordinator (leader). An example of this is
the Berkeley algorithm for clock synchronization, in which the coordinator has to initiate the synchronization
and tell the processes their offsets. A coordinator can be chosen amongst all processes through leader election.

14.2.1 Bully Algorithm

The bully algorithm is a simple algorithm, in which we enumerate all the processes running in the system
and pick the one with the highest ID as the coordinator. In this algorithm, each process has a unique ID
and every process knows the corresponding ID and IP address of every other process. A process initiates an
election if it just recovered from failure or if the coordinator failed. Any process in the system can initiate
this algorithm for leader election. Thus, we can have concurrent ongoing elections. There are three types of
messages for this algorithm: election, OK and I won. The algorithm is as follows:

1. A process with ID i initiates the election.


2. It sends election messages to all processes with ID > i.
3. Any process upon receiving the election message returns an OK to its predecessor and starts an election
of its own by sending election to higher ID processes.
4. If it receives no OK messages, it knows it is the highest ID process in the system. It thus sends I won
messages to all other processes.
5. If it received OK messages, it knows it is no longer in contention and simply drops out and waits for
an I won message from some other process.
6. Any process that receives I won message treats the sender of that message as coordinator.


Figure 14.1: Depiction of Bully Algorithm

An example of Bully algorithm is given in Figure 14.1. Communication is assumed to be reliable during
leader election. If the communication is unreliable, it may happen that the elected coordinator goes down
after being elected, or a higher ID node comes up after the election process. In the former case, any node
might start an election process after gauging that the coordinator isn’t responding. In the latter case, the
higher ID process asks its neighbors who is the coordinator. It can then either accept the current coordinator
as its own coordinator and continue, or it can start a new election (in which case it will probably be elected as
the new coordinator). This algorithm runs in O(n²) time in the worst case, when the lowest-ID process initiates
the election. The name bully is given to the algorithm because the higher ID processes are bullying the lower
ID processes to drop out of the election.
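The message logic of the bully algorithm can be sketched as follows (my own simulation in Python; a real implementation would use sockets and timeouts instead of direct calls, and the process IDs are arbitrary):

    # Simplified bully election: processes are simulated objects, message passing is a direct call.
    class Process:
        def __init__(self, pid, alive=True):
            self.pid, self.alive, self.coordinator = pid, alive, None

    def election(initiator, processes):
        higher = [p for p in processes if p.pid > initiator.pid and p.alive]
        got_ok = False
        for p in higher:
            got_ok = True                      # p answers OK and starts its own election
            election(p, processes)
        if not got_ok:                         # no higher process alive: initiator wins
            for p in processes:
                if p.alive:
                    p.coordinator = initiator.pid   # "I won" message

    procs = [Process(i) for i in range(8)]
    procs[7].alive = False                     # highest-ID process has crashed
    election(procs[4], processes=procs)        # process 4 notices and starts the election
    print(procs[0].coordinator)                # -> 6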
Question: What happens if 7 has not crashed? Who will it send message to?
Answer : Suppose 7 had not crashed in the example; it would have sent a response when 4 started
the election. 4 would have dropped out, the recursion would have continued, and 7 would finally have been
elected as leader.
Question: Can 7 never initiate the election?
Answer : If 7 is already a leader there is no reason for it to initiate an election.
Question: When does a smaller ID process know it should start an election?
Answer : This is not particularly specified by this algorithm. Ideally this is done when it has not heard from
the coordinator in a while (timeout period).
Question: In the above example what happens if 7 is recovered?
Answer : Any process that is recovered will initiate an election. It will see 6 is the coordinator. In this case
7 will initiate an election and will win.
Question: In the example, how will 7 see 6 is the coordinator (How does a process know who the coordinator
is)?
Answer : Discovering who is the coordinator is not part of the algorithm. This should be implemented
separately (storing it somewhere, broadcasting the message).

Question: What happens when we have a highly dynamic system where processes regularly leave and join
(P2P system)?
Answer : The bully algorithm is not adequate for all kinds of scenarios. If you have a dynamic system, you
might want to take into account the more stable processes (or other metrics) and give them higher ids to
have them win elections.

14.2.2 Ring Algorithm

The ring algorithm is similar to the bully algorithm in the sense that we assume the processes are already
ranked through some metric from 1 to n. However, here a process i only needs to know the IP addresses of
its two neighbors (i+1 and i-1). We want to select the node with the highest id. The algorithm works as
follows:

• Any node can start circulating the election message. Say process i does so. We can choose to go
clockwise or counter-clockwise on the ring. Say we choose clockwise, where i+1 occurs after i.
• Process i then sends an election message to process i+1.
• Anytime a process j ≠ i receives an election message, it piggybacks its own ID (thus declaring that it
is not down) before forwarding the election message to its successor (j+1).
• Once the message circulates through the ring and comes back to the initiator i, process i knows the
list of all nodes that are alive. It simply scans the list and chooses the highest ID.
• It lets all other nodes know about the new coordinator.

Figure 14.2: Depiction of Ring Algorithm

An example of Ring algorithm is given in Figure 14.2. If the neighbor of a process is down, it sequentially
polls each successor (neighbor of neighbor) until it finds a live node. For example, in the figure, when 7 is
down, 6 passes the election message to 0. Another thing to note is that this requires us to enforce a logical
ring topology on the underlying application, i.e. we need to construct a ring topology on top of the whole
system just for leader election.
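A minimal sketch (my own) of one circulation of the election message around the logical ring, skipping nodes that are down:

    # Simplified ring election: the message accumulates the IDs of the live nodes it visits.
    def ring_election(initiator, alive, n):
        """initiator: ID starting the election; alive: set of live IDs; n: ring size."""
        collected = [initiator]
        current = (initiator + 1) % n
        while current != initiator:
            if current in alive:               # dead neighbours are skipped (poll the successor)
                collected.append(current)      # piggyback own ID on the election message
            current = (current + 1) % n
        leader = max(collected)                # initiator scans the list, picks the highest ID
        return leader                          # a second round would announce the coordinator

    alive = {0, 1, 2, 3, 4, 5, 6}              # node 7 is down
    print(ring_election(initiator=6, alive=alive, n=8))   # -> 6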
Question: Does every process need to know about the network topology?
Answer : Every process knows the IDs and IP addresses of the other processes (assumption). There is no real
topology; from this table we can find the neighbours.

Question: If we already know the IDs of all processes why there is a need to go around the ring?
Answer : There can be different kinds of failures, e.g., a process may crash or the network may fail,
partitioning the ring. We do not know how many processes have disconnected from the ring. We need to
actually query and check what the issue is.
Question: How does 6 know that it has to send message to 0 when 7 is down?
Answer : Here we can assume that every node not only has information of its neighbors, but neighbors’
neighbors as well. In general, in a system where we expect a max of k nodes to go down during the time
leader election takes to execute, each node needs to know at least k successive neighbors in either direction.
This is still less than what each node needs to know in Bully algorithm.

14.2.3 Time Complexity

The bully algorithm runs in

• O(n²) in the worst case (this occurs when the node with the lowest ID initiates the election)

• O(n − 2) in the best case (this occurs when the node with the highest ID that is alive initiates the election)

The ring algorithm always takes 2(n−1) messages to execute: the first (n−1) during the election query and
the second (n−1) to announce the result of the election. It is easy to extend the ring algorithm for other metrics like
load, etc.
Question: How do you know a node is not responding?
Answer : If it has actually crashed, then TCP will fail while setting up the socket connection. Otherwise, it
can be a slow machine which is taking time. Distinguishing between a slow process and a failed process is a
classical, non-trivial problem in distributed systems. A timeout is not an ideal solution
but can be used in practice.

14.3 Distributed Synchronization

Every time we wish to access a shared data structure or critical section in a distributed system, we need to
guard it with a lock. A lock is acquired before the data structure is accessed, and once the transaction has
completed, the lock is released. Consider the example below:

Figure 14.3: Example of a race condition in an online store.

In this example, there are two clients sending a buy request to the Online Store Server. The store implements
a thread-pool model. Initially the item count is 3. The correct item count should be 1 after two buy
operations. If locks are not implemented there may be a chance of a race condition, and the item count can end up as 2.
This is because the decrement is not an atomic operation. Each thread needs to read, update and write the
item value. The second thread might read the value while the first thread is updating it (it will read 3),
update it to 2 and save it, which is incorrect. This is an example of a trivial race condition.

14.3.1 Centralized Mutual Exclusion

In this case, locking and unlocking coordination are done by a master process. All processes are numbered
1 to n. We run leader election to pick the coordinator. Now, if any process in the system wants to acquire
a lock, it has to first send a lock acquire request to the coordinator. Once it sends this request, it blocks
execution and awaits reply until it acquires the lock. The coordinator maintains a queue for each data
structure of lock requests. Upon receiving such a request, if the queue is empty, it grants the lock and sends
the message, otherwise it adds the request to the queue. The requester process upon receiving the lock
executes the transaction, and then sends a release message to the coordinator. The coordinator upon receipt
of such a message removes the next request from the corresponding queue and grants that process the lock.
This algorithm is fair and simple.

Figure 14.4: Depiction of centralized mutual exclusion algorithm.

An example of the algorithm is given in Figure 14.4. There are two major issues with this algorithm, related
to failures. When the coordinator process goes down while one of the processes is waiting on a response to a
lock request, it leads to inconsistency. The new coordinator that is elected (or reboots) might not know that
the earlier process is still waiting for a response. This issue can be tackled by persisting the coordinator's
state on disk whenever one of its queues is altered; even if the coordinator crashes, the lock state can be
read back from storage when the process recovers.
The harder problem occurs when one of the client processes crashes while it is holding the lock (during one
of its transactions). In such a case, the coordinator is just waiting for the lock to be released while the other
process has gone down. We cannot use a timeout in this case, because transactions usually take an arbitrary
amount of time to go through. All other processes that are waiting on that lock are also blocked forever.
Even if the coordinator somehow knew that the client process has crashed, it may not always be advisable
to take the lock back forcibly, because the client process may eventually reboot, think it still has the lock, and
continue its transaction. This causes inconsistency. This is a thorny problem which does not have any neat
solution, and it limits the practicality of such a centralized algorithm.
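Setting the failure cases aside, the coordinator's bookkeeping itself is simple: one FIFO queue per lock, grant if free, otherwise enqueue. A minimal sketch (my own illustration, with invented names and no failure handling):

    # Sketch of a centralized lock coordinator (failure handling omitted).
    from collections import defaultdict, deque

    class Coordinator:
        def __init__(self):
            self.holder = {}                       # lock name -> current holder (absent if free)
            self.waiting = defaultdict(deque)      # lock name -> FIFO queue of requesters

        def acquire(self, lock, process):
            if lock not in self.holder:
                self.holder[lock] = process        # lock free: grant immediately
                return True                        # "grant" message
            self.waiting[lock].append(process)     # otherwise queue the request
            return False                           # requester blocks until granted

        def release(self, lock, process):
            assert self.holder[lock] == process
            if self.waiting[lock]:
                self.holder[lock] = self.waiting[lock].popleft()  # grant to the next in queue
                return self.holder[lock]           # next process to notify
            del self.holder[lock]
            return None

    c = Coordinator()
    print(c.acquire("item_count", "P1"))   # True  (granted)
    print(c.acquire("item_count", "P2"))   # False (queued)
    print(c.release("item_count", "P1"))   # P2    (now granted)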

14.3.2 Decentralized Algorithm

Decentralized algorithms use voting to figure out which lock requests to grant. In this scenario, each process
has an extra thread called the coordinator thread which deals with all the incoming locking requests.
Essentially, every process keeps track of who has the lock, and for a new process to acquire a lock, it
has to be granted an OK or go-ahead vote from a strict majority of the processes. Here, majority means
more than half the total number of nodes (live or not) in the system. Thus, if any process wishes to acquire
a lock, it requests it from all other processes, and if the majority of them tell it to acquire the lock, it goes
ahead and does so. The majority guarantees that a lock is not granted twice. Upon the receipt of the vote,
the other processes are also told that the lock has been acquired and thus hold up any other
lock request. Once a process is done with the transaction, it broadcasts to every other process that it has
released the lock.
This solves the problem of coordinator failure because if some nodes go down, we can deal with it so long as
a majority agrees on whether the lock is in use or not. Client crashes are still a problem here.
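A bare-bones sketch (my own simplification, omitting vote release and failure handling) of majority-based granting, where a lock is acquired only if strictly more than half of all nodes vote OK:

    # Sketch: decentralized lock acquisition by strict majority vote (illustrative only).
    class Voter:
        def __init__(self):
            self.granted_to = None                 # who this node believes holds the lock
        def vote(self, requester):
            if self.granted_to is None:            # vote OK only if not granted elsewhere
                self.granted_to = requester
                return True
            return False

    def acquire(requester, voters):
        ok_votes = sum(v.vote(requester) for v in voters)
        return ok_votes > len(voters) // 2         # strict majority of ALL nodes required

    voters = [Voter() for _ in range(5)]
    print(acquire("P1", voters))   # True: P1 collects all 5 votes
    print(acquire("P2", voters))   # False: no node can vote OK for P2 until P1 releases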

14.3.3 Distributed Algorithm

This algorithm, developed by Ricart and Agrawala, needs 2(n − 1) messages and is based on Lamport’s clock
and total ordering of events to decide on granting locks. After the clocks are synchronized, the process that
asked for the lock first gets it. The initiator sends request messages to all n − 1 processes stamped with its
ID and the timestamp of its request. It then waits for replies from all other processes.
Any other process, upon receiving such a request, does one of three things: it sends a reply if it does not want
the lock for itself; it does not reply if it is already in the transaction phase (in which case the initiator has to wait);
or, if it itself wants to acquire the same lock, it compares its own request timestamp with that of
the incoming request. The one with the lower timestamp gets the lock first.

• Process k enters the critical section as follows:

  – Generate a new timestamp TSk := TSk + 1
  – Send request(k, TSk) to all other n−1 processes
  – Wait until reply(j) is received from all other processes
  – Enter the critical section

• Upon receiving a request message, process j

  – Sends a reply if there is no contention
  – If already in the critical section, does not reply and queues the request
  – If it wants to enter as well, compares TSj with TSk and sends a reply if TSk < TSj, else queues the
request (recall: total ordering based on multicast)

This approach is fully decentralized but there are n points of failure, which is worse than the centralized
one.
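The reply rule at a process j that receives request(k, TSk) can be sketched as follows (my own illustration; the dictionary fields are invented names for j's local state):

    # Sketch of the Ricart-Agrawala reply rule at process j (illustrative only).
    def on_request(state_j, ts_k, k):
        """Return True if j replies immediately, False if the request is deferred."""
        if state_j["in_critical_section"]:
            state_j["deferred"].append(k)          # no reply now; queue the request
            return False
        if state_j["wants_lock"]:
            # both want the lock: the earlier timestamp wins, ties broken by process ID
            if (ts_k, k) < (state_j["my_ts"], state_j["my_id"]):
                return True                        # requester is earlier: reply
            state_j["deferred"].append(k)          # j is earlier: defer the reply
            return False
        return True                                # no contention: reply immediately

    state = {"in_critical_section": False, "wants_lock": True,
             "my_ts": 7, "my_id": 2, "deferred": []}
    print(on_request(state, ts_k=5, k=9))   # True: timestamp 5 < 7, so the requester goes first

Deferred replies are sent when j leaves the critical section, which is what lets the earliest requester proceed first.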

14.3.4 Token Ring Algorithm

In the token ring algorithm, the actual topology is not a ring, but for locking purposes there is a logical ring
and processes only talk to neighboring processes. The process holding the token has the lock at that time.
The token is circulated through the logical ring; there is no way to request the token, so a process that wants
the lock simply waits until the token reaches it. Once the process has the token it can enter the
critical section.

This was designed as part of the Token Ring networking protocol. In physical networking, only one node can
transmit at a time; if multiple nodes transmit at once there is a chance of collision. Ethernet handled this
by detecting collisions: a node transmits and, if there is a collision, it backs off and will succeed eventually.
In Token Ring this was handled using locks: only one machine in the network has the lock at any instant,
and only it transmits at that time.
One problem in this algorithm is loss of the token. Regenerating the token is non-trivial, as you cannot use
a timeout strategy.
Question: In a Token Ring, when should you hold the token?
Answer : If a process wants to send data on the network, it waits until it receives the token. It holds
the token until it completes the transmission and then passes the token to the next node. If it does not require
the token, it simply passes it on.
Question: Is the token generated by the process?
Answer : A token is a special message that circulates in the network. It is generated when the system
starts. A process can hold the message or pass it to the next node.
Question: In Token Ring, if the time limit is finished and the process is still in the critical section, what
happens?
Answer : In general network transmission this limits the amount of data you can transmit. But if this is
used for locking, the token can be held for an arbitrary amount of time (based on the critical section). This
may complicate token recovery, because we cannot distinguish whether the token is lost or some process is in a
long critical section.

14.4 Chubby Lock Service

This was developed by Google to provide a service that can manage lots of locks for different subgroups of
applications. Each Chubby cell has a group of 5 machines supporting 10,000 servers (managing the locks of
applications running on them). It was designed for coarse-grained locking with high reliability. One of the
5 machines is maintained outside the data center for recovery.
Chubby uses a distributed lock to elect a leader: the process holding the lock is elected as leader. The 5 processes
in the cell run an internal leader election to elect a primary; the others are lock workers, used
for replication. The applications in the system acquire and release locks using RPC calls. All lock
requests go to the primary. The lock state is kept persisted on disk by all the machines. Chubby uses a
file abstraction for locks: to acquire or release a lock, the corresponding file is locked or unlocked. State information can
also be kept in the file. It supports reader-writer locks. If the primary fails, a new leader election is triggered.

Figure 14.5: Chubby lock service.



Question: What is the purpose of multiple replicas?

Answer : To ensure fault tolerance, so that the lock state doesn’t disappear.
Question: Do the replicas have any purpose until the primary fails?
Answer : They store all the lock information in the database. They work as hot standbys and can take over
with the current state if the primary fails.
Question: Where are the lock files created?
Answer : There is a distributed file system which looks the same on all the machines.
Question: If Chubby is used for leader election, how does it handle coordinator failure?
Answer : It has a notion of a lease: a process can hold a lock for a particular lease period, after which the lease
needs to be renewed. This ensures that the lock is never held by a failed process.
Question: While writing, is it through all nodes or just through the master?
Answer : All client requests are directed to the master. The master sends the operations performed to the others to
keep them in sync.
Chapter 3: Logical Time

Ajay Kshemkalyani and Mukesh Singhal

Distributed Computing: Principles, Algorithms, and Systems

Cambridge University Press


Introduction

The concept of causality between events is fundamental to the design and


analysis of parallel and distributed computing and operating systems.
Usually causality is tracked using physical time.
In distributed systems, it is not possible to have a global physical time.
As asynchronous distributed computations make progress in spurts, the
logical time is sufficient to capture the fundamental monotonicity property
associated with causality in distributed systems.


Introduction

This chapter discusses three ways to implement logical time - scalar time,
vector time, and matrix time.
Causality among events in a distributed system is a powerful concept in
reasoning, analyzing, and drawing inferences about a computation.
The knowledge of the causal precedence relation among the events of
processes helps solve a variety of problems in distributed systems, such as
distributed algorithms design, tracking of dependent events, knowledge about
the progress of a computation, and concurrency measures.


A Framework for a System of Logical Clocks

Definition
A system of logical clocks consists of a time domain T and a logical clock C .
Elements of T form a partially ordered set over a relation <.
Relation < is called the happened before or causal precedence. Intuitively,
this relation is analogous to the earlier than relation provided by the physical
time.
The logical clock C is a function that maps an event e in a distributed
system to an element in the time domain T , denoted as C(e) and called the
timestamp of e, and is defined as follows:
C : H ↦ T
such that the following property is satisfied:
for two events ei and ej , ei → ej =⇒ C(ei) < C(ej).


A Framework for a System of Logical Clocks

This monotonicity property is called the clock consistency condition.


When T and C satisfy the following condition,
for two events ei and ej , ei → ej ⇔ C(ei ) < C(ej )
the system of clocks is said to be strongly consistent.
Implementing Logical Clocks
Implementation of logical clocks requires addressing two issues: data
structures local to every process to represent logical time and a protocol to
update the data structures to ensure the consistency condition.
Each process pi maintains data structures that allow it the following two
capabilities:
◮ A local logical clock, denoted by lci , that helps process pi measure its own
progress.


Implementing Logical Clocks

◮ A logical global clock, denoted by gci , that is a representation of process pi ’s


local view of the logical global time. Typically, lci is a part of gci .
The protocol ensures that a process’s logical clock, and thus its view of the global
time, is managed consistently. The protocol consists of the following two rules:
R1: This rule governs how the local logical clock is updated by a process
when it executes an event.
R2: This rule governs how a process updates its global logical clock to
update its view of the global time and global progress.
Systems of logical clocks differ in their representation of logical time and also
in the protocol to update the logical clocks.


Scalar Time

Proposed by Lamport in 1978 as an attempt to totally order events in a


distributed system.
Time domain is the set of non-negative integers.
The logical local clock of a process pi and its local view of the global time
are squashed into one integer variable Ci .
Rules R1 and R2 to update the clocks are as follows:
R1: Before executing an event (send, receive, or internal), process pi executes
the following:
Ci := Ci + d (d > 0)
In general, every time R1 is executed, d can have a different value; however,
typically d is kept at 1.


Scalar Time

R2: Each message piggybacks the clock value of its sender at sending time.
When a process pi receives a message with timestamp Cmsg , it executes the
following actions:
◮ Ci := max(Ci , Cmsg )
◮ Execute R1.
◮ Deliver the message.
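Rules R1 and R2 for scalar time translate almost directly into code; a minimal sketch (my own, with d = 1):

    # Minimal Lamport scalar clock (d = 1), following rules R1 and R2.
    class ScalarClock:
        def __init__(self):
            self.c = 0

        def event(self):                 # R1: executed before any event
            self.c += 1
            return self.c

        def send(self):                  # R1, then piggyback the clock value on the message
            return self.event()

        def receive(self, c_msg):        # R2: take the max, then execute R1, then deliver
            self.c = max(self.c, c_msg)
            return self.event()

    p1, p2 = ScalarClock(), ScalarClock()
    ts = p1.send()                       # p1's clock becomes 1
    print(p2.receive(ts))                # p2's clock becomes max(0, 1) + 1 = 2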
Figure 3.1 shows evolution of scalar time.


Scalar Time
Evolution of scalar time:

[Figure omitted: scalar clock values attached to the events of processes p1, p2, and p3; the event labelled b occurs on p3.]

Figure 3.1: The space-time diagram of a distributed execution.


Basic Properties

Consistency Property
Scalar clocks satisfy the monotonicity and hence the consistency property:
for two events ei and ej , ei → ej =⇒ C(ei ) < C(ej ).
Total Ordering
Scalar clocks can be used to totally order events in a distributed system.
The main problem in totally ordering events is that two or more events at
different processes may have identical timestamp.
For example in Figure 3.1, the third event of process P1 and the second event
of process P2 have identical scalar timestamp.


Total Ordering

A tie-breaking mechanism is needed to order such events. A tie is broken as


follows:
Process identifiers are linearly ordered and tie among events with identical
scalar timestamp is broken on the basis of their process identifiers.
The lower the process identifier in the ranking, the higher the priority.
The timestamp of an event is denoted by a tuple (t, i) where t is its time of
occurrence and i is the identity of the process where it occurred.
The total order relation ≺ on two events x and y with timestamps (h,i) and
(k,j), respectively, is defined as follows:

x ≺ y ⇔ (h < k or (h = k and i < j))


Properties. . .

Event counting
If the increment value d is always 1, the scalar time has the following
interesting property: if event e has a timestamp h, then h-1 represents the
minimum logical duration, counted in units of events, required before
producing the event e;
We call it the height of the event e.
In other words, h-1 events have been produced sequentially before the event e
regardless of the processes that produced these events.
For example, in Figure 3.1, five events precede event b on the longest causal
path ending at b.


Properties. . .

No Strong Consistency
The system of scalar clocks is not strongly consistent; that is, for two events
ei and ej , C(ei) < C(ej) ⇏ ei → ej .
For example, in Figure 3.1, the third event of process P1 has a smaller scalar
timestamp than the third event of process P2 . However, the former did not
happen before the latter.
The reason that scalar clocks are not strongly consistent is that the logical
local clock and logical global clock of a process are squashed into one,
resulting in the loss of causal dependency information among events at different
processes.
For example, in Figure 3.1, when process P2 receives the first message from
process P1 , it updates its clock to 3, forgetting that the timestamp of the
latest event at P1 on which it depends is 2.


Vector Time

The system of vector clocks was developed independently by Fidge, Mattern


and Schmuck.
In the system of vector clocks, the time domain is represented by a set of
n-dimensional non-negative integer vectors.
Each process pi maintains a vector vti [1..n], where vti [i] is the local logical
clock of pi and describes the logical time progress at process pi .
vti [j] represents process pi ’s latest knowledge of process pj ’s local time.
If vti [j]=x, then process pi knows that local time at process pj has
progressed till x.
The entire vector vti constitutes pi ’s view of the global logical time and is
used to timestamp events.


Vector Time

Process pi uses the following two rules R1 and R2 to update its clock:
R1: Before executing an event, process pi updates its local logical time as
follows:
vti [i] := vti [i] + d (d > 0)
R2: Each message m is piggybacked with the vector clock vt of the sender
process at sending time. On the receipt of such a message (m,vt), process pi
executes the following sequence of actions:
◮ Update its global logical time as follows:

1 ≤ k ≤ n : vti [k] := max(vti [k], vt[k])

◮ Execute R1.
◮ Deliver the message m.
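The two rules translate directly into code; a minimal sketch of a vector clock for process pi among n processes (my own, with d = 1):

    # Minimal vector clock for process i among n processes (d = 1).
    class VectorClock:
        def __init__(self, i, n):
            self.i, self.vt = i, [0] * n

        def event(self):                         # R1: tick own component
            self.vt[self.i] += 1
            return list(self.vt)

        def send(self):                          # R1, then piggyback the whole vector
            return self.event()

        def receive(self, vt_msg):               # R2: componentwise max, then R1, then deliver
            self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
            return self.event()

    p1, p2 = VectorClock(0, 3), VectorClock(1, 3)
    m = p1.send()                                # p1: [1, 0, 0]
    print(p2.receive(m))                         # p2: [1, 1, 0]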


Vector Time

The timestamp of an event is the value of the vector clock of its process
when the event is executed.
Figure 3.2 shows an example of vector clocks progress with the increment
value d=1.
Initially, a vector clock is [0, 0, 0, ...., 0].


Vector Time
An Example of Vector Clocks

[Figure omitted: vector timestamps attached to the events of processes p1, p2, and p3, with increment value d = 1.]

Figure 3.2: Evolution of vector time.


Vector Time
Comparing Vector Timestamps

The following relations are defined to compare two vector timestamps, vh


and vk:

vh = vk ⇔ ∀x : vh[x] = vk[x]
vh ≤ vk ⇔ ∀x : vh[x] ≤ vk[x]
vh < vk ⇔ vh ≤ vk and ∃x : vh[x] < vk[x]
vh ∥ vk ⇔ ¬(vh < vk) ∧ ¬(vk < vh)

If the process at which an event occurred is known, the test to compare two
timestamps can be simplified as follows: If events x and y respectively
occurred at processes pi and pj and are assigned timestamps vh and vk,
respectively, then

x → y ⇔ vh[i] ≤ vk[i]
x ∥ y ⇔ vh[i] > vk[i] ∧ vh[j] < vk[j]
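These relations are straightforward to express in code; a small sketch (my own) for vector timestamps represented as lists of equal length:

    # Comparison of vector timestamps, following the relations above.
    def leq(vh, vk):  return all(a <= b for a, b in zip(vh, vk))
    def less(vh, vk): return leq(vh, vk) and any(a < b for a, b in zip(vh, vk))
    def concurrent(vh, vk): return not less(vh, vk) and not less(vk, vh)

    vh, vk = [2, 0, 1], [3, 1, 1]
    print(less(vh, vk))                       # True: the first event causally precedes the second
    print(concurrent([2, 0, 0], [0, 1, 0]))   # True: neither dominates, so the events are concurrent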


Vector Time
Properties of Vector Time

Isomorphism
If events in a distributed system are timestamped using a system of vector
clocks, we have the following property.
If two events x and y have timestamps vh and vk, respectively, then

x → y ⇔ vh < vk
x ∥ y ⇔ vh ∥ vk.

Thus, there is an isomorphism between the set of partially ordered events


produced by a distributed computation and their vector timestamps.


Vector Time
Strong Consistency
The system of vector clocks is strongly consistent; thus, by examining the
vector timestamp of two events, we can determine if the events are causally
related.
However, Charron-Bost showed that the dimension of vector clocks cannot be
less than n, the total number of processes in the distributed computation, for
this property to hold.
Event Counting
If d=1 (in rule R1), then the i th component of vector clock at process pi ,
vti [i], denotes the number of events that have occurred at pi until that
instant.
So, if an event e has timestamp vh, vh[j] denotes the number of events
executed by process pj that causally precede e. Clearly, Σj vh[j] − 1
represents the total number of events that causally precede e in the
distributed computation.
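
With d = 1, this count follows directly from the timestamp; a tiny sketch in Python (illustrative):

def causal_predecessors(vh):
    # total number of events that causally precede the event e with timestamp vh;
    # the -1 excludes e itself, which is counted in vh at its own process
    return sum(vh) - 1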


Efficient Implementations of Vector Clocks

If the number of processes in a distributed computation is large, then vector clocks will require piggybacking a huge amount of information in messages.
The message overhead grows linearly with the number of processes in the system; when there are thousands of processes, the message size becomes huge even if only a few events occur at a few processes.
We discuss an efficient way to maintain vector clocks.
Charron-Bost showed that if vector clocks have to satisfy the strong
consistency property, then in general vector timestamps must be at least of
size n, the total number of processes.
However, optimizations are possible, and next we discuss a technique to
implement vector clocks efficiently.


Singhal-Kshemkalyani’s Differential Technique

Singhal-Kshemkalyani’s differential technique is based on the observation that, between successive message sends to the same process, only a few entries of the vector clock at the sender process are likely to change.
When a process pi sends a message to a process pj , it piggybacks only those
entries of its vector clock that differ since the last message sent to pj .
If entries i1 , i2 , . . . , in1 of the vector clock at pi have changed to
v1 , v2 , . . . , vn1 , respectively, since the last message sent to pj , then process pi
piggybacks a compressed timestamp of the form:
{(i1 , v1 ), (i2 , v2 ), . . . , (in1 , vn1 )}
to the next message to pj .


When pj receives this message, it updates its vector clock as follows:

vtj [ik ] = max(vtj [ik ], vk ) for k = 1, 2, . . . , n1 .

Thus this technique cuts down the message size, communication bandwidth and buffer (to store messages) requirements.
In the worst case, every element of the vector clock has been updated at pi since the last message to process pj , and the next message from pi to pj will need to carry the entire vector timestamp of size n.
However, on average, the size of the timestamp on a message will be less than n.



Implementation of this technique requires each process to remember the
vector timestamp in the message last sent to every other process.
Direct implementation of this will result in O(n^2) storage overhead at each
process.
Singhal and Kshemkalyani developed a clever technique that cuts down this
storage overhead at each process to O(n). The technique works in the
following manner:
Process pi maintains the following two additional vectors:
◮ LSi [1..n] (‘Last Sent’):
LSi [j] indicates the value of vti [i] when process pi last sent a message to
process pj .
◮ LUi [1..n] (‘Last Update’):
LUi [j] indicates the value of vti [i] when process pi last updated the entry vti [j].
Clearly, LUi [i] = vti [i] at all times and LUi [j] needs to be updated only when
the receipt of a message causes pi to update entry vti [j]. Also, LSi [j] needs
to be updated only when pi sends a message to pj .


Since the last communication from pi to pj , only those elements vti [k] of the vector clock have changed for which LSi [j] < LUi [k] holds.
Hence, only these elements need to be sent in a message from pi to pj .
When pi sends a message to pj , it sends only a set of tuples
{(x, vti [x])|LSi [j] < LUi [x]}
as the vector timestamp to pj , instead of sending a vector of n entries in a
message.
Thus the entire vector of size n is not sent along with a message. Instead,
only the elements in the vector clock that have changed since the last
message sent to that process are sent in the format
{(p1 , latest value), (p2 , latest value), . . .}, where pi indicates that the pi -th
component of the vector clock has changed.
This technique requires that the communication channels follow FIFO
discipline for message delivery.
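
A sketch of the differential send and receive using LS and LU in Python (illustrative names; FIFO channels and d = 1 assumed, as required by the technique):

class DiffVectorClock:
    def __init__(self, n, i):
        self.i = i
        self.vt = [0] * n          # vector clock
        self.LS = [0] * n          # LS[j]: value of vt[i] at the last send to pj
        self.LU = [0] * n          # LU[k]: value of vt[i] when vt[k] was last updated

    def tick(self):
        self.vt[self.i] += 1
        self.LU[self.i] = self.vt[self.i]

    def send_to(self, j):
        self.tick()
        # piggyback only the entries that changed since the last message to pj
        diff = [(k, self.vt[k]) for k in range(len(self.vt)) if self.LS[j] < self.LU[k]]
        self.LS[j] = self.vt[self.i]
        return diff                # e.g., [(3, 2)] instead of a full n-entry vector

    def receive(self, diff):
        changed = [k for k, v in diff if v > self.vt[k]]
        for k, v in diff:
            self.vt[k] = max(self.vt[k], v)
        self.tick()                # the delivery is itself an event (R1)
        for k in changed:
            self.LU[k] = self.vt[self.i]

The per-process storage is then only the two O(n) vectors LS and LU, instead of remembering a full n-entry timestamp for each of the n destinations.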


This method is illustrated in Figure 3.3. For instance, the second message
from p3 to p2 (which contains a timestamp {(3, 2)}) informs p2 that the third
component of the vector clock has been modified and the new value is 2.
This is because the process p3 (indicated by the third component of the
vector) has advanced its clock value from 1 to 2 since the last message sent
to p2 .
This technique substantially reduces the cost of maintaining vector clocks in
large systems, especially if the process interactions exhibit temporal or spatial
localities.


[Figure: four processes p1–p4; each message carries a compressed timestamp such as {(1,1)}, {(3,1)}, {(3,2)}, {(3,4),(4,1)}, or {(4,1)} instead of the full vector.]
Figure 3.3: Vector clocks progress in Singhal-Kshemkalyani technique.


Matrix Time

In a system of matrix clocks, the time is represented by a set of n × n matrices of non-negative integers.
A process pi maintains a matrix mti [1..n, 1..n] where,
mti [i, i] denotes the local logical clock of pi and tracks the progress of the
computation at process pi .
mti [i, j] denotes the latest knowledge that process pi has about the local
logical clock, mtj [j, j], of process pj .
mti [j, k] represents the knowledge that process pi has about the latest
knowledge that pj has about the local logical clock, mtk [k, k], of pk .
The entire matrix mti denotes pi ’s local view of the global logical time.

Process pi uses the following rules R1 and R2 to update its clock:
R1 : Before executing an event, process pi updates its local logical time as
follows:
mti [i, i] := mti [i, i] + d (d > 0)
R2: Each message m is piggybacked with matrix time mt. When pi receives
such a message (m,mt) from a process pj , pi executes the following sequence
of actions:
◮ Update its global logical time as follows:

(a) 1 ≤ k ≤ n : mti [i, k] := max(mti [i, k], mt[j, k])

(That is, update its row mti [i, ∗] with the pj ’s row in the received timestamp,
mt.)
(b) 1 ≤ k, l ≤ n : mti [k, l] := max(mti [k, l], mt[k, l])
◮ Execute R1.
◮ Deliver message m.
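
A minimal sketch of rules R1 and R2 for matrix clocks in Python (illustrative; d = 1 assumed):

class MatrixClock:
    def __init__(self, n, i):
        self.n, self.i = n, i
        self.mt = [[0] * n for _ in range(n)]

    def tick(self):                          # R1
        self.mt[self.i][self.i] += 1

    def send(self):                          # piggyback a copy of the matrix
        self.tick()
        return [row[:] for row in self.mt]

    def receive(self, j, mt_msg):            # R2: on receipt of (m, mt) from pj
        # (a) merge pj's row into pi's own row
        for k in range(self.n):
            self.mt[self.i][k] = max(self.mt[self.i][k], mt_msg[j][k])
        # (b) component-wise maximum over the entire matrix
        for k in range(self.n):
            for l in range(self.n):
                self.mt[k][l] = max(self.mt[k][l], mt_msg[k][l])
        self.tick()                          # then R1, then deliver m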


Figure 3.4 gives an example to illustrate how matrix clocks progress in a distributed computation. We assume d=1.
Let us consider the following events: e, which is the xi -th event at process pi ;
ek1 and ek2 , which are the xk1 -th and xk2 -th events at process pk ; and ej1 and ej2 ,
which are the xj1 -th and xj2 -th events at pj .
Let mte denote the matrix timestamp associated with event e. Due to
message m4 , ek2 is the last event of pk that causally precedes e, therefore, we
have mte [i, k]=mte [k, k]=xk2.
Likewise, mte [i, j]=mte [j, j]=xj2 . The last event of pk known by pj , to the
knowledge of pi when it executed event e, is ek1 ; therefore, mte [j, k]=xk1.
Likewise, we have mte [k, j]=xj1 .


[Figure: processes pi, pj and pk exchange messages m1–m4; event e at pi receives the matrix timestamp mte described above.]
Figure 3.4: Evolution of matrix time.


Basic Properties
Vector mti [i, .] contains all the properties of vector clocks.
In addition, matrix clocks have the following property:
mink (mti [k, l]) ≥ t ⇒ process pi knows that every other process pk knows
that pl ’s local time has progressed till t.
◮ If this is true, it is clear that process pi knows that all other processes know
that pl will never send information with a local time ≤ t.
◮ In many applications, this implies that processes will no longer require from pl
certain information and can use this fact to discard obsolete information.
If d is always 1 in the rule R1, then mti [k, l] denotes the number of events that have
occurred at pl and are known to pk , as far as pi ’s knowledge is concerned.
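
A small sketch of how this property can be used to discard obsolete information about a process pl (illustrative):

def everyone_knows(mt, l, t):
    # True if, according to pi's matrix mt, every process pk has seen
    # pl's local clock progress to at least t
    return min(mt[k][l] for k in range(len(mt))) >= t

# If this holds, pi will never again receive information from pl with a local
# time <= t, so state kept on behalf of pl that is older than t can be discarded.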


Virtual Time

The virtual time system is a paradigm for organizing and synchronizing distributed systems.
This section provides a description of virtual time and its implementation
using the Time Warp mechanism.
The implementation of virtual time using Time Warp mechanism works on
the basis of an optimistic assumption.
Time Warp relies on the general lookahead-rollback mechanism where each
process executes without regard to other processes having synchronization
conflicts.


If a conflict is discovered, the offending processes are rolled back to the time
just before the conflict and executed forward along the revised path.
Detection of conflicts and rollbacks are transparent to users.
The implementation of virtual time using the Time Warp mechanism makes the
following optimistic assumption: synchronization conflicts, and thus rollbacks,
generally occur rarely.
Next, we discuss virtual time in detail and how the Time Warp mechanism is
used to implement it.


Virtual Time Definition

“Virtual time is a global, one-dimensional, temporal coordinate system on a distributed computation to measure the computational progress and to define synchronization.”

A virtual time system is a distributed system executing in coordination with an imaginary virtual clock that uses virtual time.
Virtual times are real values that are totally ordered by the less-than relation, “<”.
Virtual time is implemented as a collection of several loosely synchronized local virtual clocks.
These local virtual clocks move forward to higher virtual times; however, occasionally they move backwards.


Processes run concurrently and communicate with each other by exchanging messages.
Every message is characterized by four values:
a) Name of the sender
b) Virtual send time
c) Name of the receiver
d) Virtual receive time
Virtual send time is the virtual time at the sender when the message is sent,
whereas virtual receive time specifies the virtual time when the message must
be received (and processed) by the receiver.


A problem arises when a message arrives at a process late, that is, the virtual
receive time of the message is less than the local virtual time at the receiver
process when the message arrives.
Virtual time systems are subject to two semantic rules similar to Lamport’s
clock conditions:
◮ Rule 1: Virtual send time of each message < virtual receive time of that
message.
◮ Rule 2: Virtual time of each event in a process < Virtual time of next event in
that process.
The above two rules imply that a process sends all messages in increasing
order of virtual send time and a process receives (and processes) all messages
in the increasing order of virtual receive time.


Causality of events is an important concept in distributed systems and is also a major constraint in the implementation of virtual time.
It is important that an event that causes another be completely executed
before the caused event can be processed.
The constraint in the implementation of virtual time can be stated as follows:
“If an event A causes event B, then the execution of A and B must be
scheduled in real time so that A is completed before B starts”.


If event A has an earlier virtual time than event B but there is no causal
chain from A to B, then it is not necessary to execute A before B.
Better performance can then be achieved by scheduling A concurrently with B,
or even by scheduling A after B.
If A and B have exactly the same virtual time coordinate, then there is no
restriction on the order of their scheduling.
If A and B are distinct events, they will have different virtual space
coordinates (since they occur at different processes) and neither will be a
cause for the other.
To sum it up, events with virtual time < ‘t’ complete before the starting of
events at time ‘t’ and events with virtual time > ‘t’ will start only after
events at time ‘t’ are complete.


Characteristics of Virtual Time


1 Virtual time systems are not all isomorphic; they may be either discrete or
continuous.
2 Virtual time may be only partially ordered.
3 Virtual time may be related to real time or may be independent of it.
4 Virtual time systems may be visible to programmers and manipulated
explicitly as values, or hidden and manipulated implicitly according to some
system-defined discipline.
5 Virtual times associated with events may be explicitly calculated by user
programs or they may be assigned by fixed rules.


Comparison with Lamport’s Logical Clocks


In Lamport’s logical clocks, an artificial clock is created, one for each process,
with unique labels from a totally ordered set in a manner consistent with the
partial order.
In virtual time, the reverse of the above is done by assuming that every event
is labeled with a clock value from a totally ordered virtual time scale
satisfying Lamport’s clock conditions.
Thus the Time Warp mechanism is an inverse of Lamport’s scheme.
In Lamport’s scheme, all clocks are conservatively maintained so that they
never violate causality.
A process advances its clock as soon as it learns of a new causal dependency.
In virtual time, clocks are optimistically advanced and corrective actions are
taken whenever a violation is detected.
Lamport’s initial idea brought about the concept of virtual time but the
model failed to preserve causal independence.


Time Warp Mechanism


In the implementation of virtual time using the Time Warp mechanism, the
virtual receive time of a message is considered as its timestamp.
The necessary and sufficient condition for the correct implementation of
virtual time is that each process must handle incoming messages in
timestamp order.
This is highly undesirable and restrictive because process speeds and message
delays are likely to be highly variable.
It is natural for some processes to get ahead of other processes in virtual time.



It is impossible for a process on the basis of local information alone to block
and wait for the message with the next timestamp.
It is always possible that a message with an earlier timestamp arrives later.
So, when a process executes a message, it is very difficult for it to determine
whether a message with an earlier timestamp will arrive later.
This is the central problem in virtual time that is solved by the Time Warp
mechanism.
The Time Warp mechanism assumes that message communication is reliable,
but messages may not be delivered in FIFO order.



The Time Warp mechanism consists of two major parts: the local control mechanism
and the global control mechanism.
The local control mechanism ensures that events are executed and messages
are processed in the correct order.
The global control mechanism takes care of global issues such as global
progress, termination detection, I/O error handling, flow control, etc.


The Local Control Mechanism

There is no global virtual clock variable in this implementation; each process has a local virtual clock variable.
The local virtual clock of a process doesn’t change during an event at that
process but it changes only between events.
On processing the next message from the input queue, the process
advances its local clock to the timestamp of that message.
At any instant, the value of virtual time may differ for each process but the
value is transparent to other processes in the system.


When a message is sent, the virtual send time is copied from the sender’s
virtual clock while the name of the receiver and virtual receive time are
assigned based on application specific context.
All arriving messages at a process are stored in an input queue in the
increasing order of timestamps (receive times).
Processes will receive late messages due to factors such as different
computation rates of processes and network delays.
The semantics of virtual time demands that incoming messages be received
by each process strictly in the timestamp order.


This is accomplished as follows:


“On the reception of a late message, the receiver rolls back to an earlier
virtual time, cancelling all intermediate side effects and then executes forward
again by executing the late message in the proper sequence.”
If all the messages in the input queue of a process are processed, the state of
the process is said to terminate and its clock is set to +∞.
However, the process is not destroyed, as a late message may arrive and cause
it to roll back and execute again.
Thus, each process is doing a constant “lookahead”, processing future
messages from its input queue.


Over the length of a computation, each process may roll back several times while
generally progressing forward, with rollbacks completely transparent to other
processes in the system.
Rollback in a distributed system is complicated: A process that wants to
rollback might have sent many messages to other processes, which in turn
might have sent many messages to other processes, and so on, leading to
deep side effects.
For rollback, messages must be effectively “unsent” and their side effects
should be undone. This is achieved efficiently by using antimessages.



Antimessages and the Rollback Mechanism
Runtime representation of a process is composed of the following:
Process name: Virtual space coordinate, which is unique in the system.
Local virtual clock: Virtual time coordinate.
State: Data space of the process, including the execution stack, program counter
and its own variables.
State queue: Contains saved copies of the process’s recent states, since rollback
with the Time Warp mechanism requires the state of the process to be saved.
Input queue: Contains all recently arrived messages in order of virtual
receive time. Processed messages from the input queue are not deleted, as
they are saved in the output queue with a negative sign (antimessage) to
facilitate future rollbacks.
Output queue: Contains negative copies of messages the process has
recently sent in virtual send time order. They are needed in case of a rollback.
For every message, there exists an antimessage that is the same in content but
opposite in sign.


Whenever a process sends a message, a copy of the message is transmitted to the receiver’s input queue and a negative copy (antimessage) is retained in the sender’s output queue for use in sender rollback.
Whenever a message and its antimessage appear in the same queue no
matter in which order they arrived, they immediately annihilate each other
resulting in shortening of the queue by one message.
When a message arrives at the input queue of a process with timestamp
greater than virtual clock time of its destination process, it is simply
enqueued.
When the destination process’ virtual time is greater than the virtual time of
message received, the process must do a rollback.



Rollback Mechanism

Search the “State queue” for the last saved state with a timestamp that is less
than the timestamp of the received message and restore it.
Make the timestamp of the received message the value of the local virtual
clock and discard from the state queue all states saved after this time. Then
resume execution forward from this point.
Now all the messages that are sent between the current state and earlier
state must be “unsent”. This is taken care of by executing a simple rule:
“To unsend a message, simply transmit its antimessage.”
This results in antimessages following the positive ones to the destination. A
negative message causes a rollback at its destination if its virtual receive
time is less than the receiver’s virtual time.
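
A condensed sketch of the rollback step with antimessages in Python (illustrative data structures; annihilation inside queues and the global control mechanism are omitted):

from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    send_time: int        # virtual send time
    recv_time: int        # virtual receive time (the timestamp)
    body: object
    sign: int = +1        # +1 = positive message, -1 = antimessage

def antimessage(m):
    return Message(m.sender, m.send_time, m.recv_time, m.body, -m.sign)

class TimeWarpProcess:
    def __init__(self, name):
        self.name = name
        self.lvt = 0                    # local virtual clock
        self.state_queue = [(0, None)]  # (virtual time, saved state snapshot)
        self.output_queue = []          # antimessages of all messages sent

    def record_send(self, m):
        self.output_queue.append(antimessage(m))

    def rollback(self, straggler):
        # restore the last state saved before the straggler's timestamp
        ts, state = max((s for s in self.state_queue if s[0] < straggler.recv_time),
                        key=lambda s: s[0])
        self.state_queue = [s for s in self.state_queue if s[0] <= ts]
        self.lvt = straggler.recv_time
        # "unsend" messages sent after the restored time by transmitting antimessages
        to_send = [a for a in self.output_queue if a.send_time > ts]
        self.output_queue = [a for a in self.output_queue if a.send_time <= ts]
        return state, to_send           # caller resumes forward execution from state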


Depending on the timing, there are several possibilities at the receiver’s end:
First, the original (positive) message has arrived but not yet been processed
at the receiver.
In this case, the negative message causes no rollback, however, it annihilates
with the positive message leaving the receiver with no record of that message.
Second, the original positive message has already been partially or completely
processed by the receiver.
In this case, the negative message causes the receiver to roll back to a virtual
time when the positive message was received.
It will also annihilate the positive message, leaving the receiver with no record
that the message existed. When the receiver executes again, the execution
will assume that these messages never existed.
A rolled back process may send antimessages to other processes.


A negative message can also arrive at the destination before the positive one.
In this case, it is enqueued and will be annihilated when the positive message
arrives.
If it is the negative message’s turn to be executed at a process’s input queue, the
receiver may take any action, e.g., a no-op.
Any action taken will eventually be rolled back when the corresponding
positive message arrives.
An optimization would be to skip the antimessage from the input queue and
treat it as a no-op, and when the corresponding positive message arrives, it
will annihilate the negative message, and inhibit any rollback.


The antimessage protocol has several advantages:


It is extremely robust and works under all possible circumstances.
It is free from deadlocks as there is no blocking.
It is also free from domino effects.
In the worst case, all processes in the system roll back to the same virtual time as
the original one did and then proceed forward again.


Physical Clock Synchronization: NTP

Motivation
In centralized systems, there is only a single clock. A process gets the time by
simply issuing a system call to the kernel.
In distributed systems, there is no global clock or common memory. Each
processor has its own internal clock and its own notion of time.
These clocks can easily drift seconds per day, accumulating significant errors
over time.
Also, because different clocks tick at different rates, they may not remain
always synchronized although they might be synchronized when they start.
This clearly poses serious problems to applications that depend on a
synchronized notion of time.


For most applications and algorithms that run in a distributed system, we need to know time in one or more of the following contexts:
◮ The time of the day at which an event happened on a specific machine in the
network.
◮ The time interval between two events that happened on different machines in
the network.
◮ The relative ordering of events that happened on different machines in the
network.
Unless the clocks in each machine have a common notion of time, time-based
queries cannot be answered.
Clock synchronization has a significant effect on many problems like secure
systems, fault diagnosis and recovery, scheduled operations, database
systems, and real-world clock values.


Clock synchronization is the process of ensuring that physically distributed processors have a common notion of time.
Due to different clock rates, the clocks at various sites may diverge with
time, and periodically a clock synchronization must be performed to correct
this clock skew in distributed systems.
Clocks are synchronized to an accurate real-time standard like UTC
(Universal Coordinated Time).
Clocks that must not only be synchronized with each other but also have to
adhere to physical time are termed physical clocks.



Definitions and Terminology
Let Ca and Cb be any two clocks.
Time: The time of a clock in a machine p is given by the function Cp (t),
where Cp (t) = t for a perfect clock.
Frequency: Frequency is the rate at which a clock progresses. The
frequency at time t of clock Ca is Ca′ (t).
Offset: Clock offset is the difference between the time reported by a clock
and the real time. The offset of the clock Ca is given by Ca (t) − t. The
offset of clock Ca relative to Cb at time t ≥ 0 is given by Ca (t) − Cb (t).
Skew: The skew of a clock is the difference in the frequencies of the clock
and the perfect clock. The skew of a clock Ca relative to clock Cb at time t
is (Ca′ (t) − Cb′ (t)). If the skew is bounded by ρ, then as per Equation (1),
clock values are allowed to diverge at a rate in the range of 1 − ρ to 1 + ρ.
Drift (rate): The drift of clock Ca is the second derivative of the clock value
with respect to time, namely, Ca′′ (t). The drift of clock Ca relative to clock
Cb at time t is Ca′′ (t) − Cb′′ (t).


Clock Inaccuracies
Physical clocks are synchronized to an accurate real-time standard like UTC
(Universal Coordinated Time).
However, due to the clock inaccuracy discussed above, a timer (clock) is said
to be working within its specification if Equation (1) below holds, where the
constant ρ is the maximum skew rate specified by the manufacturer.

1 − ρ ≤ dC/dt ≤ 1 + ρ        (1)
Figure 3.5 illustrates the behavior of fast, slow, and perfect clocks with
respect to UTC.


[Figure: clock time C versus UTC t; a fast clock has dC/dt > 1, a perfect clock has dC/dt = 1, and a slow clock has dC/dt < 1.]

Figure 3.5: The behavior of fast, slow, and perfect clocks with respect to UTC.


Offset delay estimation method


The Network Time Protocol (NTP), which is widely used for clock
synchronization on the Internet, uses the offset delay estimation method.
The design of NTP involves a hierarchical tree of time servers.
◮ The primary server at the root synchronizes with the UTC.
◮ The next level contains secondary servers, which act as a backup to the
primary server.
◮ At the lowest level is the synchronization subnet which has the clients.


Clock offset and delay estimation:


In practice, a source node cannot accurately estimate the local time on the target
node due to varying message or network delays between the nodes.
This protocol employs a common practice of performing several trials and
chooses the trial with the minimum delay.
Figure 3.6 shows how NTP timestamps are numbered and exchanged
between peers A and B.
Let T1 , T2 , T3 , T4 be the values of the four most recent timestamps as shown.
Assume clocks A and B are stable and running at the same speed.


[Figure: peers A and B exchange a pair of messages; timestamps T1 and T2 are recorded at B, and T3 and T4 at A.]
Figure 3.6: Offset and delay estimation.


Let a = T1 − T3 and b = T2 − T4 .
If the network delay difference from A to B and from B to A, called
differential delay, is small, the clock offset θ and roundtrip delay δ of B
relative to A at time T4 are approximately given by the following.
θ = (a + b)/2 ,        δ = a − b        (2)
Each NTP message includes the latest three timestamps T1 , T2 and T3 ,
while T4 is determined upon arrival.
Thus, both peers A and B can independently calculate delay and offset using
a single bidirectional message stream as shown in Figure 3.7.


[Figure: server A records timestamps Ti−2 and Ti−1 ; server B records Ti−3 and Ti .]
Figure 3.7: Timing diagram for the two servers.


The Network Time Protocol synchronization protocol

A pair of servers in symmetric mode exchange pairs of timing messages.


A store of data is then built up about the relationship between the two
servers (pairs of offset and delay).
Specifically, assume that each peer maintains pairs (Oi ,Di ), where
Oi - measure of offset (θ)
Di - transmission delay of two messages (δ).
The offset corresponding to the minimum delay is chosen.
Specifically, the delay and offset are calculated as follows. Assume that
message m takes time t to transfer and m′ takes t ′ to transfer.

The offset between A’s clock and B’s clock is O. If A’s local clock time is
A(t) and B’s local clock time is B(t), we have
A(t) = B(t) + O (3)
Then,
Ti−2 = Ti−3 + t + O        (4)

Ti = Ti−1 − O + t′        (5)
Assuming t = t′ , the offset Oi can be estimated as:
Oi = (Ti−2 − Ti−3 + Ti−1 − Ti )/2        (6)
The round-trip delay is estimated as:
Di = (Ti − Ti−3 ) − (Ti−1 − Ti−2 )        (7)
The eight most recent pairs of (Oi , Di ) are retained.
The value of Oi that corresponds to minimum Di is chosen to estimate O.
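
A small sketch of this computation in Python (illustrative; the arguments are the timestamps Ti−3 , Ti−2 , Ti−1 , Ti of Figure 3.7):

def offset_delay(t3, t2, t1, t0):
    # t3 = T(i-3), t2 = T(i-2), t1 = T(i-1), t0 = T(i)
    offset = ((t2 - t3) + (t1 - t0)) / 2      # equation (6)
    delay = (t0 - t3) - (t1 - t2)             # equation (7)
    return offset, delay

def estimate_offset(samples):
    # samples: list of (Oi, Di) pairs; keep the eight most recent and
    # choose the offset associated with the minimum delay
    recent = samples[-8:]
    best_offset, _ = min(recent, key=lambda od: od[1])
    return best_offset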