
21CS401 / DISTRIBUTED SYSTEMS

Course Objective
• To explain the foundation and challenges of distributed systems.
• To infer the knowledge of message ordering and group communication.
• To demonstrate the distributed mutual exclusion and deadlock detection algorithms.
• To predict the significance of checkpointing and rollback recovery algorithms.
• To summarize the characteristics of peer-to-peer and distributed shared memory systems.
SYLLABUS
UNIT-1
• Introduction: Definition – Characteristics – Relation to computer system components – Motivation – Message-passing systems versus shared memory systems – Primitives for distributed communication – Synchronous versus asynchronous executions – Challenges of Distributed system: System Perspective. A model of distributed computations: A distributed program – A model of distributed executions – Models of communication networks – Global state – Cuts of a distributed computation.
UNIT-2-MESSAGE ORDERING & GROUP
COMMUNICATION
• Message ordering and group communication: Message ordering
paradigms –Asynchronous execution with synchronous
communication –Synchronous program order on an asynchronous
system –Group communication – Causal order (CO) - Total order.
UNIT-3-DISTRIBUTED MUTEX & DEADLOCK
• Distributed mutual exclusion algorithms: Introduction – Preliminaries – Lamport's algorithm – Ricart–Agrawala algorithm – Maekawa's algorithm – Suzuki–Kasami's broadcast algorithm. Deadlock detection in distributed systems: Introduction – System model – Preliminaries – Models of deadlocks – Knapp's classification.
UNIT-4-CHECKPOINTING AND ROLLBACK
RECOVERY
• Introduction – Background and definitions – Issues in failure recovery – Checkpoint-based recovery – Log-based rollback recovery – Koo–Toueg coordinated checkpointing algorithm – Juang–Venkatesan algorithm for asynchronous checkpointing and recovery.
UNIT-5-P2P & DISTRIBUTED SHARED MEMORY
• Peer-to-peer computing and overlay graphs: Introduction – Data
indexing and overlays – Content addressable networks – Tapestry.
Distributed shared memory: Abstraction and advantages – Types of
memory consistency models.
COURSE OUTCOMES
• CO1: Illustrate the models of communication in building a distributed environment. (K2-Understand)
• CO2: Interpret the order of messages in a communication network for synchronous and asynchronous systems. (K2-Understand)
• CO3: Use the mutex and deadlock detection algorithms in real-time applications. (K2-Understand)
• CO4: Discover the issues of checkpointing and rollback recovery mechanisms in a distributed environment. (K2-Understand)
• CO5: Relate the features of peer-to-peer and memory consistency models for a given application. (K2-Understand)
TEXT BOOK(S):
• T1: Kshemkalyani, Ajay D., and Mukesh Singhal, "Distributed Computing: Principles, Algorithms, and Systems", Cambridge University Press, 2011.
• T2: George Coulouris, Jean Dollimore, and Tim Kindberg, "Distributed Systems: Concepts and Design", 5th Edition, Pearson Education, 2017.
• T3: Tanenbaum A. S., and Van Steen M., "Distributed Systems: Principles and Paradigms", 2nd Edition, Pearson Education, 2017.
UNIT I
Chapters 1, 2
Introduction: Definition – Characteristics – Relation to computer system components – Motivation – Message-passing systems versus shared memory systems – Primitives for distributed communication – Synchronous versus asynchronous executions – Challenges of Distributed system: System Perspective. A model of distributed computations: A distributed program – A model of distributed executions – Models of communication networks – Global state – Cuts of a distributed computation.
Chapter 1 - Introduction
• A distributed system is a system in which components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.
• A distributed system is a collection of autonomous computer systems that are physically separated but connected by a computer network and equipped with distributed system software.
• A distributed system is a collection of independent (autonomous) entities that cooperate to solve a problem that cannot be individually solved.
Middleware
• Middleware refers to software that acts as an intermediary between different applications, services, or components in a distributed system.
Types of Distributed Systems
• Client-server systems: the most traditional and simplest type of distributed system; a multitude of networked computers interact with a central server for data storage, processing, or another common goal.

Types of Distributed Systems
• Peer-to-peer networks: they distribute workloads among hundreds or thousands of computers, all running the same software.
Characteristics of distributed system:
• No common physical clock
• This is an important element of "distribution" in the system and gives rise to the inherent asynchrony among the processors.
• No shared memory
• This requires message-passing for communication.
• A distributed system may still provide the abstraction of a common address space via the distributed shared memory abstraction.
Characteristics of distributed system
• Geographical separation
• The processors of a distributed system may be geographically far apart.
• They may be connected over a wide-area network (WAN), or by a network/cluster of workstations (NOW/COW) configuration connecting processors on a LAN.
• The NOW configuration is built from low-cost, high-speed, off-the-shelf processors. Example: the Google search engine is based on the NOW architecture.
Characteristics of distributed system
• Autonomy and heterogeneity
• The processors are "loosely coupled": they have different speeds and may each run a different operating system, but they cooperate with one another by offering services for solving a problem jointly.
Relation to computer system components
• In a distributed system, each computer has a memory-processing unit, and the computers are connected by a communication network.
• Figure 1.2 shows the relationships of the software components that run on each computer, using the local operating system and the network protocol stack for functioning.
• The distributed software is also termed middleware.

Relation to computer system components
• A distributed execution is the execution of processes across the distributed system to collaboratively achieve a common goal; it is also termed a computation or a run.
• A distributed system follows a layered architecture that reduces the complexity of the system design.
• Middleware hides the heterogeneity transparently at the platform level.
Relation to computer system components
• It is assumed that the middleware layer does not contain application layer functions such as HTTP, mail, FTP, and telnet.
• The user program code includes the code to invoke libraries of the middleware layer to support reliable and ordered multicasting.
Relation to computer system components
• Some commercial versions of middleware often in use are CORBA (common object request broker architecture), DCOM (distributed component object model), Java RMI (remote method invocation), and the message-passing interface (MPI).
Motivation
• Inherently distributed computations
• Resource sharing
• Access to geographically remote data and resources
• Enhanced reliability: availability, integrity, fault-tolerance
• Increased performance/cost ratio
• Scalability
• Modularity and incremental expandability
Motivation
• Inherently distributed computations
• Resource sharing
• Resources cannot be fully replicated at all the sites because it is often neither practical nor cost-effective.
• Access to geographically remote data and resources
• In many scenarios, the data cannot be replicated at every site participating in the distributed execution because it may be too large or too sensitive to be replicated.
Enhanced reliability
• A distributed system provides increased reliability because of the possibility of replicating resources and executions.
• Geographically distributed resources are not likely to crash or malfunction at the same time under normal circumstances.
• Availability: the resource should be accessible at all times.
• Integrity: the value/state of the resource must be correct, in the face of concurrent access from multiple processors.
• Fault-tolerance: the ability to recover from system failures.

• Increased performance/cost ratio
• By resource sharing and accessing geographically remote data and resources, the performance/cost ratio is increased.
• Scalability
• As the processors are usually connected by a wide-area network, adding more processors does not pose a direct bottleneck for the communication network.
• Modularity and incremental expandability
• Heterogeneous processors may be easily added into the system without affecting performance, as long as those processors run the same middleware algorithms.
• Similarly, existing processors may be easily replaced by other processors.


Message-passing vs. Shared Memory
Shared memory systems are those in which there is a (common) shared address space throughout the system.
• Communication among processors takes place via shared data variables, and control variables for synchronization among the processors.
• Examples of shared memory synchronization mechanisms: semaphores and monitors.
Message passing
• Multicomputer systems that do not have a shared address space communicate by message passing.
• Programmers find it easier to program using shared memory than by message passing; this leads to the development of an abstraction.
• The abstraction of shared memory is provided to simulate a shared address space.
• For a distributed system, this abstraction is called distributed shared memory.
Message-passing vs. Shared Memory
• 1.5.1 Emulating message-passing on a shared memory system (MP → SM)
• The shared address space is partitioned into disjoint parts, one part being assigned to each processor.
• "Send" and "receive" operations are implemented by writing to and reading from the destination/sender processor's address space, respectively.
Message-passing vs. Shared Memory
• Specifically, a separate location is reserved as a mailbox (assumed to be of unbounded size) for each ordered pair of processes.
• A Pi–Pj message-passing can be emulated by a write by Pi to the mailbox and then a read by Pj from the mailbox.
• The write and read operations are controlled using synchronization primitives to inform the receiver/sender after the data has been sent/received.
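The following is a minimal sketch (in Python, with threads standing in for processes; all names are illustrative, not from the textbook) of this MP → SM emulation: one mailbox per ordered pair of processes, with a condition variable as the synchronization primitive.

import threading
from collections import deque

class Mailbox:
    # Unbounded mailbox for one ordered pair (sender, receiver).
    def __init__(self):
        self._messages = deque()          # lives in the shared address space
        self._cv = threading.Condition()  # synchronization primitive

    def send(self, data):
        # "Send" is a write into the part of shared memory assigned to the pair.
        with self._cv:
            self._messages.append(data)
            self._cv.notify()             # inform the receiver that data arrived

    def receive(self):
        # "Receive" is a read from the mailbox; block until a message exists.
        with self._cv:
            while not self._messages:
                self._cv.wait()
            return self._messages.popleft()

# One mailbox per ordered pair (Pi, Pj) of processes.
n = 3
mailboxes = {(i, j): Mailbox() for i in range(n) for j in range(n) if i != j}

# A P0–P1 message passing emulated as a write by P0 and a read by P1:
mailboxes[(0, 1)].send("hello from P0")
print(mailboxes[(0, 1)].receive())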
Message-passing vs. Shared Memory
• 1.5.2 Emulating shared memory on a message-passing system (SM → MP)
• This involves the use of "send" and "receive" operations for "write" and "read" operations.
• Each shared location can be modeled as a separate process.
• A "write" to a shared location is emulated by sending an update message to the corresponding owner process, and a "read" by sending a query message.
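A minimal sketch of the reverse SM → MP emulation (again in Python with illustrative names): each shared location is an owner process that serves "write" (update) and "read" (query) messages over queues.

import queue
import threading

class SharedLocation(threading.Thread):
    # Owner process for one shared memory location.
    def __init__(self):
        super().__init__(daemon=True)
        self.requests = queue.Queue()   # incoming messages to the owner
        self.value = None
        self.start()

    def run(self):
        while True:
            op, payload, reply = self.requests.get()
            if op == "write":           # an update message emulates a write
                self.value = payload
                reply.put("ack")
            elif op == "read":          # a query message emulates a read
                reply.put(self.value)

    def write(self, value):
        reply = queue.Queue()
        self.requests.put(("write", value, reply))  # send the update message
        reply.get()                                  # wait for the acknowledgement

    def read(self):
        reply = queue.Queue()
        self.requests.put(("read", None, reply))     # send the query message
        return reply.get()                           # receive the value back

x = SharedLocation()
x.write(42)       # emulated "write" to the shared location
print(x.read())   # emulated "read"; prints 42

Note that every access costs a send and a receive, which is why the next slide calls this emulation expensive.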
Message-passing vs. Shared Memory
• As accessing another processor's memory requires send and receive operations, this emulation is expensive.
• In a MIMD message-passing multicomputer system, each "processor" may be a tightly coupled multiprocessor system with shared memory. Within the multiprocessor system, the processors communicate via shared memory.
• Between two computers, communication is by message passing, which is more suited for wide-area distributed systems.
Primitives for distributed communication
• Blocking/non-blocking, synchronous/asynchronous primitives
• Processor synchrony
• Libraries and standards
Blocking/non-blocking, synchronous/asynchronous primitives
• Message send and message receive communication primitives are denoted Send() and Receive().
• The Send primitive has two parameters: the destination, and the user buffer space containing the data to be sent.
• The Receive primitive has two parameters: the source from which the data is to be received, and the user buffer space into which the data is to be received.
Blocking/non-blocking, synchronous/asynchronous primitives
There are two ways of sending data: the buffered option and the unbuffered option.
• Buffered option: the data is copied from the user buffer to the kernel buffer; later, the data gets copied from the kernel buffer onto the network.
• Unbuffered option: the data gets copied directly from the user buffer onto the network.
Blocking/non-blocking, synchronous/asynchronous primitives
• The Send primitive can use either the buffered or the unbuffered option.
• For the Receive primitive, the buffered option is required because the data may already have arrived when the primitive is invoked, and it needs a storage place in the kernel.
Blocking/non-blocking, synchronous/asynchronous primitives
Synchronous (send/receive)
• A Send and a Receive are synchronous when they establish a handshake between sender and receiver.
• The Send completes when the corresponding Receive completes.
• The Receive completes when the data is copied into the receiver's buffer.
Asynchronous (send)
• Control returns to the process as soon as the data has been copied out of the user-specified buffer.
Blocking/non-blocking, synchronous/asynchronous primitives
Blocking (send/receive)
• Control returns to the invoking process only after the processing of the primitive (whether synchronous or asynchronous) completes.
Non-blocking (send/receive)
• Control returns to the process immediately after invocation, even though the operation has not completed.
• Send: control returns to the process even before the data is copied out of the user buffer.
• Receive: control returns to the process even before the data may have arrived from the sender.
Processor synchrony
• Processor synchrony indicates that all the processors execute in lock-step with their clocks synchronized.
• Lock-step: similar devices share the same timing and triggering, essentially acting as a single device.

Libraries and standards
• Many commercial software products (banking, payroll, and other applications) use proprietary primitive libraries supplied with the software marketed by the vendors (e.g., the IBM CICS software, which has a very widely installed customer base worldwide, uses its own primitives).
• The message-passing interface (MPI) library and the PVM (parallel virtual machine) library are used largely by the scientific community, but other alternative libraries exist.
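To make the blocking/non-blocking distinction concrete, here is a small MPI sketch using the mpi4py Python binding (assuming mpi4py is installed; run with two processes, e.g., mpiexec -n 2 python primitives.py):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Blocking send: control returns only after the primitive completes.
    comm.send({"k": 1}, dest=1, tag=0)
    # Non-blocking send: control returns immediately with a handle;
    # wait() blocks until the operation has actually completed.
    req = comm.isend({"k": 2}, dest=1, tag=1)
    req.wait()
elif rank == 1:
    msg1 = comm.recv(source=0, tag=0)   # blocking receive
    req = comm.irecv(source=0, tag=1)   # non-blocking receive
    msg2 = req.wait()                   # block here until the data arrives
    print(msg1, msg2)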
Synchronous versus asynchronous executions
• In addition to the two classifications of processor synchrony/asynchrony and of synchronous/asynchronous communication primitives, there is another classification, namely that of synchronous/asynchronous executions.
Asynchronous execution
• No processor synchrony, and no bound on the drift rate of clocks
• Message delays are finite but unbounded
• No bound on the time for a step at a process

Synchronous versus asynchronous executions
Synchronous execution
• Processors are synchronized; the clock drift rate is bounded
• Message delivery occurs in one logical step/round
• There is a known upper bound on the time to execute a step at a process
EMULATION: Synchronous versus asynchronous executions
• We have already discussed how a shared memory system can be emulated by a message-passing system, and vice versa.
• We now have four broad classes of programs, as shown in Figure 1.11.
• Using the emulations shown, any class can be emulated by any other.
• If system A can be emulated by system B, denoted A/B, and if a problem is not solvable in B, then it is also not solvable in A.
• Likewise, if a problem is solvable in A, it is also solvable in B.
• Hence, in a sense, all four classes are equivalent in terms of "computability" – what can and cannot be computed – in failure-free systems.
Challenges of Distributed system: System Perspective
• Communication mechanisms: e.g., Remote Procedure Call (RPC), remote object invocation (ROI), message-oriented vs. stream-oriented communication
• Processes: code migration, process/thread management at clients and servers, design of software and mobile agents
Challenges of Distributed system: System Perspective
• Naming: easy-to-use identifiers are needed to locate resources and processes transparently and scalably
• Synchronization: synchronization or coordination among processes is essential; mutual exclusion is the classical example of synchronization
Challenges of Distributed system: System Perspective
• Data storage and access
• Schemes for data storage, search, and lookup should be fast and scalable across the network
• Consistency and replication
• Replication is used for fast access and scalability, and to avoid bottlenecks
• It requires consistency management among the replicas
Challenges of Distributed system: System Perspective
• Fault-tolerance: correct operation despite link, node, and process failures, and the ability to recover from them
• Distributed systems security
• Secure channels, access control, key management (key generation and key distribution), authorization, secure group management
Challenges of Distributed system: System Perspective
• Scalability and modularity of algorithms, data, and services
• The algorithms, data (objects), and services must be as distributed as possible.
Challenges of Distributed system: System Perspective
• API for communications and services: ease of use (including for non-technical users)
• Transparency: hiding implementation policies from the user
• Access: hide differences in data representation across systems; provide uniform operations to access resources
• Location: the locations of resources are transparent
• Migration: relocate resources without renaming them
• Relocation: relocate resources even as they are being accessed
• Replication: hide replication from the users
• Concurrency: mask the use of shared resources
• Failure: provide reliable and fault-tolerant operation
Chapter 2: A Model of Distributed Computations
• A model of distributed computations: A distributed program – A model of distributed executions – Models of communication networks – Global state – Cuts of a distributed computation.
A Distributed Program
• A distributed program is composed of a set of n asynchronous processes p1, p2, ..., pi, ..., pn.
• The processes do not share a global memory and communicate solely by passing messages.
A Distributed Program
• Process execution and message transfer are asynchronous.
• Assume that each process is running on a different processor.
• Let Cij denote the channel from process pi to process pj, and let mij denote a message sent by pi to pj.
• The message transmission delay is finite and unpredictable.
A Model of Distributed Executions
• The execution of a process consists of a sequential execution of its actions.
• The actions are atomic, and the actions of a process are modeled as three types of events:
• internal events,
• message send events,
• message receive events.
A Model of Distributed Executions
• For a message m, let send(m) and rec(m) denote its send and receive events, respectively.
• The occurrence of events changes the states of the respective processes and channels, thus causing transitions in the global system state.
• An internal event changes the state of the process at which it occurs.
A Model of Distributed Executions
• A send event (or a receive event) changes
• the state of the process that sends (or receives) the message, and
• the state of the channel on which the message is sent (or received).
• An internal event affects only the process at which it occurs.
A Model of Distributed Executions
• For every message m that is exchanged between two processes, we have send(m) →msg rec(m).
• The relation →msg defines causal dependencies between send and receive events.
• Figure 2.1 shows a distributed execution involving three processes, using a space–time diagram.
• A horizontal line represents the progress of a process; a dot indicates an event; a slanted arrow indicates a message transfer.
A Model of Distributed Executions
• In this figure, for process p1, the second event is a message send event, the third event is an internal event, and the fourth event is a message receive event.
Causal Precedence Relation
• The execution of a distributed application results in a set of distributed events produced by the processes.
• Let H = ∪i hi denote the set of events executed in a distributed computation.
• Define a binary relation → on the set H that expresses causal dependencies between events in the distributed execution:
• program order: if ei and ej occur at the same process and ei occurs before ej, then ei → ej;
• message order: for every message m, send(m) → rec(m);
• transitivity: if ei → ek and ek → ej, then ei → ej.
• Concurrent events: two events ei and ej are concurrent, denoted ei ∥ ej, if neither ei → ej nor ej → ei.
• Logical vs. physical concurrency: two events are logically concurrent if and only if they do not causally affect each other; physical concurrency additionally means they occur at the same instant in physical time.
Models of communication networks
The models of the service provided by communication networks are:
• In the FIFO model, each channel acts as a first-in first-out message queue; thus, message ordering is preserved by the channel.
• In the non-FIFO model, a channel acts like a set in which the sender process adds messages and the receiver process removes messages from it in a random order.
Models of communication networks
• The "causal ordering" (CO) model is based on Lamport's "happens before" relation. A system that supports the causal ordering model satisfies the following property:
CO: for any two messages mij and mkj sent to the same process pj, if send(mij) → send(mkj), then rec(mij) → rec(mkj).
Models of communication networks
• Causally ordered delivery of messages implies FIFO message delivery.
• Note that CO ⊂ FIFO ⊂ non-FIFO.
• The causal ordering model is useful in developing distributed algorithms.
• Example: in replicated database systems, every process that updates a replica must receive the updates in the same order to maintain database consistency.
Global state of a distributed system
• The global state of a distributed system is a collection of the local states of its components, namely, the processes and the communication channels.
• The state of a process at any time is defined by the contents of processor registers, stacks, local memory, etc., and depends on the local context of the distributed application.
Global state of a distributed system
• The state of a channel is given by the set of messages in transit in the channel.
• The occurrence of events changes the states of the respective processes and channels, thus causing transitions in the global system state.
Global state of a distributed system
• For example, an internal event changes the state of the process at which it occurs.
• A send event (or a receive event) changes the state of the process that sends (or receives) the message and the state of the channel on which the message is sent (or received).
Global state of a distributed system
• LS_i^0 denotes the initial state of process pi.
• LS_i^x is the result of the execution of all the events executed by process pi up to and including event e_i^x.
Global state of a distributed system
• Let SC_ij^{x,y} denote the state of a channel Cij, defined as
SC_ij^{x,y} = { mij | send(mij) ∈ LS_i^x and rec(mij) ∉ LS_j^y }.
• Thus, the channel state SC_ij^{x,y} denotes all messages that pi sent up to event e_i^x and which process pj had not received until event e_j^y.
Global state
• The global state GS of a distributed system, a collection of the local states of the processes and the channels, is defined as
GS = { ∪i LS_i^{xi} , ∪j,k SC_jk^{yj,zk} }.
Global state
• For a global snapshot, the states of all the components of the distributed system must be recorded at the same instant.
• This is possible only if the local clocks at the processes are perfectly synchronized.
• The basic idea is that a message cannot be received if it was not sent, i.e., the state should not violate causality. Such states are called consistent global states.
Global state
• Inconsistent global states are not meaningful in a distributed system.
• For example, the global state GS consisting of the local states {LS_1^1, LS_2^3, LS_3^3, LS_4^2} is inconsistent, whereas {LS_1^2, LS_2^4, LS_3^4, LS_4^2} is consistent.
Cuts of a distributed computation
• A consistent global state corresponds to a cut in which every message received in the PAST of the cut was sent in the PAST of that cut. Such a cut is known as a consistent cut.
• All messages that cross the cut from the PAST to the FUTURE are in transit in the corresponding consistent global state.
• A cut is inconsistent if a message crosses the cut from the FUTURE to the PAST.
• For example, the space–time diagram of Figure 2.3 shows two cuts, C1 and C2.
• C1 is an inconsistent cut, whereas C2 is a consistent cut.
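The consistency condition above can be stated as a short check. The following sketch (an illustrative representation, not from the textbook) describes a cut by the index of the last event of each process in its PAST, and each message by the event indices of its send and receive:

def is_consistent_cut(cut, messages):
    # cut: dict mapping process -> index of its last event in the cut's PAST.
    # messages: list of (sender, send_event_idx, receiver, recv_event_idx).
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_past = recv_idx <= cut[receiver]
        sent_in_past = send_idx <= cut[sender]
        if received_in_past and not sent_in_past:
            return False   # message crosses from FUTURE to PAST: inconsistent
    return True

# Message m sent as p1's event 3 and received as p2's event 2:
msgs = [("p1", 3, "p2", 2)]
print(is_consistent_cut({"p1": 3, "p2": 2}, msgs))  # True: send is in the PAST
print(is_consistent_cut({"p1": 2, "p2": 2}, msgs))  # False: rec(m) without send(m)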

Past and Future Cones of an Event
Past Cone of an Event
• An event ej could have been affected only by events ei such that ei → ej. In this situation, all the information available at ei could be made accessible at ej.
• All such events ei belong to the past of ej.
• Let Past(ej) denote all events in the past of ej in a computation (H, →). Then
Past(ej) = {ei | ∀ei ∈ H, ei → ej}.
• Figure 2.4 (next slide) shows the past of an event ej.

Past and Future Cones of an Event
Figure 2.4: Illustration of the past and future cones of an event ej, showing PAST(ej) and FUTURE(ej) together with max(Past_i(ej)) and min(Future_i(ej)) at each process pi.

Past and Future Cones of an Event
• Let Past_i(ej) be the set of all those events of Past(ej) that are on process pi.
• Past_i(ej) is a totally ordered set, ordered by the relation →i, whose maximal element is denoted max(Past_i(ej)).
• max(Past_i(ej)) is the latest event at process pi that affected event ej (Figure 2.4).

Past and Future Cones of an Event
• Let Max_Past(ej) = ∪(∀i) {max(Past_i(ej))}.
• Max_Past(ej) consists of the latest event at every process that affected event ej and is referred to as the surface of the past cone of ej.
• Past(ej) represents all events on the past light cone that affect ej.
Future Cone of an Event
• The future of an event ej, denoted Future(ej), contains all events ei that are causally affected by ej (see Figure 2.4).
• In a computation (H, →), Future(ej) is defined as
Future(ej) = {ei | ∀ei ∈ H, ej → ei}.

Past and Future Cones of an Event
• Define Future_i(ej) as the set of those events of Future(ej) that are on process pi.
• Define min(Future_i(ej)) as the first event on process pi that is affected by ej.
• Define Min_Future(ej) as ∪(∀i) {min(Future_i(ej))}, which consists of the first event at every process that is causally affected by event ej.
• Min_Future(ej) is referred to as the surface of the future cone of ej.
• All events at a process pi that occurred after max(Past_i(ej)) but before min(Future_i(ej)) are concurrent with ej.
• Therefore, all and only those events of computation H that belong to the set "H − Past(ej) − Future(ej)" are concurrent with event ej.

Models of Process Communications
• There are two basic models of process communications: synchronous and asynchronous.
• The synchronous communication model is a blocking type, where on a message send, the sender process blocks until the message has been received by the receiver process.
• The sender process resumes execution only after it learns that the receiver process has accepted the message.
• Thus, the sender and the receiver processes must synchronize to exchange a message.
• The asynchronous communication model is a non-blocking type, where the sender and the receiver do not synchronize to exchange a message.
• After having sent a message, the sender process does not wait for the message to be delivered to the receiver process.
• The message is buffered by the system and is delivered to the receiver process when it is ready to accept the message.

Models of Process Communications
• Neither of the communication models is superior to the other.
• Asynchronous communication provides higher parallelism because the sender process can execute while the message is in transit to the receiver.
• However, a buffer overflow may occur if a process sends a large number of messages in a burst to another process.
• Thus, an implementation of asynchronous communication requires more complex buffer management.
• In addition, due to the higher degree of parallelism and non-determinism, it is much more difficult to design, verify, and implement distributed algorithms for asynchronous communications.
• Synchronous communication is simpler to handle and implement.
• However, due to frequent blocking, it is likely to have poor performance and is likely to be more prone to deadlocks.
Chapter 3 - Logical Time
• Logical Time: A framework for a system of logical clocks – Scalar time – Vector time – Physical clock synchronization: NTP.
Introduction
• The concept of causality between events is fundamental to the design and analysis of parallel and distributed computing and operating systems.
• Usually, causality is tracked using physical time.
• In distributed systems, it is not possible to have a global physical time.
• As asynchronous distributed computations make progress in spurts, logical time is sufficient to capture the fundamental monotonicity property associated with causality in distributed systems.
• There are three ways to implement logical time: scalar time, vector time, and matrix time.
A Framework for a System of Logical Clocks

• Definition
• Implementing logical clocks
Definition
• A system of logical clocks consists of a time domain T and a logical clock C.
• The elements of T form a partially ordered set over a relation <.
• This relation < is called the happened before or causal precedence relation.
• The logical clock C is a function that maps an event e in a distributed system to an element in the time domain T, denoted C(e) and called the timestamp of e:
C : H → T
such that the following property is satisfied:
for two events ei and ej, ei → ej ⇒ C(ei) < C(ej).
• This monotonicity property is called the clock consistency condition.
• When T and C satisfy the condition
for two events ei and ej, ei → ej ⇔ C(ei) < C(ej),
the system of clocks is said to be strongly consistent.
Implementing Logical Clocks
• Implementation of logical clocks requires addressing two issues:
• data structures local to every process to represent logical time, and
• a protocol to update the data structures to ensure the consistency condition.
• Each process pi maintains data structures that allow the following two capabilities:
• a local logical clock, denoted lc_i, that helps process pi measure its own progress, and
• a logical global clock, denoted gc_i, that is a representation of process pi's local view of the logical global time.
• The protocol ensures that a process's logical clock, and its view of the global time, is managed consistently.
• The protocol consists of the following two rules:
• R1: This rule governs how the local logical clock is updated by a process when it executes an event.
• R2: This rule governs how a process updates its global logical clock to update its view of the global time and global progress.
• Systems of logical clocks differ in their representation of logical time and also in the protocol used to update the logical clocks.
Scalar Time
• Proposed by Lamport in 1978 as an attempt to totally order events in a distributed system.
• The time domain is the set of non-negative integers.
• The logical local clock of a process pi and its local view of the global time are squashed into one integer variable Ci.
• Rules R1 and R2 to update the clocks are as follows:
• R1: Before executing an event (send, receive, or internal), process pi executes the following:
Ci := Ci + d (d > 0)
• In general, every time R1 is executed, d can have a different value; however, typically d is kept at 1.
• R2: Each message piggybacks the clock value of its sender at sending time. When a process pi receives a message with timestamp Cmsg, it executes the following actions:
1. Ci := max(Ci, Cmsg)
2. Execute R1.
3. Deliver the message.
• Figure 3.1 shows the evolution of scalar time with d = 1.
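A minimal sketch of rules R1 and R2 in Python (with d = 1; class and method names are illustrative, not from the textbook):

class ScalarClock:
    def __init__(self):
        self.c = 0

    def tick(self, d=1):
        # R1: before executing any event, advance the local clock.
        self.c += d
        return self.c

    def send(self):
        # R2, sender side: piggyback the clock value on the message.
        return self.tick()

    def receive(self, c_msg):
        # R2, receiver side: take the max, execute R1, then deliver.
        self.c = max(self.c, c_msg)
        return self.tick()

p1, p2 = ScalarClock(), ScalarClock()
ts = p1.send()            # p1 sends a message timestamped ts = 1
p2.tick()                 # an internal event at p2
print(p2.receive(ts))     # p2's clock advances past ts, preserving causality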
Basic Properties
• Consistency Property
• Scalar clocks satisfy monotonicity and hence the consistency property:
for two events ei and ej, ei → ej ⇒ C(ei) < C(ej).
• Total Ordering
• Scalar clocks can be used to totally order events in a distributed system.
• The main problem in totally ordering events is that two or more events at different processes may have identical timestamps.
• For example, in Figure 3.1, the third event of process P1 and the second event of process P2 have identical scalar timestamps.
• A tie-breaking mechanism is needed to order such events. A tie is broken as follows:
• Process identifiers are linearly ordered, and a tie among events with identical scalar timestamps is broken on the basis of their process identifiers.
• The lower the process identifier in the ranking, the higher the priority.
• The timestamp of an event is denoted by a tuple (t, i), where t is its time of occurrence and i is the identity of the process where it occurred.
• The total order relation ≺ on two events x and y with timestamps (h, i) and (k, j), respectively, is defined as follows:
x ≺ y ⇔ (h < k) or (h = k and i < j)
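This total order is exactly lexicographic comparison of (t, i) pairs; in Python, tuple comparison implements it directly (the values below are made up for illustration):

x = (4, 2)    # event with scalar time h = 4 at process i = 2
y = (4, 3)    # event with scalar time k = 4 at process j = 3
print(x < y)  # True: equal times, tie broken by the lower process identifier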


• Event counting
• If the increment value d is always 1, scalar time has the following interesting property: if an event e has a timestamp h, then h − 1 represents the minimum logical duration, counted in units of events, required before producing the event e.
• We call it the height of the event e.
• In other words, h − 1 events have been produced sequentially before the event e, regardless of the processes that produced these events.
• For example, in Figure 3.1, five events precede event b on the longest causal path ending at b.

Vector Time
• The system of vector clocks was developed independently by Fidge, Mattern, and Schmuck.
• In the system of vector clocks, the time domain is represented by a set of n-dimensional non-negative integer vectors.
• Each process pi maintains a vector vti[1..n], where vti[i] is the local logical clock of pi and describes the logical time progress at process pi.
• vti[j] represents process pi's latest knowledge of process pj's local time.
• If vti[j] = x, then process pi knows that the local time at process pj has progressed till x.
• The entire vector vti constitutes pi's view of the global logical time and is used to timestamp events.

Vector Time
Process pi uses the following two rules R1 and R2 to update its clock:
• R1: Before executing an event, process pi updates its local logical time as follows:
vti[i] := vti[i] + d (d > 0)
• R2: Each message m is piggybacked with the vector clock vt of the sender process at sending time. On the receipt of such a message (m, vt), process pi executes the following sequence of actions:
1. Update its global logical time: for 1 ≤ k ≤ n, vti[k] := max(vti[k], vt[k])
2. Execute R1.
3. Deliver the message m.
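A minimal sketch of these two rules in Python (d = 1; names are illustrative, not from the textbook):

class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.vt = [0] * n            # the vector vti[1..n] from the slide text

    def tick(self, d=1):
        # R1: before executing an event, advance the local component.
        self.vt[self.pid] += d

    def send(self):
        # R2, sender side: piggyback a copy of the vector clock on the message.
        self.tick()
        return list(self.vt)

    def receive(self, vt_msg):
        # R2, receiver side: component-wise max, then execute R1, then deliver.
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
        self.tick()

p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
m = p0.send()        # p0 sends a message with timestamp [1, 0]
p1.receive(m)        # p1's clock becomes [1, 1]
print(p1.vt)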


Vector Time
• The timestamp of an event is the value of the vector clock of its process when the event is executed.
• Figure 3.2 shows an example of vector clock progress with the increment value d = 1.
• Initially, a vector clock is [0, 0, 0, ..., 0].

Vector Time
An Example of Vector Clocks
Figure 3.2: Evolution of vector time (the figure shows the vector timestamps assigned to the events of processes p1, p2, and p3, with d = 1).

Vector Time
Comparing Vector Timestamps
The following relations are defined to compare two vector timestamps, vh and vk:
vh = vk ⇔ ∀x : vh[x] = vk[x]
vh ≤ vk ⇔ ∀x : vh[x] ≤ vk[x]
vh < vk ⇔ vh ≤ vk and ∃x : vh[x] < vk[x]
vh ∥ vk ⇔ ¬(vh < vk) ∧ ¬(vk < vh)
If the process at which an event occurred is known, the test to compare two timestamps can be simplified as follows: if events x and y occurred at processes pi and pj and are assigned timestamps vh and vk, respectively, then
x → y ⇔ vh[i] ≤ vk[i]
x ∥ y ⇔ vh[i] > vk[i] ∧ vh[j] < vk[j]
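Written out directly in Python (an illustrative sketch of the relations above):

def vle(vh, vk):         # vh <= vk
    return all(a <= b for a, b in zip(vh, vk))

def vlt(vh, vk):         # vh < vk
    return vle(vh, vk) and any(a < b for a, b in zip(vh, vk))

def concurrent(vh, vk):  # vh || vk
    return not vlt(vh, vk) and not vlt(vk, vh)

print(vlt([1, 0], [2, 1]))         # True: the first event causally precedes
print(concurrent([1, 0], [0, 1]))  # True: the events are concurrent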


Vector Time
Properties of Vector Time
Isomorphism
• If events in a distributed system are timestamped using a system of vector clocks, we have the following property: if two events x and y have timestamps vh and vk, respectively, then
x → y ⇔ vh < vk
x ∥ y ⇔ vh ∥ vk.
• Thus, there is an isomorphism between the set of partially ordered events produced by a distributed computation and their vector timestamps.

Vector Time
Strong Consistency
• The system of vector clocks is strongly consistent; thus, by examining the vector timestamps of two events, we can determine if the events are causally related.
• However, Charron-Bost showed that the dimension of vector clocks cannot be less than n, the total number of processes in the distributed computation, for this property to hold.
Event Counting
• If d = 1 (in rule R1), then the ith component of the vector clock at process pi, vti[i], denotes the number of events that have occurred at pi until that instant.
• So, if an event e has timestamp vh, then vh[j] denotes the number of events executed by process pj that causally precede e. Clearly, Σj vh[j] − 1 represents the total number of events that causally precede e in the distributed computation.

Physical Clock Synchronization: NTP
Motivation
• In centralized systems, there is only a single clock. A process gets the time by simply issuing a system call to the kernel.
• In distributed systems, there is no global clock or common memory. Each processor has its own internal clock and its own notion of time.
• These clocks can easily drift seconds per day, accumulating significant errors over time.
• Also, because different clocks tick at different rates, they may not remain always synchronized, even though they might be synchronized when they start.
• This clearly poses serious problems for applications that depend on a synchronized notion of time.

Physical Clock Synchronization: NTP
Motivation
• For most applications and algorithms that run in a distributed system, we need to know time in one or more of the following contexts:
• the time of day at which an event happened on a specific machine in the network;
• the time interval between two events that happened on different machines in the network;
• the relative ordering of events that happened on different machines in the network.
• Unless the clocks in each machine have a common notion of time, time-based queries cannot be answered.
• Clock synchronization has a significant effect on many problems, such as secure systems, fault diagnosis and recovery, scheduled operations, database systems, and real-world clock values.

Physical Clock Synchronization: NTP
• Clock synchronization is the process of ensuring that physically distributed processors have a common notion of time.
• Due to different clock rates, the clocks at various sites may diverge with time, and periodically a clock synchronization must be performed to correct this clock skew in distributed systems.
• Clocks are synchronized to an accurate real-time standard like UTC (Universal Coordinated Time).
• Clocks that must not only be synchronized with each other but also have to adhere to physical time are termed physical clocks.

Physical Clock Synchronization: NTP
Definitions and Terminology
• Let Ca and Cb be any two clocks.
• Time: The time of a clock in a machine p is given by the function Cp(t), where Cp(t) = t for a perfect clock.
• Frequency: Frequency is the rate at which a clock progresses. The frequency at time t of clock Ca is C′a(t).
• Offset: Clock offset is the difference between the time reported by a clock and the real time. The offset of the clock Ca is given by Ca(t) − t. The offset of clock Ca relative to Cb at time t ≥ 0 is given by Ca(t) − Cb(t).
• Skew: The skew of a clock is the difference in the frequencies of the clock and the perfect clock. The skew of a clock Ca relative to clock Cb at time t is C′a(t) − C′b(t). If the skew is bounded by ρ, then, as per Equation (1), clock values are allowed to diverge at a rate in the range 1 − ρ to 1 + ρ.
• Drift (rate): The drift of clock Ca is the second derivative of the clock value with respect to time, namely C″a(t). The drift of clock Ca relative to clock Cb at time t is C″a(t) − C″b(t).

Physical Clock Synchronization: NTP
Clock Inaccuracies
• Physical clocks are synchronized to an accurate real-time standard like UTC (Universal Coordinated Time).
• However, due to the clock inaccuracies discussed above, a timer (clock) is said to be working within its specification if
1 − ρ ≤ dC/dt ≤ 1 + ρ,    (1)
where the constant ρ is the maximum skew rate specified by the manufacturer.
• Figure 3.5 illustrates the behavior of fast, slow, and perfect clocks with respect to UTC.

Physical Clock Synchronization: NTP
Figure 3.5: The behavior of fast (dC/dt > 1), slow (dC/dt < 1), and perfect (dC/dt = 1) clocks with respect to UTC (clock time C plotted against UTC time t).

Physical Clock Synchronization: NTP
Offset delay estimation method
• The Network Time Protocol (NTP), which is widely used for clock synchronization on the Internet, uses the offset delay estimation method.
• The design of NTP involves a hierarchical tree of time servers:
• the primary server at the root synchronizes with the UTC;
• the next level contains secondary servers, which act as a backup to the primary server;
• at the lowest level is the synchronization subnet, which has the clients.

Physical Clock Synchronization: NTP
Clock offset and delay estimation:
• In practice, a source node cannot accurately estimate the local time on the target node due to varying message or network delays between the nodes.
• This protocol employs the common practice of performing several trials, and it chooses the trial with the minimum delay.
• Figure 3.6 shows how NTP timestamps are numbered and exchanged between peers A and B.
• Let T1, T2, T3, T4 be the values of the four most recent timestamps, as shown.
• Assume that clocks A and B are stable and running at the same speed.

Figure 3.6: Offset and delay estimation (timestamps T1 and T2 are taken at peer B; T3 and T4 are taken at peer A).

• Let a = T1 − T3 and b = T2 − T4.
• If the network delay difference from A to B and from B to A, called the differential delay, is small, the clock offset θ and roundtrip delay δ of B relative to A at time T4 are approximately given by
θ = (a + b)/2,    δ = a − b.    (2)
• Each NTP message includes the latest three timestamps T1, T2, and T3, while T4 is determined upon arrival.
• Thus, both peers A and B can independently calculate the delay and offset using a single bidirectional message stream, as shown in Figure 3.7.
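A small sketch of equation (2) in Python (the timestamp values are made up for illustration):

def ntp_offset_delay(t1, t2, t3, t4):
    a = t1 - t3
    b = t2 - t4
    theta = (a + b) / 2   # clock offset of B relative to A
    delta = a - b         # roundtrip delay
    return theta, delta

# Example: one-way delays of 5 and 7 ticks, with B's clock 100 ticks ahead of A's.
theta, delta = ntp_offset_delay(t1=1105, t2=1110, t3=1000, t4=1017)
print(theta, delta)   # 99.0 12: offset near 100, off by half the delay asymmetry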


Figure 3.7: Timing diagram for the two servers (server A takes timestamps Ti−2 and Ti−1; server B takes Ti−3 and Ti).
