Logical Clocks
Causal Ordering
Global State Recording
Termination Detection
Lamport’s Clock
Happened before relation:
a -> b : Event a occurred before event b. Events in the same
process.
a -> b : If a is the event of sending a message m in a process
and b is the event of receipt of the same message m by another
process.
a -> b, b -> c, then a -> c. “->” is transitive.
Causally Ordered Events
a -> b : Event a “causally” affects event b
Concurrent Events
a || b: if a !-> b and b !-> a
Space-time Diagram
[Figure: space-time diagram. Process P1 with events e11–e14 and process P2 with events e21–e24; internal events and messages are shown, with space on the vertical axis and time on the horizontal axis.]
Logical Clocks
Conditions satisfied:
Ci is clock in Process Pi.
If a -> b in process Pi, Ci(a) < Ci(b)
Let a: sending message m in Pi; b : receiving message m in Pj;
then, Ci(a) < Cj(b).
Implementation Rules:
R1: Ci = Ci + d (d > 0); clock is updated between two
successive events.
R2: Cj = max(Cj, tm+ d); (d > 0); When Pj receives a message
m with a time stamp tm (tm assigned by Pi, the sender; tm =
Ci(a), a being the event of sending message m).
A reasonable value for d is 1
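As a concrete illustration, here is a minimal Python sketch of rules R1 and R2 with d = 1 (the class and method names are illustrative, not from the slides):

```python
class LamportClock:
    """Scalar logical clock; implements rules R1 and R2 with d = 1."""

    def __init__(self):
        self.c = 0

    def tick(self):
        # R1: Ci = Ci + d between two successive events
        self.c += 1
        return self.c

    def send_stamp(self):
        # time-stamp an outgoing message with the sender's clock
        return self.tick()

    def receive(self, tm):
        # R2: Cj = max(Cj, tm + d) on receiving a message stamped tm
        self.c = max(self.c, tm + 1)
        return self.c
```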
Space-time Diagram
[Figure: space-time diagram with Lamport clock values. P1's events e11–e17 carry timestamps (1)–(7).]
Limitation of Lamport’s Clock
[Figure: space-time diagram; events e11–e13 on P1 carry timestamps (1)–(3).]
Lamport clocks guarantee a -> b implies C(a) < C(b), but the converse does not hold: C(a) < C(b) does not imply a -> b, so causality cannot be inferred from the timestamps alone.
Vector Clocks Comparison
1. Equal: ta = tb iff ta[i] = tb[i], for all i
2. Not equal: ta != tb iff ta[i] != tb[i], for at least one i
3. Less than or equal: ta <= tb iff ta[i] <= tb[i], for all i
4. Less than: ta < tb iff ta <= tb and ta != tb
5. Concurrent: ta || tb iff !(ta < tb) and !(tb < ta)
6. Not less than or equal: ta !<= tb iff ta[i] > tb[i], for at least one i
7. Not less than: ta !< tb iff !(ta < tb)
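A small Python sketch of comparisons 3–5 above, with timestamps as tuples (function names are illustrative):

```python
def vc_leq(ta, tb):
    """ta <= tb iff ta[i] <= tb[i] for all i."""
    return all(a <= b for a, b in zip(ta, tb))

def vc_less(ta, tb):
    """ta < tb iff ta <= tb and ta != tb."""
    return vc_leq(ta, tb) and ta != tb

def vc_concurrent(ta, tb):
    """ta || tb iff neither ta < tb nor tb < ta."""
    return not vc_less(ta, tb) and not vc_less(tb, ta)

# e.g., (2,0,0) and (0,0,2) are concurrent; (1,0,0) < (2,0,1)
assert vc_concurrent((2, 0, 0), (0, 0, 2))
assert vc_less((1, 0, 0), (2, 0, 1))
```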
Vector Clock …
[Figure: space-time diagram with vector clocks. P1's events e11–e13 carry (1,0,0), (2,0,0), (3,4,1); P3's events e31, e32 carry (0,0,1), (0,0,2).]
Causal Ordering of Messages
[Figure: space-time diagram illustrating causal ordering: Send(M1) happens before Send(M2), so M1 (1) must be delivered before M2 (2) at P3.]
Message Ordering …
The concern here is not maintaining clocks as such, but ordering the messages sent and received among all processes in a distributed system.
(e.g.,) Send(M1) -> Send(M2), M1 should be received
ahead of M2 by all processes.
This is not guaranteed by the communication network since
M1 may be from P1 to P2 and M2 may be from P3 to P4.
Message ordering:
Deliver a message only if the preceding one has already been
delivered.
Otherwise, buffer it up.
BSS Algorithm
BSS: Birman-Schiper-Stephenson algorithm. Assumes broadcast communication; messages carry vector time stamps.
BSS Algorithm ...
1. Process Pi increments the vector time VTpi[i], time-stamps message m, and broadcasts it. VTpi[i] - 1 denotes the number of messages from Pi preceding m.
2. Pj != Pi receives m. m is delivered when:
a. VTpj[i] == VTm[i] - 1
b. VTpj[k] >= VTm[k] for all k in {1,2,..n} - {i}, n is the
total number of processes. Delayed messages are queued in sorted order.
c. Concurrent messages are ordered by time of receipt.
3. When m is delivered at Pj, VTpj is updated according to rule 2 of the vector clocks.
2(a) : Pj has received all Pi’s messages preceding m.
2(b): Pj has received all other messages received by Pi
before sending m.
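A hedged Python sketch of the delivery test in rules 2(a) and 2(b) (names are illustrative):

```python
def bss_deliverable(vt_pj, vt_m, i):
    """True if message m, broadcast by Pi with timestamp vt_m,
    can be delivered at Pj, whose vector time is vt_pj."""
    n = len(vt_m)
    if vt_pj[i] != vt_m[i] - 1:          # 2(a): all of Pi's earlier broadcasts delivered
        return False
    return all(vt_pj[k] >= vt_m[k]       # 2(b): everything Pi had seen, Pj has seen
               for k in range(n) if k != i)

# On delivery, Pj merges: vt_pj[k] = max(vt_pj[k], vt_m[k]) for all k (rule 3).
```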
BSS Algorithm …
[Figure: BSS example with four processes. A message time-stamped (0,0,1) and a later one time-stamped (0,1,1) are broadcast; P1 receives (0,1,1) first and buffers it, then delivers it from the buffer after (0,0,1) has been delivered.]
SES Algorithm
SES: Schiper-Eggli-Sandoz Algorithm. No need for
broadcast messages.
Each process maintains a vector V_P of size N - 1, N being the number of processes in the system.
V_P is a vector of tuples (P', t): P' the destination process id and t a vector timestamp.
Tm: logical time of sending message m.
Tpi: present logical time at Pi.
Initially, V_P is empty.
SES Algorithm
Sending a Message:
Send message M, time stamped tm, along with V_P1 to P2.
Insert (P2, tm) into V_P1. Overwrite the previous value of
(P2,t), if any.
(P2, tm) is not sent. Any future message carrying (P2, tm) in V_P1 cannot be delivered to P2 until tm < Tp2.
Delivering a message
If V_M (in the message) does not contain any pair (P2, t), it can
be delivered.
/* (P2, t) exists */ If t !< Tp2, buffer the message. (Don’t
deliver).
else (t < Tp2) deliver it
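A minimal Python sketch of this delivery rule, assuming V_M is a list of (process id, vector timestamp) pairs (names illustrative):

```python
def ses_deliverable(v_m, tp, dest):
    """v_m: list of (process_id, t) pairs carried by the message;
    tp: destination's current vector time Tp; dest: destination id.
    Deliver iff v_m has no entry for dest, or that entry's t < tp."""
    for pid, t in v_m:
        if pid == dest:
            # t < tp: t[i] <= tp[i] for all i, and t != tp
            return all(a <= b for a, b in zip(t, tp)) and tuple(t) != tuple(tp)
    return True  # no (dest, t) pair: deliver immediately
```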
SES Algorithm ...
What does the condition t !< Tp2 imply?
t is the message's vector time stamp in the pair (P2, t).
t !< Tp2 means t[j] > Tp2[j] for at least one j.
This implies some events occurred without P2's knowledge in other processes. So P2 decides to buffer the message.
When t < Tp2, message is delivered & Tp2 is updated
with the help of V_P2 (after the merge operation).
SES Buffering Example
[Figure: SES buffering example. M1 from P2 to P1 (Tp2 = (0,1,0), V_P2 empty); M2 from P2 to P3 (Tp2 = (0,2,0), V_P2 = {(P1, <0,1,0>)}; afterwards V_P2 also holds (P3, <0,2,0>)); M3 and M4 from P3 to P1 (Tp3 = (0,2,1), then (0,2,2); V_P3 = {(P1, <0,1,0>)}, later {(P1, <0,2,2>)}); Tp1 advances from (0,0,0) to (1,1,0) and then (2,2,2).]
SES Buffering Example...
M1 from P2 to P1: M1 + Tm (=<0,1,0>) + Empty V_P2
M2 from P2 to P3: M2 + Tm (<0, 2, 0>) + (P1, <0,1,0>)
M3 from P3 to P1: M3 + <0,2,2> + (P1, <0,1,0>)
M3 gets buffered because:
Tp1 is <0,0,0>, t in (P1, t) is <0,1,0> & so Tp1 < t
When M1 is received by P1:
Tp1 becomes <1,1,0>, by rules 1 and 2 of vector clock.
After updating Tp1, P1 checks buffered M3.
Now, Tp1 > t [t in (P1, <0,1,0>)].
So M3 is delivered.
SES Algorithm ...
On delivering the message:
Merge V_M (in the message) with V_P2 as follows:
If there is no entry for a process P in V_P2, insert (P, t) from V_M.
If an entry (P, t) is present in V_P2, update t with the component-wise maximum: t[i] := max(t[i] in V_M, t[i] in V_P2).
As before, a message cannot be delivered to P until the t it carries for P is less than P's present logical time.
Update site P2’s local, logical clock.
Check buffered messages after local, logical clock
update.
SES Algorithm …
[Figure: another SES example. P3 sends M1 at Tp3 = (0,0,1) with V_P3 empty; P2 receives it (Tp2 = (0,1,1)) and sends M2 at (0,2,1) with V_P2 empty; P1 receives M2 (Tp1 = (1,2,1), then (2,2,1)); Tp3 later advances to (0,2,2).]
Handling Multicasts
Each node can maintain n x n matrix M, n being the
number of processes.
Node i multicasts to j and k: increments Mi[i,j] and
Mi[i,k]. M sent along with the message.
When node j receives message m from i, it can be
delivered if and only if:
Mj[i,j] = Mm[i,j] - 1
Mj[k,j] >= Mm[k,j] for all k != i.
Else buffer the message
On message delivery: Mj[x,y] = max(Mj[x,y], Mm[x,y])
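A Python sketch of the matrix-based delivery test and merge, with matrices as lists of lists (illustrative names):

```python
def multicast_deliverable(M_j, M_m, i, j):
    """M_j: node j's n x n matrix; M_m: matrix carried by message m from node i.
    Deliver at j iff M_j[i][j] == M_m[i][j] - 1 and
    M_j[k][j] >= M_m[k][j] for all k != i."""
    n = len(M_m)
    if M_j[i][j] != M_m[i][j] - 1:
        return False
    return all(M_j[k][j] >= M_m[k][j] for k in range(n) if k != i)

def on_deliver(M_j, M_m):
    """Component-wise maximum on delivery."""
    n = len(M_m)
    for x in range(n):
        for y in range(n):
            M_j[x][y] = max(M_j[x][y], M_m[x][y])
```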
Handling Multicasts: Example
[Figure: multicast example with P1, P2, P3, showing each node's 3x3 matrix (initially all zeros) before and after messages M1 and M2 are multicast and delivered.]
Global State
Global State 1: A = $500, B = $200; C1: empty, C2: empty.
Global State 2: A = $450, B = $200; C1: Tx $50, C2: empty.
Global State 3: A = $450, B = $250; C1: empty, C2: empty.
Recording Global State...
(e.g.,) Global state of A is recorded in (1) and not in (2).
State of B, C1, and C2 are recorded in (2)
Extra amount of $50 will appear in global state
Reason: A’s state recorded before sending message and C1’s state
after sending message.
Inconsistent global state if n < n’, where
n is number of messages sent by A along channel before A’s state
was recorded
n’ is number of messages sent by A along the channel before
channel’s state was recorded.
Consistent global state: n = n’
Recording Global State...
Similarly, for consistency m = m’
m’: no. of messages received along channel before B’s state recording
m: no. of messages received along channel by B before channel’s state was
recorded.
Also, n’ >= m, as in no system no. of messages sent along the
channel be less than that received
Hence, n >= m
Consistent global state should satisfy the above equation.
Consistent global state:
Channel state: sequence of messages sent before recording sender’s state,
excluding the messages received before receiver’s state was recorded.
Only transit messages are recorded in the channel state.
Recording Global State
Send(Mij): message M sent from Si to Sj
rec(Mij): message M received by Sj, from Si
time(x): Time of event x
LSi: local state at Si
send(Mij) is in LSi iff (if and only if) time(send(Mij)) <
time(LSi)
rec(Mij) is in LSj iff time(rec(Mij)) < time(LSj)
transit(LSi, LSj) : set of messages sent/recorded at LSi
and NOT received/recorded at LSj
Recording Global State …
inconsistent(LSi,LSj): set of messages NOT sent/recorded
at LSi and received/recorded at LSj
Global State, GS: {LS1, LS2,…., LSn}
Consistent global state: GS = {LS1, .., LSn} AND, for all pairs i, j, inconsistent(LSi, LSj) is null.
Transitless global state: GS = {LS1, .., LSn} AND, for all pairs i, j, transit(LSi, LSj) is null.
Recording Global State ..
[Figure: S1 records LS1, S2 records LS2. M1, sent before LS1 was recorded and not received before LS2: transit. M2, sent after LS1 was recorded but received before LS2: inconsistent.]
Recording Global State...
Strongly consistent global state: consistent and transitless,
i.e., all send and the corresponding receive events are
recorded in all LSi.
[Figure: local states LS11, LS12 of S1 and LS21, LS22, LS23 of S2, illustrating which recording points form a strongly consistent global state.]
Chandy-Lamport Algorithm
Distributed algorithm to capture a consistent global state. Communication channels
assumed to be FIFO.
Uses a marker to initiate the algorithm. The marker is a kind of dummy message, with no effect on the functions of the processes.
Sending Marker by P:
P records its state.
For each outgoing channel C, P sends a marker on C before P sends further
messages along C.
Receiving Marker by Q:
If Q has NOT recorded its state: (a). Record the state of C as an empty sequence.
(b) SEND marker (use above rule).
Else (Q has recorded state before): Record the state of C as sequence of messages
received along C, after Q’s state was recorded and before Q received the marker.
FIFO channel condition + markers help in satisfying consistency
condition.
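A condensed Python sketch of the marker rules from one process's point of view (the class and the send() transport are assumptions, not from the slides):

```python
class SnapshotProcess:
    """One process of the Chandy-Lamport snapshot; channels assumed FIFO."""

    def __init__(self, pid, in_channels, out_channels):
        self.pid = pid
        self.recorded = False
        self.state = None
        self.chan_state = {c: [] for c in in_channels}   # recorded channel states
        self.marker_seen = {c: False for c in in_channels}
        self.out_channels = out_channels

    def record_and_send_markers(self, send):
        self.state = self.snapshot_local_state()
        self.recorded = True
        for c in self.out_channels:      # marker goes out before any later message
            send(c, "MARKER")

    def on_marker(self, channel, send):
        if not self.recorded:
            self.chan_state[channel] = []        # record channel state as empty
            self.record_and_send_markers(send)
        self.marker_seen[channel] = True         # stop recording on this channel

    def on_message(self, channel, msg):
        if self.recorded and not self.marker_seen[channel]:
            self.chan_state[channel].append(msg)  # in-transit message

    def snapshot_local_state(self):
        return {}   # placeholder for the application's local state
```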
Chandy-Lamport Algorithm
Initiation of marker can be done by any process, with its own unique
marker: <process id, sequence number>.
Several processes can initiate state recording by sending markers.
Concurrent sending of markers allowed.
One possible way to collect global state: all processes send the
recorded state information to the initiator of marker. Initiator process
can sum up the global state.
Chandy-Lamport Algorithm ...
Example:
[Figure: Pi, Pj, Pk each record their local state, send markers, and record channel states.]
Channel state example: M1 sent to Px at t1, M2 sent to Py at t2, ...
Cuts
Cuts: graphical representation of a global state.
Cut C = {c1, c2, .., cn}; ci: cut event at Si.
Consistent cut: every message received by a site before its cut event was sent before the cut event at the sender.
One can prove: A cut is a consistent cut iff no two cut
events are causally related, i.e., !(ci -> cj) and !(cj -> ci).
Time of a Cut
C = {c1, c2, .., cn} with vector time stamp VTci. Vector
time of the cut, VTc = sup(VTc1, VTc2, .., VTcn).
sup is the component-wise maximum, i.e., VTc[i] = max(VTc1[i], VTc2[i], .., VTcn[i]).
Now, a cut is consistent iff VTc = (VTc1[1], VTc2[2], ..,
VTcn[n]).
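In Python, the cut time and the consistency test can be sketched as follows (illustrative):

```python
def cut_time(vts):
    """VTc = sup(VTc1, .., VTcn): component-wise maximum."""
    return tuple(max(col) for col in zip(*vts))

def cut_is_consistent(vts):
    """Consistent iff VTc == (VTc1[1], VTc2[2], .., VTcn[n])."""
    return cut_time(vts) == tuple(vt[i] for i, vt in enumerate(vts))
```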
Termination Detection
Termination: completion of an algorithm's computation. (e.g.,) leader election, deadlock detection, deadlock resolution.
Use a controlling agent or a monitor process.
Initially, all processes are idle. Weight of controlling agent is 1 (0 for
others).
Start of computation: message from controller to a process. Weight: split
into half (0.5 each).
Repeat this: any time a process sends a computation message to another process, split the weight between the two processes (e.g., 0.25 each after the next split).
End of computation: process sends its weight to the controller. Add this
weight to that of controller’s. (Sending process’s weight becomes 0).
Rule: Sum of W always 1.
Termination: When weight of controller becomes 1 again.
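A small Python sketch of this weight-throwing scheme, using exact fractions to avoid floating-point loss (all names are illustrative):

```python
from fractions import Fraction

class Node:
    def __init__(self, weight=0):
        self.weight = Fraction(weight)

controller = Node(1)   # controlling agent starts with weight 1

def send_computation(sender, receiver):
    """Split the sender's weight; half travels on the computation message."""
    half = sender.weight / 2
    sender.weight -= half
    receiver.weight += half

def finish(proc):
    """An idle process returns its weight to the controller."""
    controller.weight += proc.weight
    proc.weight = Fraction(0)

def terminated():
    return controller.weight == 1

# e.g.: controller starts p1; p1 starts p2; both finish -> termination detected
p1, p2 = Node(), Node()
send_computation(controller, p1)   # controller 1/2, p1 1/2
send_computation(p1, p2)           # p1 1/4, p2 1/4
finish(p2); finish(p1)
assert terminated()
```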
Huang’s Algorithm
B. Prabhakaran 38
Huang’s Algorithm
1/4 P1 0.5 P1
P4 P5 0 P4 0 P5
1/8 1/16
B. Prabhakaran 39
Mutual Exclusion Algorithms
• Non-token based:
• A site/process can enter a critical section when an
assertion (condition) becomes true.
• Algorithm should ensure that the assertion will be true
in only one site/process.
• Token based:
• A unique token (a known, unique message) is shared
among cooperating sites/processes.
• Possessor of the token has access to critical section.
• Need to take care of conditions such as loss of token,
crash of token holder, possibility of multiple tokens, etc.
General System Model
At any instant, a site may have several requests for critical
section (CS), queued up, and serviced one at a time.
Site States: Requesting CS, executing CS, idle (neither
requesting nor executing CS).
Requesting CS: blocked until granted access, cannot
make additional requests for CS.
Executing CS: using the CS.
Idle: no CS-related activity at the site. In token-based approaches, an idle site can be holding the token.
Mutual Exclusion: Requirements
Freedom from deadlocks: two or more sites should not
endlessly wait on conditions/messages that never become
true/arrive.
Freedom from starvation: No indefinite waiting.
Fairness: Order of execution of CS follows the order of
the requests for CS. (equal priority).
Fault tolerance: recognize “faults”, reorganize, continue.
(e.g., loss of token).
Performance
Number of messages per CS invocation: should be
minimized.
Synchronization delay, i.e., time between the leaving of
CS by a site and the entry of CS by the next one: should
be minimized.
Response time: time interval between the transmission of a site's request messages and its exit from the CS.
System throughput, i.e., rate at which system executes
requests for CS: should be maximized.
If sd is synchronization delay, E the average CS execution
time: system throughput = 1 / (sd + E).
Performance metrics
[Figure: timeline of the performance metrics. The synchronization delay runs from the last site exiting the CS to the next site entering it. The response time runs from the CS request's messages being sent until the site exits the CS; E marks the CS execution time within it.]
Performance ...
Low and High Load:
Low load: No more than one request at a given point in time.
High load: Always a pending mutual exclusion request at a site.
Best and Worst Case:
Best Case (low loads): Round-trip message delay + Execution
time. 2T + E.
Worst case (high loads).
Message traffic: low at low loads, high at high loads.
Average performance: when load conditions fluctuate
widely.
Simple Solution
Control site: grants permission for CS execution.
A site sends REQUEST message to control site.
Controller grants access one by one.
Synchronization delay: 2T. A site releases the CS by sending a message to the controller, and the controller then sends permission to another site.
System throughput: 1/(2T + E). If synchronization delay
is reduced to T, throughput doubles.
Controller becomes a bottleneck, congestion can occur.
Non-token Based Algorithms
Notations:
Si: site i
Ri: Request set, containing the ids of all Sis from which
permission must be received before accessing CS.
Non-token based approaches use time stamps to order requests
for CS.
Smaller time stamps get priority over larger ones.
Lamport’s Algorithm
Ri = {S1, S2, …, Sn}, i.e., all sites.
Request queue: maintained at each Si. Ordered by time stamps.
Assumption: message delivered in FIFO.
Lamport’s Algorithm
Requesting CS:
Send REQUEST(tsi, i). (tsi, i): request time stamp. Place the REQUEST in request_queuei.
On receiving the REQUEST, Sj sends a time-stamped REPLY message to Si and places Si's request in request_queuej.
Executing CS:
Si has received a message with time stamp larger than (tsi,i) from all
other sites.
Si’s request is the top most one in request_queuei.
Releasing CS:
Exiting CS: send a time stamped RELEASE message to all sites in its
request set.
Receiving RELEASE message: Sj removes Si’s request from its queue.
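A Python sketch of one site's logic, assuming a FIFO send(site, message) transport (names and structure are illustrative, not the slides' notation):

```python
import heapq

class LamportMutex:
    """One site's view of Lamport's algorithm; channels assumed FIFO."""

    def __init__(self, site_id, n, send):
        self.id, self.n, self.send = site_id, n, send
        self.clock = 0
        self.queue = []      # heap of (timestamp, site id) request pairs
        self.last_ts = {}    # highest timestamp seen from each other site

    def others(self):
        return [s for s in range(self.n) if s != self.id]

    def request_cs(self):
        self.clock += 1
        self.my_ts = self.clock
        heapq.heappush(self.queue, (self.my_ts, self.id))
        for s in self.others():
            self.send(s, ("REQUEST", self.my_ts, self.id))

    def on_message(self, kind, ts, sender):
        self.clock = max(self.clock, ts) + 1
        self.last_ts[sender] = ts
        if kind == "REQUEST":
            heapq.heappush(self.queue, (ts, sender))
            self.clock += 1
            self.send(sender, ("REPLY", self.clock, self.id))
        elif kind == "RELEASE":
            self.queue = [(t, s) for (t, s) in self.queue if s != sender]
            heapq.heapify(self.queue)

    def can_enter_cs(self):
        # own request at the head, plus a later-stamped message from every other site
        return (self.queue and self.queue[0] == (self.my_ts, self.id) and
                all(self.last_ts.get(s, 0) > self.my_ts for s in self.others()))

    def release_cs(self):
        self.queue = [(t, s) for (t, s) in self.queue if s != self.id]
        heapq.heapify(self.queue)
        for s in self.others():
            self.clock += 1
            self.send(s, ("RELEASE", self.clock, self.id))
```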
Lamport’s Algorithm…
Performance.
3(N-1) messages per CS invocation. (N - 1) REQUEST, (N - 1)
REPLY, (N - 1) RELEASE messages.
Synchronization delay: T
Optimization
Suppress reply messages. (e.g.,) Sj receives a REQUEST
message from Si after sending its own REQUEST message with
time stamp higher than that of Si’s. Do NOT send REPLY
message.
Messages reduced to between 2(N-1) and 3(N-1).
Lamport’s Algorithm: Example
[Figure: Step 1: S1 broadcasts REQUEST (2,1); S2 broadcasts REQUEST (1,2).
Step 2: every site's request queue holds (1,2), (2,1); S2 enters the CS.]
Lamport’s: Example…
[Figure: Step 3: S2 leaves the CS and broadcasts RELEASE; queues still hold (1,2), (2,1).
Step 4: (1,2) is removed; every queue holds (2,1); S1 enters the CS.]
Ricart-Agrawala Algorithm
Requesting critical section
Si sends time stamped REQUEST message
Sj sends REPLY to Si, if
Sj is neither requesting nor executing the CS, or
Sj is requesting the CS and Si's time stamp is smaller than Sj's own request's time stamp.
Otherwise, the REPLY is deferred.
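The REPLY decision can be sketched in Python, with timestamps as (clock, site id) pairs compared lexicographically (illustrative):

```python
def should_reply(sj_state, sj_request_ts, si_request_ts):
    """Sj's decision on receiving Si's REQUEST.
    States: 'idle', 'requesting', 'executing'."""
    if sj_state == "idle":
        return True                            # neither requesting nor executing
    if sj_state == "requesting":
        return si_request_ts < sj_request_ts   # smaller timestamp wins
    return False                               # executing: defer

# deferred requests are replied to when Sj exits the CS
```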
Ricart-Agrawala: Performance
Performance:
2(N-1) messages per CS execution. (N-1) REQUEST + (N-1)
REPLY.
Synchronization delay: T.
Optimization:
When Si receives a REPLY message from Sj, Si is authorized to access the CS until Sj sends a REQUEST message and Si sends a REPLY.
Si can access the CS repeatedly until then.
A site requests permission from dynamically varying set of
sites: 0 to 2(N-1) messages.
Ricart-Agrawala: Example
[Figure: Step 1: S1 requests with time stamp (2,1); S2 requests with (1,2).
Step 2: S2 enters the CS, deferring its REPLY to the (2,1) request.]
Ricart-Agrawala: Example…
[Figure: Step 3: S2 leaves the CS and sends the deferred REPLY for (2,1); S1 enters the CS.]
Maekawa’s Algorithm
A site requests permission only from a subset of sites.
Request sets of sites Si & Sj: Ri, Rj such that Ri and Rj have at least one common site Sk. Sk mediates conflicts between Ri and Rj.
A site can send only one REPLY message at a time, i.e., a site can send a REPLY message only after receiving a RELEASE message for the previous REPLY message.
Request set rules:
Sets Ri and Rj have at least one common site.
Si is always in Ri.
Cardinality of Ri, i.e., the number of sites in Ri, is K.
Any site Si is contained in K of the Ris. With N = K(K - 1) + 1, K is approximately the square root of N.
Maekawa’s Algorithm ...
Requesting CS
Si sends REQUEST(i) to sites in Ri.
Sj sends REPLY to Si if
Sj has NOT sent a REPLY message to any site after it
received the last RELEASE message.
Otherwise, queue up Si’s request.
Request Subsets
Example k = 2; (N = 3).
R1 = {1, 2}; R3 = {1, 3}; R2 = {2, 3}
Example k = 3; N = 7.
R1 = {1, 2, 3}; R4 = {1, 4, 5}; R6 = {1, 6, 7};
R2 = {2, 4, 6}; R5 = {2, 5, 7}; R7 = {3, 4, 7};
R3 = {3, 5, 6}
Algorithm in Maekawa’s paper (uploaded in
Lecture Notes web page).
Maekawa’s Algorithm ...
Performance
Synchronization delay: 2T
Messages: 3 times square root of N (one each for REQUEST,
REPLY, RELEASE messages)
Deadlocks
Message deliveries are not ordered.
Assume Si, Sj, Sk concurrently request the CS.
Ri ∩ Rj = {Sij}, Rj ∩ Rk = {Sjk}, Rk ∩ Ri = {Ski}
Possible that: Sij is locked by Si (forcing Sj to wait at Sij), Sjk by Sj, and Ski by Sk, so the three sites wait on one another: deadlock.
Handling Deadlocks
A site yields to a request that has a smaller time stamp.
A site suspects a deadlock when it is locked by a request
with a higher time stamp (lower priority).
Deadlock handling messages:
FAILED: from Si to Sj -> Si has granted permission to higher
priority request.
INQUIRE: from Si to Sj -> Si would like to know whether Sj has succeeded in locking all the sites in Sj's request set.
YIELD: from Si to Sj -> Si is returning permission to Sj so that
Sj can yield to a higher priority request.
Handling Deadlocks
REQUEST(tsi,i) to Sj:
Sj is locked by Sk -> Sj sends FAILED to Si, if Si’s request has higher time
stamp.
Otherwise, Sj sends INQUIRE(j) to Sk.
INQUIRE(j) to Sk:
Sk sends a YIELD(k) to Sj if Sk has received a FAILED message from a site in Sk's request set, or if Sk sent a YIELD and has not received a new REPLY.
YIELD(k) to Sj:
Sj assumes it has been released by Sk, places Sk’s request in its queue
appropriately, sends a REPLY(j) to the top request in its queue.
Sites may exchange these messages even if there is no real
deadlock. Maximum number of messages per CS request: 5 times
square root of N.
Token-based Algorithms
Unique token circulates among the participating sites.
A site can enter CS if it has the token.
Token-based approaches use sequence numbers instead of
time stamps.
Request for a token contains a sequence number.
Sequence numbers of sites advance independently.
Correctness issue is trivial since only one token is present
-> only one site can enter CS.
Deadlock and starvation issues to be addressed.
Suzuki-Kasami Algorithm
If a site without a token needs to enter a CS, broadcast a REQUEST for
token message to all other sites.
Token: (a) a queue of requesting sites; (b) an array LN[1..N], where LN[j] is the sequence number of site Sj's most recently executed request.
Token holder sends token to requestor, if it is not inside CS. Otherwise,
sends after exiting CS.
Token holder can make multiple CS accesses.
Design issues:
Distinguishing outdated REQUEST messages.
Format: REQUEST(j,n) -> jth site making nth request.
Each site has RNi[1..N] -> RNi[j] is the largest sequence number of request
from j.
Determining which site has an outstanding token request.
If LN[j] = RNi[j] - 1, then Sj has an outstanding request.
Suzuki-Kasami Algorithm ...
Passing the token
After finishing CS
(assuming Si has token), LN[i] := RNi[i]
Token consists of Q and LN. Q is a queue of requesting sites.
Token holder checks if RNi[j] = LN[j] + 1. If so, place j in Q.
Send token to the site at head of Q.
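A Python sketch of the token-passing step, with RN and LN as lists and the token queue as a list of site ids (illustrative):

```python
def on_finish_cs(i, token_LN, token_Q, RN):
    """Suzuki-Kasami token passing after Si exits the CS.
    token_LN[j]: sequence number of Sj's most recently executed request;
    RN: Si's local RNi vector; token_Q: queue of requesting site ids."""
    token_LN[i] = RN[i]                     # Si's latest request is now served
    for j in range(len(RN)):
        if j != i and j not in token_Q and RN[j] == token_LN[j] + 1:
            token_Q.append(j)               # Sj has an outstanding request
    if token_Q:
        next_site = token_Q.pop(0)
        return next_site                    # send (token_Q, token_LN) to this site
    return None                             # no requests: keep the token, idle
```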
Performance
0 to N messages per CS invocation.
Synchronization delay is 0 (if the token holder repeats CS) or T.
Suzuki-Kasami: Example
Step 1: S1 has token, S3 is in queue
  Site  RN           Token LN     Token queue
  S1    (10, 15, 9)  (10, 15, 8)  <3>
  S2    (10, 16, 9)
  S3    (10, 15, 9)
Step 2: S3 gets token, S2 in queue
  Site  RN           Token LN     Token queue
  S1    (10, 16, 9)
  S2    (10, 16, 9)
  S3    (10, 16, 9)  (10, 15, 9)  <2>
Step 3: S2 gets token, queue empty
  Site  RN           Token LN     Token queue
  S1    (10, 16, 9)
  S2    (10, 16, 9)  (10, 16, 9)  <empty>
  S3    (10, 16, 9)
Singhal’s Heuristic Algorithm
Instead of broadcast: each site maintains information on other sites, guess the
sites likely to have the token.
Data Structures:
Si maintains SVi[1..M] and SNi[1..M] for storing information on other sites: state
and highest sequence number.
Token contains 2 arrays: TSV[1..M] and TSN[1..M].
States of a site
R : requesting CS
E : executing CS
H : Holding token, idle
N : None of the above
Initialization:
SVi[j] := N, for j = i .. M; SVi[j] := R, for j = 1 .. i-1; SNi[j] := 0, for j = 1 .. M. S1 (site 1) is in state H.
Token: TSV[j] := N & TSN[j] := 0, for j = 1 .. M.
Singhal’s Heuristic Algorithm …
Requesting CS
If Si has no token and requests CS:
SVi[i] := R. SNi[i] := SNi[i] + 1.
Send REQUEST(i,sn) to sites Sj for which SVi[j] = R. (sn: sequence
number, updated value of SNi[i]).
Receiving REQUEST(i,sn): if sn <= SNj[i], ignore. Otherwise, update
SNj[i] and do:
SVj[j] = N -> SVj[i] := R.
SVj[j] = R -> If SVj[i] != R, set it to R & send REQUEST(j,SNj[j]) to
Si. Else do nothing.
SVj[j] = E -> SVj[i] := R.
SVj[j] = H -> SVj[i] := R, TSV[i] := R, TSN[i] := sn, SVj[j] = N.
Send token to Si.
Executing CS: after getting token. Set SVi[i] := E.
Singhal’s Heuristic Algorithm …
Releasing CS
SVi[i] := N, TSV[i] := N. Then, do:
For other Sj: if (SNi[j] > TSN[j]), then {TSV[j] := SVi[j];
TSN[j] := SNi[j]}
else {SVi[j] := TSV[j]; SNi[j] := TSN[j]}
If SVi[j] = N, for all j, then set SVi[i] := H. Else send token to a
site Sj provided SVi[j] = R.
Fairness of algorithm will depend on choice of Si, since no
queue is maintained in token.
Arbitration rules to ensure fairness used.
Performance
Low to moderate loads: average of N/2 messages.
High loads: N messages (all sites request CS).
Synchronization delay: T.
Singhal: Example
[Figure: staircase pattern of the state vectors SVi for sites S1 .. Sn.
(a) Initial pattern: S1 = (H, N, N, N, .., N), S2 = (R, N, N, N, .., N), S3 = (R, R, N, N, .., N), S4 = (R, R, R, N, .., N), .., Sn = (R, R, R, R, .., N).
(b) Pattern after S3 gets the token from S1.]
Each row in the matrix has an increasing number of Rs: site Si has i - 1 entries set to R, which identifies the staircase pattern. The order of occurrence of the Rs in a row does not matter.
Singhal: Example…
• Assume there are 3 sites in the system. Initially:
Site 1: SV1[1] = H, SV1[2] = N, SV1[3] = N. SN1[1], SN1[2], SN1[3] are 0.
Site 2: SV2[1] = R, SV2[2] = N, SV2[3] = N. SNs are 0.
Site 3: SV3[1] = R, SV3[2] = R, SV3[3] = N. SNs are 0.
Token: TSVs are N. TSNs are 0.
• Assume site 2 is requesting token.
S2 sets SV2[2] = R, SN2[2] = 1.
S2 sends REQUEST(2,1) to S1 (since only S1 is set to R in SV2)
• S1 receives the REQUEST. Accepts the REQUEST since SN1[2] is smaller than
the message sequence number.
Since SV1[1] is H: SV1[2] = R, TSV[2] = R, TSN[2] = 1, SV1[1] = N.
Send token to S2
• S2 receives the token. SV2[2] = E. After exiting the CS, SV2[2] = TSV[2] = N.
Updates SN, SV, TSN, TSV. Since nobody is REQUESTing, SV2[2] = H.
• Assume S3 makes a REQUEST now. It will be sent to both S1 and S2. Only S2
responds since only SV2[2] is H (SV1[1] is N now).
Raymond’s Algorithm
Sites are arranged in a logical directed tree. Root: token holder. Edges:
directed towards root.
Every site has a variable holder that points to an immediate neighbor node on the directed path towards the root. (The root's holder points to itself.)
Requesting CS
If Si does not hold token and request CS, sends REQUEST upwards
provided its request_q is empty. It then adds its request to request_q.
Non-empty request_q -> REQUEST message for top entry in q (if not
done before).
Site on path to root receiving REQUEST -> propagate it up, if its
request_q is empty. Add request to request_q.
Root on receiving REQUEST -> send token to the site that forwarded the
message. Set holder to that forwarding site.
Any Si receiving token -> delete top entry from request_q, send token to
that site, set holder to point to it. If request_q is non-empty now, send
REQUEST message to the holder site.
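A Python sketch of one node's behavior, assuming a send(node, message) transport (names illustrative):

```python
class RaymondNode:
    """One node in Raymond's tree algorithm (sketch)."""

    def __init__(self, nid, holder, send):
        self.id, self.holder, self.send = nid, holder, send
        self.request_q = []

    def request_cs(self):
        # if holder == self.id, the site already holds the token and may enter
        if self.holder != self.id:
            if not self.request_q:
                self.send(self.holder, ("REQUEST", self.id))
            self.request_q.append(self.id)

    def on_request(self, frm):
        if self.holder == self.id and not self.request_q:
            self.holder = frm                  # root passes the token down
            self.send(frm, ("TOKEN",))
        else:
            if not self.request_q:
                self.send(self.holder, ("REQUEST", self.id))
            self.request_q.append(frm)

    def on_token(self):
        top = self.request_q.pop(0)
        if top == self.id:
            self.holder = self.id              # enter the CS
        else:
            self.holder = top
            self.send(top, ("TOKEN",))
        if self.request_q and self.holder != self.id:
            self.send(self.holder, ("REQUEST", self.id))
```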
Raymond’s Algorithm …
Executing CS: a site enters the CS when it receives the token and is itself at the top of its request_q; it deletes the top entry and enters the CS.
Releasing CS
If request_q is non-empty, delete top entry from q, send token to
that site, set holder to that site.
If request_q is non-empty now, send REQUEST message to the
holder site.
Performance
Average messages: O(log N) as average distance between 2 nodes
in the tree is O(log N).
Synchronization delay: (T log N) / 2, as average distance between
2 sites to successively execute CS is (log N) / 2.
Greedy approach: Intermediate site getting the token may enter CS
instead of forwarding it down. Affects fairness, may cause
starvation.
Raymond’s Algorithm: Example
[Figure: Step 1: tree of sites S1–S7 with token holder S1 at the root; a request originating at S4 propagates upward.
Step 2: the token moves down to S2.]
Raymond’s Algm.: Example…
[Figure: Step 3: the token reaches S4, which becomes the new token holder.]
Comparison
Algorithm       Resp. time (ll)   Sync. delay    Messages (ll)   Messages (hl)
Suzuki-Kasami   2T + E            T              N               N
Singhal         2T + E            T              N/2             N
Raymond         T log(N) + E      T log(N) / 2   log(N)          4
(ll: low load; hl: high load)
Distributed Deadlock Detection
• Assumptions:
• System has only reusable resources
• Only exclusive access to resources
• Only one copy of each resource
• States of a process: running or blocked
• Running state: process has all the resources
• Blocked state: waiting on one or more resource
Deadlocks
• Resource Deadlocks
• A process needs multiple resources for an activity.
• Deadlock occurs if each process in a set requests resources held by another process in the same set, and it must receive all the requested resources to move further.
• Communication Deadlocks
• Processes wait to communicate with other processes in a set.
• Each process in the set is waiting on another process’s
message, and no process in the set initiates a message
until it receives a message for which it is waiting.
Graph Models
Nodes of the graph are processes. Edges of the graph represent pending requests for, or assignments of, resources.
Wait-for Graphs (WFG): P1 -> P2 implies P1 is waiting
for a resource from P2.
Transaction-wait-for Graphs (TWF): WFG in databases.
Deadlock: directed cycle in the graph.
Cycle example: P1 -> P2 -> P1.
Graph Models
Wait-for Graphs (WFG): P1 -> P2 implies P1 is waiting
for a resource from P2.
[Figure: P1 waits for resource R1 held by P2; P2 waits for R2 held by P1.]
AND, OR Models
AND Model
A process/transaction can simultaneously request for multiple
resources.
Remains blocked until it is granted all of the requested
resources.
OR Model
A process/transaction can simultaneously request for multiple
resources.
Remains blocked till any one of the requested resource is
granted.
Sufficient Condition
Deadlock?
[Figure: a WFG over processes P1–P6; does it represent a deadlock?]
AND, OR Models
AND Model
Presence of a cycle.
[Figure: a WFG containing a cycle, which suffices for deadlock in the AND model.]
AND, OR Models
OR Model
Presence of a knot.
Knot: Subset of a graph such that starting from any
node in the subset, it is impossible to leave the knot
by following the edges of the graph.
[Figure: a WFG over processes P1–P6 containing a knot.]
Deadlock Handling Strategies
Deadlock Prevention: difficult
Deadlock Avoidance: before allocation, check for
possible deadlocks.
Difficult as it needs global state info in each site (that handles
resources).
Deadlock Detection: Find cycles. Focus of discussion.
Deadlock detection algorithms must satisfy 2 conditions:
No undetected deadlocks.
No false deadlocks.
Distributed Deadlocks
Centralized Control
A control site constructs wait-for graphs (WFGs) and checks
for directed cycles.
WFG can be maintained continuously (or) built on-demand by
requesting WFGs from individual sites.
Distributed Control
WFG is spread over different sites. Any site can initiate the deadlock detection process.
Hierarchical Control
Sites are arranged in a hierarchy.
A site checks for cycles only in descendents.
Centralized Algorithms
Ho-Ramamoorthy 2-phase Algorithm
Each site maintains a status table of all processes initiated at that
site: includes all resources locked & all resources being waited on.
Controller requests (periodically) the status table from each site.
Controller then constructs WFG from these tables, searches for
cycle(s).
If no cycles, no deadlocks.
Otherwise, (cycle exists): Request for state tables again.
Construct WFG based only on common transactions in the 2 tables.
If the same cycle is detected again, system is in deadlock.
Later proved: cycles in 2 consecutive reports need not result in a
deadlock. Hence, this algorithm detects false deadlocks.
Centralized Algorithms...
Ho-Ramamoorthy 1-phase Algorithm
Each site maintains 2 status tables: resource status table and
process status table.
Resource table: transactions that have locked or are waiting for
resources.
Process table: resources locked by or waited on by transactions.
Controller periodically collects these tables from each site.
Constructs a WFG from transactions common to both the
tables.
No cycle, no deadlocks.
A cycle means a deadlock.
Distributed Algorithms
Path-pushing: resource dependency information
disseminated through designated paths (in the graph).
Edge-chasing: special messages or probes circulated
along edges of WFG. Deadlock exists if the probe is
received back by the initiator.
Diffusion computation: queries on status sent to process
in WFG.
Global state detection: get a snapshot of the distributed
system. Not discussed further in class.
Edge-Chasing Algorithm
Chandy-Misra-Haas’s Algorithm:
A probe(i, j, k) is used by a deadlock detection process Pi. This
probe is sent by the home site of Pj to Pk.
This probe message is circulated via the edges of the graph.
Probe returning to Pi implies deadlock detection.
Terms used:
Pj is dependent on Pk if a sequence of blocked processes Pj, Pi1, .., Pim, Pk exists, in which each process waits on the next.
Pj is locally dependent on Pk, if above condition + Pj,Pk on
same site.
Each process maintains an array dependenti: dependenti(j) is
true if Pi knows that Pj is dependent on it. (initially set to
false for all i & j).
Chandy-Misra-Haas’s Algorithm
Sending the probe:
if Pi is locally dependent on itself then deadlock.
else for all Pj and Pk such that
(a) Pi is locally dependent upon Pj, and
(b) Pj is waiting on Pk, and
(c ) Pj and Pk are on different sites, send probe(i,j,k) to the home
site of Pk.
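A Python sketch of how a probe might be handled at the receiving process (the state tables are assumptions for illustration, not the slides' notation):

```python
def on_probe(probe, blocked, dependent, waits_for, home_site, send):
    """Pk's handling of probe (i, j, k).
    blocked[k]: is Pk blocked; dependent[k][i]: does Pk know Pi depends on it;
    waits_for[k]: processes Pk waits on; home_site[p]: site of process p."""
    i, j, k = probe
    if k == i:
        print(f"deadlock detected by P{i}")   # probe returned to its initiator
        return
    if not blocked[k] or dependent[k][i]:
        return                                # not blocked, or probe already seen
    dependent[k][i] = True
    for m in waits_for[k]:                    # propagate along outgoing WFG edges
        send(home_site[m], (i, k, m))
```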
Performance:
For a deadlock that spans m processes over n sites, m(n-1)/2 messages
are needed.
Size of the message 3 words.
Delay in deadlock detection O(n).
C-M-H Algorithm: Example
[Figure: WFG over processes P0–P7 spanning several sites; probe(1,3,4) and probe(1,7,1) travel along the edges, and probe(1,7,1) returning to P1 signals a deadlock.]
Diffusion-based Algorithm
Initiation by a blocked process Pi:
send query(i,i,j) to all processes Pj in the dependent set DSi of Pi;
num(i) := |DSi|; waiti(i) := true;
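The slides show only the initiation step; the sketch below fills in the usual query/reply handling of the diffusion computation for the OR model, as a hedged reconstruction under assumed state tables (all names illustrative, not from the slides):

```python
def on_query(i, j, k, state, send):
    """Process Pk receives query(i, j, k) for initiator Pi."""
    p = state[k]
    if not p["blocked"]:
        return
    if not p["waiting"].get(i):              # engaging query: first seen for Pi
        p["waiting"][i] = True
        p["engager"][i] = j
        p["num"][i] = len(p["deps"])         # expect one reply per dependent
        for m in p["deps"]:
            send("query", (i, k, m))
    else:                                    # non-engaging query: reply at once
        send("reply", (i, k, j))

def on_reply(i, j, k, state, send):
    """Process Pk receives reply(i, j, k)."""
    p = state[k]
    if not p["waiting"].get(i):
        return
    p["num"][i] -= 1
    if p["num"][i] == 0:
        if k == i:
            print(f"P{i}: deadlock (all queries answered)")
        else:
            send("reply", (i, k, p["engager"][i]))
```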
Diffusion Algorithm: Example
[Figure: diffusion computation over P1–P7: queries query(1,3,4) and query(1,7,1) propagate along WFG edges; replies reply(1,6,2) and reply(1,1,7) flow back toward P1.]
Engaging Query
How to distinguish an engaging query?
query(i,j,k) from the initiator contains a unique sequence
number for the query apart from the tuple (i,j,k).
This sequence number is used to identify subsequent queries.
(e.g.,) when query(1,7,1) is received by P1 from P7, P1 checks
the sequence number along with the tuple.
P1 understands that the query was initiated by itself and it is not
an engaging query.
Hence, P1 sends a reply back to P7 instead of forwarding the
query on all its outgoing links.
AND, OR Models
AND Model
A process/transaction can simultaneously request for multiple
resources.
Remains blocked until it is granted all of the requested resources.
Edge-chasing algorithm can be applied here.
OR Model
A process/transaction can simultaneously request for multiple
resources.
Remains blocked till any one of the requested resource is granted.
Diffusion based algorithm can be applied here.
Hierarchical Deadlock Detection
• Follows Ho-Ramamoorthy’s 1-phase algorithm. More than 1 control site
organized in hierarchical manner.
• Each control site applies 1-phase algorithm to detect (intracluster) deadlocks.
• Central site collects info from control sites, applies 1-phase algorithm to detect intercluster deadlocks.
[Figure: a central site above several control sites arranged hierarchically.]
Persistence & Resolution
Deadlock persistence:
Average time a deadlock exists before it is resolved.
Implication of persistence:
Resources unavailable for this period: affects utilization
Processes wait for this period unproductively: affects response time.
Deadlock resolution:
Aborting at least one process/request involved in the deadlock.
Efficient resolution of deadlock requires knowledge of all processes
and resources.
If every process detects a deadlock and tries to resolve it
independently -> highly inefficient ! Several processes might be
aborted.
Deadlock Resolution
Priorities for processes/transactions can be useful for
resolution.
Consider priorities introduced in Obermarck’s algorithm.
Highest priority process initiates and detects deadlock (initiations
by lower priority ones are suppressed).
When deadlock is detected, lowest priority process(es) can be
aborted to resolve the deadlock.
After identifying the processes/requests to be aborted,
All resources held by the victims must be released. State of
released resources restored to previous states. Released resources
granted to deadlocked processes.
All deadlock detection information concerning the victims must be
removed at all the sites.
Distributed File System
File system spread over multiple, autonomous computers.
A distributed file system should provide:
Network transparency: hide the details of where a file
is located.
High availability: ease of accessibility irrespective of
the physical location of the file.
This objective is difficult to achieve because the distributed
file system is vulnerable to problems in underlying networks
as well as crashes of systems that are the “file sources”.
Replication / mirroring can be used to alleviate the above
problem.
However, replication/mirroring introduces additional issues
such as consistency.
DFS: Architecture
In general, files in a DFS can be located in “any” system.
We call the “source(s)” of files to be servers and those
accessing them to be clients.
Potentially, a server for a file can become a client for
another file.
However, most distributed systems distinguish between
clients and servers in more strict way:
Clients simply access files and do not have/share local files.
Even if clients have disks, they (disks) are used for swapping,
caching, loading the OS, etc.
Servers are the actual sources of files.
In most cases, servers are more powerful machines (in terms of
CPU, physical memory, disk bandwidth, ..)
DFS: Architecture …
[Figure: several servers and clients connected by a computer network.]
DFS Data Access
[Figure: data access flow. A request to access data first checks the client cache; if the data is present, it is returned to the client. If not, the local disk (if any) is checked; on a further miss the request goes to the server, which checks the server cache and, if needed, issues a disk read, loads the server cache, and returns the data, which is then loaded into the client cache.]
Name Space Hierarchy
[Figure: name space hierarchy. Server X holds the root (/) with entries a, b, c; mount points lead to Server Y (directories d, e, f) and Server Z (directories g, h, i).]
Caching
Performance of distributed file system, in terms of response
time, depends on the ability to “get” the files to the user.
When files are in different servers, caching might be needed
to improve the response time.
A copy of data (in files) is brought to the client (when
referenced). Subsequent data accesses are made on the client
cache.
Client cache can be on disk or main memory.
Data cached may include future blocks that may be
referenced too.
Caching implies DFS needs to guarantee consistency of data.
Hints
Hints can be used when cached data need not be completely accurate.
Example: Mapping of the name of a file/directory to the
actual physical device. The address/name of device can be
stored as a hint.
If this address fails to access the requested file, the cached
data can be purged.
The file server can refer to a name server, determine the
actual location of file/directory, and update the cache.
In hints, a cache is neither updated nor invalidated when a
change occurs to the content.
Design Issues
Naming: Locating the file/directory in a DFS based on
name.
Location of cache: disk, main memory, both.
Writing policy: Updating original data source when cache
content gets modified.
Cache consistency: Modifying cache when data source
gets modified.
Availability: More copies of files/resources.
Scalability: Ability to handle more clients/users.
Semantics: Meaning of different operations (read, write,
…)
Naming
Name space: (e.g.,) /home/students/jack, /home/staff/jill.
Name space is a collection of names.
Location transparency: file names do not indicate their
physical locations.
Name resolution: mapping name space to an
object/device/file/directory.
Naming approaches:
Simple Concatenation: add hostname to file names.
Guarantees unique names.
Naming: Approaches ...
Naming approaches (continued):
Mounting: mount remote directories to local ones. Location
transparent after mounting. (followed in Sun NFS).
Example: /students is mounted at /home.
Naming: Context
Context: identifying the name space within which name
resolution is to be done.
Example: context using ~ (tilde).
~jill/t: /home/staff/jill/t
~john/t: /home/students/john/t
~name: represents the directory structure associated with a
person or a project.
Whenever file “t” is accessed, it is interpreted with reference
to ~’s environment.
~ helps when different clients mount in different ways while still sharing the same set of users and their home directories.
(e.g.,) ~john may be mapped to /home/students/john in client
1 and to /usr/students/john in client 2.
Name Resolution
Done by name servers that map file names to actual files.
Centralized name server: send names to the server and get
the path of servers+devices that lead to the requested file.
Name server becomes a bottleneck.
Distributed name server: (e.g.,) consider access to a file
/a/b/c/d/e
Local name server identifies the remote server that handles the
part /b/c/d/e
This procedure may be recursively done till ../e is resolved.
Caching
In main memory:
Faster than disks.
Diskless workstations can also cache.
Server-cache is in main memory -> same design can be used in
clients also.
Disadvantage: clients need main memory for virtual memory
management too.
In disks:
Large files can be cached.
Virtual memory management is straight forward.
After caching the necessary files, the client can get
disconnected from network (if needed, for instance, to help its
mobility).
Writing Policy
When should a modified cache content be transferred to
the server?
Write-through policy:
Immediate writing at server when cache content is modified.
Advantage: reliability, crash of cache (client) does not mean loss
of data.
Disadvantage: Several writes for each small change.
Delayed writing policy:
Write at the server, after a delay.
Advantage: small/frequent changes do not increase network
traffic.
Disadvantage: less reliable, susceptible to client crashes.
Write at the time of file closing.
Cache Consistency
When should a modified source content be transferred to the
cache?
Server-initiated policy:
Server cache manager informs client cache managers that can then
retrieve the data.
Client-initiated policy:
Client cache manager checks the freshness of data before delivering to
users. Overhead for every data access.
Concurrent-write sharing policy:
Multiple clients open the file, at least one client is writing.
File server asks other clients to purge/remove the cached data for the
file, to maintain consistency.
Sequential-write sharing policy: a client opens a file that was
recently closed after writing.
Cache Consistency ...
Sequential-write sharing policy: a client opens a file that
was recently closed after writing.
This client may have outdated cache blocks of the file (since the
other client might have modified the file contents).
Use time stamps for both the cache and the files. Compare the time stamps to decide whether the cached blocks are up to date.
Availability
Intention: overcome the failure of servers or network
links.
Solution: replication, i.e., maintain copies of files at
different servers.
Issues:
Maintaining consistency
Detecting inconsistencies, if they happen despite best efforts.
Possible reasons for such inconsistencies:
Replica is not updated due to a server failure or a broken
network link.
Inconsistency problems and their recovery may reduce the
benefit of replication.
Availability: Replication
Unit of replication: is mostly a file.
Replicas of a file in a directory may be handled by different
servers, requiring extra name resolutions to locate the replicas.
Replication unit: group of files:
Advantage: process of name resolution, etc., to locate replicas
can be done for a set of files and not for individual files.
Disadvantage: wasteful of disk space if only a few of this group of files are needed by users often.
Replica Management
Two-phase commit protocols can be used to update all
replicas.
Other schemes:
Weighted votes:
A certain number of votes r or w is to be obtained before
reading or writing.
Current synchronization site (CSS):
Designate a process/site to control the modifications.
Scalability
Ease of adding more servers and clients with respect to the
problems / design issues discussed before such as caching,
replication management, etc.
Server-initiated cache invalidation scales up better.
Using the clients' caches:
A server serves only X clients.
New clients (after the first X) are informed of the X clients from
whom they can get the data (sort of chaining/hierarchy).
Cache misses & invalidations are propagated up and down this
hierarchy, i.e., each node serves as a mini-file server for its
children.
Structure of a server:
I/O operations through threads (light weight processes) can help in
handling more clients.
Semantics
What is the effect / meaning of an operation?
(e.g.,) read returns the data due to latest write operation.
Guaranteeing the above semantics in the presence of
caching can be difficult.
We saw techniques for these under caching.
Case Study: Sun NFS
Major goal: keep the distributed file system independent of
underlying hardware and operating system.
NFS (Network File System): uses the Remote Procedure
Call (RPC) for remote file operations.
Virtual file system (VFS) interface: provides uniform,
virtual file operations that are mapped to the actual file
system. (e.g.,) VFS can be mapped to DOS, so NFS can
work with PCs.
VFS uses a structure called vnode (virtual node) that is
unique in a NFS.
Each vnode has a mount table that provides a pointer to its
parent file system and to the system over which it is
mounted.
Sun NFS...
A vnode can be a mount point.
Using mount tables, VFS interface can distinguish
between local and remote file systems.
Requests to remote files are routed to the NFS by the VFS
interface.
RPCs are used to reach remote VFS interface.
Remote VFS invokes appropriate local file operation.
Sun NFS Architecture
[Figure: NFS architecture. The client's kernel exposes an OS interface and communicates over the network with the server.]
NFS: Naming & Location
Each client can configure its file system independent of
others. i.e., different clients can see different name spaces.
Name resolution example:
Look up for a/b/c. a corresponds to vnode1 (assume).
Look up on vnode1/b returns vnode2 that might say the object is
on server X.
Look up on vnode2/c is sent to X. X returns a file handle (if the
file exists, permission matches, etc).
File handle is used for subsequent file operations.
Name resolution in NFS is an iterative process (slow).
Name space information is not maintained at each server as
the servers in NFS are stateless (to be discussed later).
NFS: Caching
NFS Client Cache:
File blocks: cached on demand.
Employs read-ahead. Large block sizes (8 Kbytes) are used for data transfer, and the transferred blocks are cached.
Cached blocks are valid for a certain period, after which validation is required.
NFS: Caching
NFS Client Cache: ....
Attributes of files & directories:
Attribute inquiries form 90% of calls made to servers.
NFS: Stateless Server
NFS servers are stateless to help crash recovery.
Stateless: no record of past requests (e.g., whether file is open,
position of file pointer, etc.,).
Client requests contain all the needed information. No
response, client simply re-sends the request.
After a crash, a stateless server simply restarts. No need to:
Restore previous transaction records.
Update clients or negotiate with clients on file status.
Disadvantages:
Client message sizes are larger.
Server cache management difficult since server has no idea on which
files have been opened/closed.
Server can provide little information for file sharing.
Un/mounting in NFS
Mounting of files in Unix is done by using a mount table
stored in a file: /etc/mnttab.
mnttab is read by programs using procedures such as
getmntent.
mount command adds an entry in mnttab, i.e., every time a
file system is mounted in the system.
umount command removes an entry in mnttab, i.e., every
time a file system is unmounted from the system.
Un/Mounting
First entry in mnttab: file system that was mounted first.
Usually, file systems get mounted at boot time.
Mount: the term presumably derives from mounting tapes onto systems.
Each entry is a line of fields separated by spaces in the form:
<special> <mount_point> <fstype> <options> <time>
<special>: The name of the resource to be mounted.
<mount_point> : pathname of the directory on which the filesystem is
mounted.
<fstype> : file system type of the mounted file system.
<options> : mount options.
<time> : time at which the file system was mounted.
Entries for <special>: path-name of a block-special device (e.g.,
/dev/fd0), the name of a remote filesystem (casa:/export/home, i.e.,
host:pathname), or the name of a swap file.
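For instance, an NFS entry in mnttab might look like this (hypothetical values):
  casa:/export/home /home nfs rw,suid 865923337
Here <special> is casa:/export/home, /home is the mount point, nfs the fstype, rw,suid the options, and 865923337 the mount time (in seconds since the epoch).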
Sharing Filesystems
In SunOS, share command is used to specify the file systems that can be
mounted by other systems.
(e.g.), share [ -F FSType ] [ -o specific_options ] [-d
description ] [ pathname ]
Share command makes a resource available to remote system, through a
file system of FSType.
<specific_options> : control access of the shared resource.
rw: pathname is shared read/write to all clients. This is also the default behavior.
rw=client[:client]...: pathname is shared read/write only to the listed clients. No other systems can access pathname.
ro: pathname is shared read-only to all clients.
ro=client[:client]...: pathname is shared read-only only to the listed clients. No other systems can access pathname.
Sharing Filesystems…
<-d description>: -d flag may be used to provide a description of the
resource being shared.
Example : To share the /disk file system read-only at boot time.
share -F nfs -o ro /disk
share -F nfs -o rw=usera:userb /somefs
Multiple share commands on same file system? : Last command
supersedes.
Try:
/etc/dfs/dfstab: list of share commands to be executed at boot time
/etc/dfs/fstypes: list of file system types, NFS by default
/etc/dfs/sharetab: system record of shared file systems.
Automounting
mount a remote file system only when it is accessed, perhaps for a
guessed duration of time.
automount utility: installs autofs mount points and associates an
automount map with each mount point.
autofs file system monitors attempts to access directories within it and
notifies the automountd daemon.
automountd uses the map to locate a file system. Then mounts at the
point of reference within the autofs file system.
A map can be assigned to an autofs mount using an entry in the
/etc/auto_master map or a direct map.
If a file system is not accessed within an appropriate interval (10 minutes by default), the automountd daemon unmounts the file system.
Cluster File System
System Model: a set of storage devices that can be
accessed by a set of workstations.
[Figure: systems 1 to n accessing a shared set of storage devices.]
Storage Virtualization
Storage virtualization means a logical representation of the physical resources: storage devices & workstations.
Virtualization specifies details such as which devices are
meant for which host, how they can be shared, etc.
Possible places for virtualization: (each choice has its own
advantages and disadvantages)
Workstations or hosts: volume managers (software) are run on the hosts.
Storage Virtualization...
Possible places for virtualization (continued):
In the storage subsystem: associated with large-scale RAID subsystems.
Appliance example: NAS (Network Attached Storage).
Veritas Volume Manager
Works on both Unix and Windows
Builds a diskgroup spanning multiple devices.
Dynamic diskgroups management
Striping of data on multiple RAIDs.
Striping distributes data on multiple disks and hence
increases the disk bandwidth for retrieval. Suitable for
multimedia data.
Cluster Volume Manager:
Allows a volume to be simultaneously mounted for use across
multiple servers for both reads and writes.
Veritas Cluster Server
Cluster server handles up to 32 systems.
It monitors, controls, and restarts applications in response to a variety of events.
(e.g.,) application A1 may be started on system n if system 1 fails. Disk group D1 will be automatically reassigned to system n.
(e.g.,) disk group D2 may be assigned to system 1 if D1 fails, and application A1 will continue.
[Figure: systems S1 to Sn with disk groups D1 and D2.]
Service Groups
A set of resources working together to provide application
services to clients.
Service group example:
Disk groups having data
Volume built using disk group
File system (directories) using the volume
Servers/systems providing the application
Application program + libraries
Types of Service Groups:
Failover Groups: runs on 1 system in a cluster at a time. Used for
applications that are not designed to maintain data consistency on
multiple copies.
Cluster server monitors the heart beat of the system. If it fails,
the backup is brought on-line.
Service Groups...
Types of Service Groups...:
Parallel groups: run concurrently on more than 1 system.
Time-to-recovery:
On a failure, an application service is moved to another server in
the cluster.
Disk groups are de-imported from the crashed server and
imported by the back-up server.
Volume manager helps to manage the disk group ownership and
accelerate recovery process of the cluster.
New ownership properties are broadcast to the cluster to ensure
data security.
Time-to-recovery: the time taken to bring the back-up online.
Disaster Tolerance
More than 1 cluster connected by very high speed networks
over a wide area network.
Cluster 1 and 2 geographically distributed.
[Figure: Cluster 1 and Cluster 2, geographically distributed, connected over a wide-area network.]
Veritas Volume Replicator
Redundant copy of application in another cluster must be
kept up-to-date.
Volume Replicator allows a disk group to be replicated at 1
or more remote clusters.
Initialization of replication: entire disk group is replicated.
Runtime: only modifications to data are communicated.
Conserves network bandwidth.
Disk groups at the remote cluster are not usually active.
Identical instance of application is run on the remote cluster
in idle mode.
Disaster is identified by volume replicator using heart beats.
Puts remote cluster on-line for the applications.
Time-to-recovery: less than 1 minute.
Distributed Shared Memory
DSM provides a virtual address space that is shared among
all nodes in the distributed system.
Programs access DSM just as they do locally.
An object/data can be owned by a node in DSM. Initial
owner can be the creator of the object/data.
Ownership can change when data moves to other nodes.
A process accessing a shared object gets in touch with a
mapping manager who maps the shared memory address to
the physical memory.
Mapping manager: a layer of software, perhaps bundled
with the OS or as a runtime library routine.
Distributed Shared Memory...
[Figure: nodes 1 to n sharing a common shared-memory address space.]
DSM Advantages
Parallel algorithms can be written in a transparent manner
using DSM. Using message passing (e.g., send, receive),
the parallel programs might become even more complex.
Difficult to pass complex data structures with message
passing primitives.
Entire block/page of memory along with the reference
data /object can be moved. This can help in easier
referencing of associated data.
DSM cheaper to build compared to tightly coupled
multiprocessor systems.
DSM Advantages…
Fast processors and high speed networks can help in
realizing large sized DSMs.
Programs using large DSMs may not need as many disk swaps as
in the case of local memory usage.
This can offset the overhead due to communication delay in
DSMs.
Tightly coupled multiprocessor systems access main
memory via a common bus. So the number of processors
limited to a few tens. No such restriction in DSM systems.
Programs that work on multiprocessor systems can be
ported or directly work on DSM systems.
DSM Algorithms
Issues:
Keeping track of remote data locations
Overcoming/reducing communication delays and protocol
overheads when accessing remote data.
Making shared data concurrently available to improve
performance.
Types of algorithms:
Central-server
Data migration
Read-replication
Full-replication
Central Server Algorithm
[Figure: central-server algorithm. Clients send data access requests to a central server that holds the data.]
Migration Algorithm
[Figure: migration algorithm. Node i sends a data access request to node j, and the data migrates from node j to node i.]
Migration Algorithm …
Migration algorithm can be combined with virtual memory.
(e.g.,) if a page fault occurs, check memory map table.
If map table points to a remote page, migrate the page
before mapping it to the requesting process’s address space.
Several processes can share a page at a node.
Locating remote page:
Use a server that tracks the page locations.
Use hints maintained at nodes. Hints can direct the search for a
page toward the node holding the page.
Broadcast a query to locate a page.
Read-replication Algorithm
• Extend migration algorithm: replicate data at multiple nodes for read
access.
• Write operation:
• invalidate all copies of shared data at various nodes.
• (or) update with modified value
[Figure: write operation in the read-replication algorithm. Node i sends a data access request to node j; the data is replicated at node i, and the other copies are invalidated.]
• Read cost low, write cost higher.
Full-replication Algorithm
• Extension of read-replication algorithm: allows multiple sites to have both
read and write access to shared data blocks.
• One mechanism for maintaining consistency: gap-free sequencer.
[Figure: write operation in the full-replication algorithm; clients send write requests to the sequencer, which multicasts sequenced updates to all copies.]
B. Prabhakaran 10
Full-replication Sequencer
Nodes modifying data send request (containing
modifications) to sequencer.
Sequencer assigns a sequence number to the request and
multicasts it to all nodes that have a copy of the shared data.
Receiving nodes process requests in order of sequence
numbers.
Gap in sequence numbers? : modification requests might be
missing.
Missing requests will be reported to sequencer for
retransmission.
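A minimal sketch of the gap-detection logic at a replica, assuming the sequencer stamps updates with consecutive numbers starting at 0 (names are illustrative):

# Sequencer sketch: updates are applied strictly in sequence-number order;
# a gap means a modification request is missing and must be re-requested.
class Replica:
    def __init__(self):
        self.next_seq = 0
        self.pending = {}                  # out-of-order requests, keyed by seq

    def receive(self, seq, update):
        self.pending[seq] = update
        while self.next_seq in self.pending:        # apply the in-order prefix
            self.apply(self.pending.pop(self.next_seq))
            self.next_seq += 1
        missing = [s for s in range(self.next_seq, max(self.pending, default=-1))
                   if s not in self.pending]
        if missing:
            print("gap detected, ask sequencer to retransmit", missing)

    def apply(self, update):
        print("applied", update)

r = Replica()
r.receive(0, "w0")
r.receive(2, "w2")    # gap: seq 1 missing, retransmission requested
r.receive(1, "w1")    # fills the gap; w1 and w2 are both applied in order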
B. Prabhakaran 11
Memory Coherence
Coherence: value returned by a read operation is the one
expected by a programmer (e.g., the value of the latest write
operation).
Strict Consistency: a read returns the most recently written
value.
Requires total ordering of requests which implies significant
overhead for mechanisms such as synchronization.
Sometimes strict consistency may not be needed.
Sequential Consistency: result of any execution of the
operations of all processors is the same as if they were
executed in a sequential order + the operations of each
processor appear in this sequence, in the order specified by
the program.
B. Prabhakaran 12
Memory Coherence…
General Consistency: all copies of a memory location have
the same data after all writes of every processor are over.
Processor Consistency: Writes issued by a processor
observed in the same order in which they were issued.
(ordering among any 2 processors may be different).
Weak Consistency: Synchronization operations are
guaranteed to be sequentially consistent.
i.e., treat shared data as critical sections. Use synchronization/
mutual exclusion techniques to access shared data.
Maintaining consistency: programmer’s responsibility.
Release Consistency: synchronization accesses are only
processor consistent with respect to each other.
B. Prabhakaran 13
Coherence Protocols
Write-invalidate Protocol: invalidate all copies except the
one being modified before the write can proceed.
Once invalidated, data copies cannot be used.
Disadvantages:
invalidation sent to all nodes regardless of whether the nodes
will be using the data copies.
Inefficient if many nodes frequently refer to the data: after
invalidation message, there will be many requests to copy the
updated data.
Used in several systems including those that provide strict
consistency.
Write-update Protocol: causes all copies of shared data to be
updated. More difficult to implement, and guaranteeing
consistency may be harder (reads may happen in
between write-updates).
B. Prabhakaran 14
Granularity & Replacement
Granularity: size of the shared memory unit.
For better integration of DSM and local memory management:
DSM page size can be multiple of the local page size.
Integration with local memory management provides built-in
protection mechanisms to detect faults, and to prevent and recover
from inappropriate references.
Larger page size:
More locality of references.
Less overhead for page transfers.
Disadvantage: more contention for page accesses.
Smaller page size:
Less contention. Reduces false sharing, which occurs when 2
processors access 2 different, unshared data items and contend
only because the items lie on the same page.
B. Prabhakaran 15
Granularity & Replacement…
Page replacement needed as physical/main memory is
limited.
Data may be used in many modes: shared, private, read-
only, writable,…
Least Recently Used (LRU) replacement policy cannot be
directly used in DSMs supporting data movement. Modified
policies more effective:
Private pages may be removed ahead of shared ones as shared
pages have to be moved across the network
Read-only pages can be deleted as owners will have a copy
A page to be replaced should not be lost forever.
Swap it onto local disk.
Send it to the owner.
Use reserved memory in each node for swapping.
B. Prabhakaran 16
Case Studies
Cache coherence protocol:
PLUS System
Munin System
General Distributed Shared Memory:
IVY (Integrated Shared Virtual Memory at Yale)
B. Prabhakaran 17
PLUS System
PLUS system: write-update protocol and supports general
consistency.
Memory Coherence Manager (MCM) manages cache.
Unit of replication: a page (4 Kbytes); unit of memory
access and coherence maintenance: one 32-bit word.
A virtual page corresponds to a list of replicas of a page.
One of the replicas is designated as the master copy.
A distributed linked list (the copy-list) identifies the replicas of a
page. The copy-list has 2 pointers:
Master pointer
Next-copy pointer
B. Prabhakaran 18
PLUS: RW Operations
Read fault?:
If address points to local memory, read it. Otherwise, local MCM
sends a read request to its counterpart at the specified remote
node.
Data returned by remote MCM passed back to the requesting
processor.
Write operation:
Always on master copy then propagated to copies linked by the
copy-list.
Write fault: the memory location being written into is not local.
On write fault: update request sent to the remote node pointed to
by MCM.
If the remote node does not have the master copy, update request
sent to the node with master copy and for further propagation.
B. Prabhakaran 19
PLUS Write-update Protocol
[Figure: PLUS write-update on word X of page p, replicated on nodes 1-4 with the
master copy on node 2. Recovered steps: 1. MCM sends the write request to node 2.
2. Update message to the master node. 3. MCM updates X. 4. Update message to the
next copy on the copy-list. 5. MCM updates X. 6. Update message to the next copy.
7. Update X. 8. MCM sends ack: update complete.]
B. Prabhakaran 20
PLUS: Protocol
Node issuing write is not blocked on write operation.
However, a read on that location (being written into) gets
blocked till the whole update is completed. (i.e., remember
pending writes).
Strong ordering within a single processor independent of
replication (in the absence of concurrent writes by other
processors), but not with respect to another processor.
write-fence operation: strong ordering with synchronization
among processors. MCM waits for previous writes to
complete.
B. Prabhakaran 21
Munin System
Use application-specific semantic information to classify shared
objects. Use class-specific handlers.
Shared object classes:
Write-once objects: written at the start, read many times after that.
Replicated on-demand, accessed locally at each site. Large object?
Portions can be replicated instead of whole object.
Private objects: accessed by a single thread. Not managed by coherence
manager unless accessed by a remote thread.
Write-many objects: modified by multiple threads between
synchronization points. Munin employs delayed updates. Updates are
propagated only when thread synchronizes. Weak consistency.
Result objects: Assumption is concurrent updates to different parts of a
result object will not conflict and object is not read until all parts are
updated -> delayed update can be efficient.
Synchronization objects: (e.g.,) distributed locks for giving exclusive
access to data objects.
B. Prabhakaran 22
Munin System…
Migratory objects: accessed in phases, where each phase is a series
of accesses by a single thread: lock + movement, i.e., the object
migrates to the node requesting the lock.
Producer-consumer objects: written by 1 thread, read by another.
Strategy: move the object to the reading thread in advance.
Read-mostly object: i.e., writes are infrequent. Use broadcasts to
update cached objects.
General read-write objects: does not fall into any of the above
categories: Use Berkeley ownership protocol supporting strict
consistency. Objects can be in states such as:
Invalid: no useful data.
Unowned: has valid data. Other nodes have copies of the object and the
object cannot be updated without first acquiring ownership.
Owned exclusively: Can be updated locally. Can be replicated on-
demand.
Owned non-exclusively: Cannot be updated before invalidating other
copies.
B. Prabhakaran 23
IVY
IVY: Integrated Shared Virtual Memory at Yale. On Apollo
DOMAIN environment on a token ring network.
Granularity of access: a page of 1 KByte.
Address space of a processor: shared virtual memory +
private space.
On a page fault: if the page is not local, it is acquired
through remote memory request and made available to other
processes in the node as well.
Coherence protocol: multiple readers-single writer
semantics. Using a write-invalidation protocol, all read-only
copies are invalidated before writing.
B. Prabhakaran 24
Write-invalidation Protocol
Processor i has a write-fault on page p:
i finds owner of p
p’s owner sends the page + copyset (i.e., the processors having
read-only copies). Owner marks its page table entry of p as nil.
i sends invalidation messages to all processors in the copyset.
Processor i has a read fault to a page p:
i finds the owner of p.
p’s owner sends p to i and adds i to the copyset of p.
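A minimal sketch of the two fault handlers above, assuming pages carry an owner and a copyset (the invalidation message is just a print here):

# IVY fault handling sketch (single writer, multiple readers per page):
class Page:
    def __init__(self, owner):
        self.owner = owner                 # processor with the writable copy
        self.copyset = set()               # processors with read-only copies

def write_fault(page, i):
    for j in page.copyset - {i}:
        print(f"invalidate read-only copy at {j}")
    page.copyset = set()
    page.owner = i                         # old owner marks its entry nil

def read_fault(page, i):
    page.copyset.add(i)                    # owner adds i to the copyset
    print(f"{page.owner} sends read-only copy to {i}")

p = Page(owner="P1")
read_fault(p, "P2"); read_fault(p, "P3")
write_fault(p, "P2")                       # P3's copy invalidated, P2 owns p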
Implementation schemes:
Centralized manager scheme
Fixed distributed manager scheme
Dynamic distributed manager scheme
Above schemes differ on how the owner of a page is
identified
B. Prabhakaran 25
Centralized Manager
Central manager on a node maintains all page ownership
information.
Page faulting processor contacts the central manager and
requests a page copy.
Central manager forwards the request to the owner. For
write operations, the central manager updates the page
ownership information, making the requestor the new owner.
Page owner sends a page copy to the faulting processor.
For reads, the faulting processor is added to copyset.
For writes, owner is faulting processor now.
Central manager requires 2 messages to locate the page
owner.
B. Prabhakaran 26
Fixed Distributed Manager
Every processor keeps track of a pre-determined set of pages
(determined by a hashing/mapping function H).
Processor i faults on p: i contacts processor H(p) for a copy
of the page.
Rest of the steps proceeds as in centralized manager.
In both the above schemes, concurrent access requests to a
page are serialized at the site of a manager.
B. Prabhakaran 27
Dynamic Distributed Manager
Every host keeps track of the page ownership in its local
page table.
Page table has a column probowner (probable owner) whose
value can either be the true owner or a probable one (i.e., it
is used as a hint). Initially set to a default value.
On a page fault, suppose the request is sent to node i. If i is the
owner, steps proceed as in the case of the central manager.
If not, the request is forwarded to the probowner of p at i. This is
done until the actual owner is reached.
probowner is updated on receipt of:
Invalidation requests, ownership relinquishing messages.
Receiving or forwarding of a page:
For writes, the receiver is the owner. Forwarder updates the owner as
the receiver.
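A minimal sketch of owner location via probowner chains; refreshing the hints of visited nodes (path compression) is one reasonable variant, not necessarily the exact update rule above:

# Dynamic distributed manager sketch: follow probowner hints until the
# true owner (a node whose hint points to itself) is reached.
probowner = {"P1": "P2", "P2": "P3", "P3": "P3"}   # P3 is the true owner

def find_owner(start):
    path, node = [], start
    while probowner[node] != node:         # forward the request along hints
        path.append(node)
        node = probowner[node]
    for visited in path:                   # refresh stale hints to the owner
        probowner[visited] = node
    return node

print(find_owner("P1"))   # "P3"; at most N-1 forwardings in the worst case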
B. Prabhakaran 28
Dynamic Distributed Manager…
[Figure: page requests being forwarded along probowner hints. Example page
tables (page : probowner) at each processor:]
Page   Proc. 1   Proc. 2   Proc. 3   Proc. M
1      1         1         2         2
2      3         3         3         3
3      2         3         3         2
...
n      k         k         k         k
B. Prabhakaran 29
Dynamic Distributed Manager …
At most (N-1) messages needed to locate the owner.
As hints are updated as a side effect, the average number of
messages should be lower.
Double fault:
Consider a processor p doing a read first and then a write.
p needs to get the page twice.
One solution:
Use sequence numbers along with page transfers.
p can send its page sequence number to the owner. The owner can
compare the numbers and decide whether a transfer is needed or
not.
Still checking with owner is needed, only transfer of whole page
is avoided.
B. Prabhakaran 30
Memory Allocation
Centralized scheme for memory allocation. A central
manager allocates and deallocates memory.
A 2-level procedure may be more efficient:
Central manager allocates a large chunk to local processors.
Local manager handles local allocations.
B. Prabhakaran 31
Process Synchronization
Needed to serialize concurrent accesses to a page.
IVY uses eventcounts and provides 4 primitives:
Init(ec): initialize an event count.
Read(ec): returns the value of an event count.
Await(ec, value): calling process waits until ec reaches (i.e., is at
least) the specified value.
Advance(ec): increments ec by one and wakes up waiting
processes.
Primitives implemented on shared virtual memory:
Any process can use eventcount (after initialization) without
knowing its location.
When the page with eventcount is received by a processor,
eventcount operations are local to that processor and any number
of processes can use it.
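A minimal local sketch of the four primitives using a condition variable; Init corresponds to the constructor, and the page-based distribution of the eventcount is not modeled:

import threading

class EventCount:
    def __init__(self):                    # Init(ec)
        self.value = 0
        self.cond = threading.Condition()

    def read(self):                        # Read(ec)
        with self.cond:
            return self.value

    def await_(self, value):               # Await(ec, value)
        with self.cond:
            while self.value < value:      # wait until ec reaches value
                self.cond.wait()

    def advance(self):                     # Advance(ec)
        with self.cond:
            self.value += 1                # atomic increment
            self.cond.notify_all()         # wake up waiting processes

ec = EventCount()
threading.Thread(target=lambda: (ec.await_(1), print("woke up"))).start()
ec.advance()                               # waiter resumes once ec reaches 1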
B. Prabhakaran 32
Process Synchronization...
Eventcount operations are atomic.
using test-and-set instructions
disallowing transfer of memory pages with eventcounts when
primitives are being executed.
B. Prabhakaran 33
Distributed Scheduling
Motivation: A distributed system may have a mix of
heavily and lightly loaded systems. Hence, migrating a task
to share or balance load can help.
Let P be the probability that the system is in a state in
which at least 1 task is waiting for service and at least 1
server is idle.
Let ρ be the utilization of each server.
We can estimate P using probabilistic analysis and plot a
graph against system utilization.
For moderate system utilization, value of P is high, i.e., at
least 1 node is idle.
Hence, performance can be improved by sharing of tasks.
B. Prabhakaran 1
Distributed Scheduling ...
[Figure: probability P versus server utilization ρ for N = 5, 10, and 20 servers; P stays high (near 1.0) over a broad range of moderate utilizations.]
B. Prabhakaran 2
What is Load?
Load on a system/node can correspond to the queue length
of tasks/ processes that need to be processed.
Queue length of waiting tasks: proportional to task response
time, hence a good indicator of system load.
Distributing load: transfer tasks/processes among nodes.
If a task transfer (from another node) takes a long time, the
node may accept more tasks during the transfer time.
Causes the node to be highly loaded. Affects performance.
Solution: artificially increment the queue length when a task
is accepted for transfer from a remote node (to account for the
anticipated increase in load).
Task transfer can fail? : use timeouts.
B. Prabhakaran 3
Types of Algorithms
Static load distribution algorithms: Decisions are hard-
coded into an algorithm with a priori knowledge of system.
Dynamic load distribution: use system state information
such as task queue length, processor utilization.
Adaptive load distribution: adapt the approach based on
system state.
(e.g.,) Dynamic distribution algorithms collect load information
from nodes even at very high system loads.
Load information collection itself can add load on the system as
messages need to be exchanged.
Adaptive distribution algorithms may stop collecting state
information at high loads.
B. Prabhakaran 4
Balancing vs. Sharing
Load balancing: Equalize load on the participating nodes.
Transfer tasks even if a node is not heavily loaded so that queue
lengths on all nodes are approximately equal.
More task transfers, which might degrade performance.
Load sharing: Reduce burden of an overloaded node.
Transfer tasks only when the queue length exceeds a certain
threshold.
Fewer task transfers.
Anticipatory task transfers: transfer from overloaded nodes
to ones that are likely to become idle/lightly loaded.
More like load balancing, but with possibly fewer transfers.
B. Prabhakaran 5
Types of Task Transfers
Preemptive task transfers: transfer tasks that are partially
executed.
Expensive as it involves collection of task states.
Task state: virtual memory image, process control block, IO
buffers, file pointers, timers, ...
Non-preemptive task transfers: transfer tasks that have not
begun execution.
Do not require transfer of task states.
Can be considered as task placements. Suitable for load sharing
not for load balancing.
Both transfers involve information on user’s current
working directory, task privileges/priority.
B. Prabhakaran 6
Algorithm Components
Transfer policy: to decide whether a node needs to transfer
tasks.
Thresholds, perhaps in terms of number of tasks, are generally used.
(Another threshold can be processor utilization).
When a load on a node exceeds a threshold T, the node becomes a
sender. When it falls below a threshold, it becomes a receiver.
Selection Policy: to decide which task is to be transferred.
Criteria: task transfer should lead to reduced response time, i.e.,
transfer overhead should be worth incurring.
Simplest approach: select newly originated tasks. Transfer costs lower
as no state information is to be transferred. Non-preemptive transfers.
Other factors for selection: smaller tasks have less overhead.
Location-dependent system calls minimal (else, messages need to be
exchanged to perform system calls at the original node).
B. Prabhakaran 7
Algorithm Components...
Location Policy: to decide the receiving node for a task.
Polling is generally used. A node polls/checks whether another is
suitable and willing.
Polling can be done serially or in parallel (using multicast).
Alternative: broadcasting a query, sort of invitation to share load.
Information policy: for collecting system state information.
The collected information is used by transfer, selection, and
location.
Demand-driven Collection: Only when a node is highly or lightly
loaded, i.e., when a node becomes a potential sender or receiver.
Can be sender-initiated, receiver-initiated, or both (symmetric).
B. Prabhakaran 8
System Stability
Unstable system: long term arrival rate of work to a system
is greater than the CPU power.
Load sharing/balancing algorithm may add to the system
load making it unstable. (e.g.,) load information collection
at high system loads.
Effectiveness of algorithm: Effective if it improves the
performance relative to that of a system not using it.
(Effective algorithm cannot be unstable).
Load balancing algorithms should avoid fruitless actions.
(e.g.,) processor thrashing: task transfer makes the receiver
highly loaded, so the task gets transferred again, perhaps
repeatedly.
B. Prabhakaran 9
Load Distributing Algorithms
Sender-initiated: distribution initiated by an overloaded
node.
Receiver-initiated: distribution initiated by lightly loaded
nodes.
Symmetric: initiated by both senders and receivers. Has
advantages and disadvantages of both the approaches.
Adaptive: sensitive to state of the system.
B. Prabhakaran 10
Sender-initiated
Transfer Policy: Use thresholds.
Sender if queue length exceeds T.
Receiver if accepting a task will not make queue length exceed T.
Selection Policy: Only newly arrived tasks.
Location Policy:
Random: Use no remote state information. Task transferred to a node
at random.
No need for state collection. Unnecessary task transfers (processor
thrashing) may result.
Shortest: Poll a set of nodes. Select the receiver with the shortest task
queue length.
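A minimal sketch of the sender's decision, combining the threshold transfer policy with the "shortest" location policy (thresholds, node names, and queue lengths are made up):

import random

T, POLL_LIMIT = 3, 4
queue_len = {"A": 5, "B": 1, "C": 2, "D": 4}       # current task queue lengths

def try_transfer(sender, nodes):
    if queue_len[sender] <= T:                     # transfer policy: a sender?
        return None
    polled = random.sample([n for n in nodes if n != sender],
                           min(POLL_LIMIT, len(nodes) - 1))
    best = min(polled, key=lambda n: queue_len[n]) # shortest queue wins
    if queue_len[best] + 1 <= T:                   # receiver must stay below T
        queue_len[sender] -= 1
        queue_len[best] += 1                       # newly arrived task moves
        return best
    return None

print(try_transfer("A", list(queue_len)))          # transfers to "B"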
B. Prabhakaran 11
Sender-initiated
[Figure: flowchart of the sender-initiated algorithm.]
B. Prabhakaran 12
Sender-initiated
Information Policy: demand-driven.
Stability: can become unstable at high loads.
At high loads, it may become difficult for senders to find
receivers.
Also, the number of senders increases at high system loads,
thereby increasing the polling activity.
Polling activity may make the system unstable at high loads.
B. Prabhakaran 13
Receiver-initiated
Transfer Policy: uses thresholds. Queue lengths below T
identify receivers and those above T identify senders.
Selection Policy: as before.
Location Policy: Polling.
A random node is polled to check if a task transfer would place its
queue length below a threshold.
If not, the polled node transfers a task.
Otherwise, poll another node till a static PollLimit is reached.
If all polls fail, wait until another task is completed before starting
polling operation.
Information policy: demand-driven.
Stability: Not unstable since there are lightly loaded systems
that have initiated the algorithm.
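A matching sketch of the receiver side, assuming an underloaded node polls up to PollLimit random nodes when its queue drops below T (all names and numbers illustrative):

import random

T, POLL_LIMIT = 2, 3
queue_len = {"A": 0, "B": 4, "C": 1, "D": 5}

def try_acquire(receiver, nodes):
    if queue_len[receiver] >= T:                   # not a receiver
        return None
    for node in random.sample([n for n in nodes if n != receiver],
                              min(POLL_LIMIT, len(nodes) - 1)):
        # polled node hands over a task only if that keeps it at/above T
        if queue_len[node] - 1 >= T:
            queue_len[node] -= 1
            queue_len[receiver] += 1
            return node
    return None      # all polls failed: wait for a task completion, retry

print(try_acquire("A", list(queue_len)))           # "B" or "D", by poll order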
B. Prabhakaran 14
Receiver-initiated
[Figure: flowchart of the receiver-initiated algorithm (wait for some time and retry when all polls fail).]
B. Prabhakaran 15
Receiver-initiated
Drawback:
Polling initiated by receiver implies that it is difficult to find
senders with new tasks.
Reason: systems try to schedule tasks as and when they arrive.
Effect: receiver-initiated approach might result in preemptive
transfers. Hence transfer costs are more.
Sender-initiated: transfer costs are low as new jobs are
transferred and so no need for transferring task states.
B. Prabhakaran 16
Symmetric
Senders search for receivers and vice-versa.
Low loads: senders can find receivers easily. High loads:
receivers can find senders easily.
May have disadvantages of both: polling at high loads can
make the system unstable. Receiver-initiated task transfers
can be preemptive and so expensive.
Simple algorithm: combine previous two approaches.
Above-average algorithm:
Transfer Policy: Two adaptive thresholds instead of one. If a
node’s estimated average load is A, a higher threshold TooHigh >
A and a lower threshold TooLow < A are used.
Load < TooLow -> receiver. Load > TooHigh -> sender.
B. Prabhakaran 17
Above-average Algorithm
Location policy:
Sender Component
Node with TooHigh load, broadcasts a TooHigh message, sets TooHigh
timer, and listens for an Accept message.
A receiver that gets the (TooHigh) message sends an Accept message,
increases its load, and sets AwaitingTask timer.
If the AwaitingTask timer expires, load is decremented.
On receiving the Accept message: if the node is still a sender, it
chooses the best task to transfer and transfers it to the node.
When sender is waiting for Accept, it may receive a TooLow message
(receiver initiated). Sender sends TooHigh to that receiver. Do step 2 &
3.
On expiration of TooHigh timer, if no Accept message is received,
system is highly loaded. Sender broadcasts a ChangeAverage message.
B. Prabhakaran 18
Above-average Algorithm...
Receiver Component
Node with TooLow load, broadcasts a TooLow message, sets a
TooLow timer, and listens for TooHigh message.
If TooHigh message is received, do step 2 & 3 in Sender
Component.
If TooLow timer expires before receiving any TooHigh message,
receiver broadcasts a ChangeAverage message to decrease the load
estimate at other nodes.
Selection Policy: as discussed before.
Information policy: demand driven. The average load estimate is
modified based on system load. At high loads, the number of
senders progressively decreases.
Average system load is determined individually. There is a range of
acceptable load before trying to be a sender or a receiver.
B. Prabhakaran 19
Adaptive Algorithms
Limit Sender’s polling actions at high load to avoid instability.
Utilize the collected state information during previous polling operations
to classify nodes as: Sender/overloaded, receiver/underloaded, OK (in
acceptable load range).
Maintained as separate lists for each class.
Initially, each node assumes that all others are receivers.
Location policy at sender:
Sender polls the head of the receiver list.
Polled node puts the sender at the head of its sender list. It informs the
sender whether it is a receiver, a sender, or an OK node.
If the polled node is still a receiver, the new task is transferred.
Else the sender updates the polled node’s status, polls the next potential
receiver.
If this polling process fails to identify a receiver, the task can still be
transferred during a receiver-initiated dialogue.
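A minimal sketch of the sender's location policy with node lists; the reply handling and reclassification details are simplified, not the exact protocol:

# Adaptive algorithm sketch: the sender polls the head of its receiver
# list and reclassifies nodes from their replies.
lists = {"receivers": ["B", "C"], "senders": [], "ok": []}

def classify(node, status):
    for l in lists.values():                       # remove from old class
        if node in l:
            l.remove(node)
    lists[status].append(node)                     # file under reported class

def sender_poll(poll_status):
    # poll_status: what each polled node reports about itself
    while lists["receivers"]:
        node = lists["receivers"][0]               # head of the receiver list
        status = poll_status[node]
        classify(node, status)
        if status == "receivers":
            return node                            # still a receiver: transfer
    return None          # rely on a later receiver-initiated dialogue

print(sender_poll({"B": "senders", "C": "receivers"}))   # -> "C"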
B. Prabhakaran 20
Adaptive Algorithms…
Location policy at receiver
Receivers obtain tasks from potential senders. Lists are scanned in
the following order.
Head to tail in senders list (most up-to-date info used), tail to head in
OK list (least up-to-date used), tail to head in receiver list.
Least up-to-date used in the hope that status might have changed.
Receiver polls the selected node. If the node is a sender, a task is
transferred.
If the node is not a sender, both the polled node and receiver update
each other’s status.
Polling process stops if a sender is found or a static PollLimit is
reached.
B. Prabhakaran 21
Adaptive Algorithms…
At high loads, sender-initiated polling gradually reduces as
nodes get removed from receiver list (and become senders).
Whereas at low loads, sender will generally find some receiver.
At high loads, receiver-initiated works and can find a sender.
At low loads, receiver may not find senders, but that does not affect
the performance.
Algorithm dynamically becomes sender-initiated at low
loads and receiver-initiated at high loads.
Hence, algorithm is stable and can use non-preemptive
transfers at low loads (sender initiated).
B. Prabhakaran 22
Selecting an Algorithm
If a system never gets highly loaded, sender-initiated
algorithms work better.
Stable, receiver-initiated algorithms better for high loads.
Widely fluctuating loads: stable, symmetric algorithms.
Widely fluctuating loads + high migration cost for
preemptive transfers: stable, sender-initiated algorithms.
Heterogeneous work arrival: stable, adaptive algorithms.
B. Prabhakaran 23
Performance Comparison
[Figure: mean response time versus offered system load for sender-initiated (SEND), receiver-initiated (RECV), and symmetric (SYM) algorithms.]
B. Prabhakaran 24
Implementation Issues
Task placement
A task that is yet to begin is transferred to a remote machine, and
starts its execution there.
Task migration
State Transfer:
State includes contents of registers, task stack, task status
Unfreeze:
Task is installed at the new machine, unfrozen, and is put in the
ready queue.
B. Prabhakaran 25
State Transfer
Issues to be considered:
Cost to support remote execution including delays due to freezing
the task.
Duration of freezing the task should be small. Otherwise, it can
affect the task's interactions with other processes (e.g., cause timeouts).
B. Prabhakaran 26
State Transfer...
Residual dependencies: portions of the task's state remain on the
former host after migration. Disadvantages of residual dependencies:
Affects reliability as it becomes dependent on former host
Affects performance since memory accesses can become slow as
pages need to be transferred
Affects complexity as task states are distributed on several hosts.
Location transparency:
Migration should hide the location of tasks
Message passing, process naming, file handling, and other activities
should be independent of the task's actual location.
Task names and their locations can be maintained as hints. If the
hints fail, they can be updated by a broadcast query or through a
name server.
B. Prabhakaran 27
Recovery
Failure of a site/node in a distributed system causes
inconsistencies in the state of the system.
Recovery: bringing back the failed node in step with other
nodes in the system.
Failures:
Process failure:
Deadlocks, protection violation, erroneous user input, etc.
System failure:
Failure of processor/system. System failure can have full/partial
amnesia.
It can be a pause failure (system restarts at the same state it was
in before the crash) or a complete halt.
Secondary storage failure: data inaccessible.
Communication failure: network inaccessible.
B. Prabhakaran 1
Fault-to-Recovery
[Figure: a fault (manufacturing, design, external, or fatigue) produces an erroneous system state, which leads to system failure.]
B. Prabhakaran 2
Backward & Forward Recovery
Forward Recovery:
Assess damages that could be caused by faults, remove those
damages (errors), and help processes continue.
Difficult to do forward assessment. Generally tough.
Backward Recovery:
When forward assessment not possible. Restore processes to
previous error-free state.
Expensive to roll back states (cost of fault + recovery).
Unrecoverable actions: print outs, cash dispensed at ATMs.
B. Prabhakaran 3
Recovery System Model
For Backward Recovery
A single system with secondary and stable storage
Stable storage does not lose information on failures
Stable storage used for logs and recovery points
Stable storage assumed to be more secure than secondary
storage.
Data on secondary storage assumed to be archived
periodically.
B. Prabhakaran 4
Approaches
Operation-based Approach
Maintaining logs: all modifications to the state of a process are
recorded in sufficient detail so that a previous state can be restored
by reversing all changes made to the state.
(e.g.,) Commit in database transactions: if a transaction is
committed by all nodes, then its changes are permanent. If it does
not commit, the effects of the transaction are undone.
Updating-in-place: every write (update) results in a log of (1) object
name, (2) old object state, (3) new object state. Operations:
A do operation updates the object & writes the log;
an undo operation uses the log to restore the old state;
a redo operation uses the log to install the new state.
B. Prabhakaran 6
Recovery in Concurrent Systems
Distributed system state involves message exchanges.
In distributed systems, rolling back one process can cause
the roll back of other processes.
Orphan messages & the Domino effect: Assume Y fails
after sending m.
X has record of m at x3 but Y has no record. m -> orphan
message.
Y rolls back to y2 -> X should go to x2.
If Z rolls back, X and Y has to go to x1 and y1 -> Domino effect,
roll back of one process causes one or more processes to roll back.
[Figure: space-time diagram for the domino effect; X has checkpoints x1-x3, Y has y1-y2 and sends m to X, and Z has z1-z2.]
B. Prabhakaran 7
Lost Messages
If Y fails after receiving m, it will rollback to y1.
X will rollback to x1
m will be a lost message as X has recorded it as sent and Y
has no record of receiving it.
[Figure: Y fails after receiving m and rolls back to y1; X rolls back to x1, and m becomes a lost message.]
B. Prabhakaran 8
Livelocks
[Figure: livelock scenario. Y's failure and rollback to y1 orphan m1, forcing X back to x1; message n1 (and later n2, m2), still in transit, triggers a second rollback, and the pattern can repeat indefinitely.]
B. Prabhakaran 9
Consistent Checkpoints
Overcoming domino effect and livelocks: checkpoints
should not have messages in transit.
Consistent checkpoints: no message exchange between any
pair of processes in the set as well as outside the set during
the interval spanned by checkpoints.
{x1,y1,z1} is a strongly consistent checkpoint.
[Figure: the earlier space-time diagram; {x1, y1, z1} is a strongly consistent set of checkpoints.]
B. Prabhakaran 10
Synchronous Approach
Checkpointing:
First phase:
An initiating process, Pi, takes a tentative checkpoint.
Pi requests all other processes to take tentative checkpoints.
Every process informs Pi whether it was able to take a checkpoint.
A process can fail to take a checkpoint due to the nature of
application (e.g.,) lack of log space, unrecoverable transactions.
Second phase:
If all processes took checkpoints, Pi decides to make the
checkpoint permanent.
Otherwise, checkpoints are to be discarded.
Pi conveys this decision to all the processes as to whether
checkpoints are to be made permanent or to be discarded.
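A minimal sketch of the two-phase structure above, with message passing collapsed into function calls and failure handling omitted:

# Two-phase checkpointing sketch: tentative checkpoints become permanent
# only if every process succeeded.
def take_tentative(process):
    ok = process["can_checkpoint"]         # may fail, e.g., no log space
    if ok:
        process["tentative"] = process["state"]
    return ok

def coordinate(initiator, others):
    procs = [initiator] + others
    if all(take_tentative(p) for p in procs):      # phase 1
        for p in procs:                            # phase 2: make permanent
            p["permanent"] = p.pop("tentative")
        return "permanent"
    for p in procs:                                # phase 2: discard
        p.pop("tentative", None)
    return "discarded"

pi = {"state": "s0", "can_checkpoint": True}
pj = {"state": "s1", "can_checkpoint": False}
print(coordinate(pi, [pj]))    # "discarded": one process could not comply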
B. Prabhakaran 11
Assumptions: Synchronous Appr.
Processes communicate by exchanging messages through
communication channels
Channels are FIFO in nature.
End-to-end protocols (e.g. TCP) are assumed to cope with
message loss due to rollback recovery and communication
failure.
Communication failures do not partition the network.
A process is not allowed to send messages between phase 1
and 2.
B. Prabhakaran 12
Synchronous Approach...
Optimization:
Taking a checkpoint is expensive and the algorithm discussed may
take unnecessary checkpoints.
[Figure: X initiates checkpointing at x3; without the optimization, Y, Z, and W all take (possibly unnecessary) checkpoints y3, z3, w3.]
B. Prabhakaran 13
Synchronous Approach...
Optimization:
Taking a checkpoint is expensive and the algorithm discussed may
take unnecessary checkpoints.
[Figure: the same scenario with the optimization; only the processes whose messages precede the checkpoint request take new checkpoints.]
B. Prabhakaran 14
Checkpointing Optimization
Each process uses monotonically increasing labels in its
outgoing messages.
Notations:
L: largest label. S: smallest label
Let m be the last message X received from Y after X’s last permanent
checkpoint. last_label_recdx[Y] = m.l, if m exists. Otherwise, it is set to
S.
Let m be the first message X sent to Y after checkpointing at X
(permanent or temporary). first_label_sentx[Y] = m.l, if exists. Otherwise,
set to L.
For a checkpointing request to Y, X sends last_label_recdx[Y].
Y takes a temporary checkpoint iff last_label_recdx[Y] >=
first_label_senty[X]. i.e., X has received 1 or more messages after
checkpointing by Y and hence Y should take checkpoint.
ckpt_cohortx = {Y | last_label_recdx[Y] > S}, i.e., the set of all processes
from which X has received messages after its checkpoint.
B. Prabhakaran 15
Checkpointing Optimization
Initial state at all processes p:
first_label_sentp[q] := L.
OK_to_take_ckptp := “yes” if p is willing; “no” otherwise.
At initiator Pi:
for all p in ckpt_cohortpi do send Take_a_tentative_ckpt
(Pi, last_label_recdpi[p]) message.
if all processes replied “yes”, then for all p in ckpt_cohortpi do
send Make_tentative_ckpt_permanent.
else send Undo_tentative_ckpt.
At all processes p:
Upon receiving Take_a_tentative_ckpt message from q do
if OK_to_take_ckptp = “yes” AND last_label_recdq[p] >=
first_label_sentp[q] then
take a tentative checkpoint.
B. Prabhakaran 16
Checkpointing Optimization...
At all processes p (contd.):
take a tentative checkpoint.
for all processes r in ckpt_cohortp do send Take_a_tentative_ckpt
(p, last_label_recdp[r]) message.
if all processes r replied “yes” then OK_to_take_ckptp := “yes”
else OK_to_take_ckptp := “no”.
send (p, OK_to_take_ckptp) to q.
Upon receiving Make_tentative_ckpt_permanent message do
make the tentative checkpoint permanent.
for all processes r in ckpt_cohortp do send
Make_tentative_ckpt_permanent message.
Upon receiving Undo_tentative_ckpt message do
undo the tentative checkpoint.
for all processes r in ckpt_cohortp do send Undo_tentative_ckpt
message.
B. Prabhakaran 17
Synchronous Rollback
Rolling back:
First phase:
Pi initiates a rollback asking if all processes are willing to rollback
to the previous checkpoint.
Any process may say no, if it is involved in another recovery
process.
Second phase:
Pi conveys the decision on agreement to all others.
[Figure: X fails after checkpoint x2; on recovery, the processes roll back to the checkpoints {x2, y2, z2}.]
B. Prabhakaran 18
Rollback Optimization
Additional Notation:
Let m be the last message X sent to Y (as per the state X restarts from).
last_label_sentx[Y] = m.l, if m exists. Otherwise, set to S.
When X requests Y to restart from the permanent checkpoint, it sends
last_label_sentx[Y] along with its request. Y will restart from its permanent
checkpoint only if: last_label_recdy[X] > last_label_sentx[Y]
roll_cohortx = {Y | X can send messages to Y}
Algorithm:
Initial State at all processes p:
resume_executionp := true;
for all processes q do last_label_recdp[q] := S;
willing_to_rollp = “yes” if p is willing to roll back. “no” otherwise.
At initiator process Pi:
for all p in roll_cohortpi do send Prepare_to_rollback (Pi,
last_label_sentPi[p]) message.
B. Prabhakaran 19
Rollback Optimization...
At initiator process Pi...
if all processes reply “yes”, then for all p in roll_cohortpi do send
Roll_back message.
else for all p in roll_cohortpi do send Donot_roll_back message.
At all processes p:
Upon receiving Prepare_to_rollback (q, last_label_sentq[p])
message from q do
if willing_to_rollp AND last_label_recdp[q] >
last_label_sentq[p] AND resume_executionp then
resume_executionp := false;
for all r in roll_cohortp do send Prepare_to_rollback (p,
last_label_sentp[r]) message;
if all r in roll_cohortp replied “yes” then willing_to_rollp :=
“yes”
else willing_to_rollp := “no”
B. Prabhakaran 21
Rollback Optimization...
[Figure: rollback example with message labels (in parentheses); checkpoints x1, y1, z1 together with the labels of the last messages sent/received determine which processes must restart.]
B. Prabhakaran 22
Rollback Optimization...
[Figure: a variation of the same example with different message labels, under which fewer processes need to roll back.]
B. Prabhakaran 23
Asynchronous Approach
Disadvantages of Synchronous Approach:
Additional message exchanges for taking checkpoints
Delays normal executions as messages cannot be exchanged
during checkpointing.
Unnecessary overhead if no failures occur between checkpoints.
Asynchronous approach: independent checkpoints at each
processor. Identify a consistent set of checkpoints if needed,
for roll backs.
E.g., {x3,y3,z2} not consistent; {x2,y2,z2} consistent. Used for
rollback
[Figure: independent checkpoints x1-x3 on X, y1-y3 on Y, and z1-z2 on Z, with messages exchanged between the processes.]
B. Prabhakaran 24
Asynchronous Approach...
Assumption: 2 types of logging.
Volatile logging: takes less time but contents lost on failure. Periodically
flushed to stable logs.
Stable log: may take more time but contents not lost.
Logging: tuple {s, m, msgs_sent}. s process state, m message received,
msgs_sent the set of messages sent during the event.
Event logging initiated on message receipt.
Notations & data structures:
RCVDi<-j (CkPti): Number of messages received by processor Pi from
Pj as per checkpoint CkPti.
SENTi->j(CkPti): Number of messages sent by processor Pi to Pj as per
checkpoint CkPti.
Basic Idea:
Each processor keeps track of the number of messages sent/ received to/
from other processors.
B. Prabhakaran 25
Asynchronous Approach...
Basic Idea ....
Existence of orphan messages identified by comparing the number
of messages sent and received.
If the number of received messages > sent messages -> presence of
orphans -> the receiving process needs to roll back.
Algorithm:
A recovering processor broadcasts a message to all processors.
if Pi is the recovering processor, CkPti := latest stable log.
else CkPti := latest event that took place in i.
for k := 1 to N do (N the total number of processors in the system)
for each neighboring processor j do send ROLLBACK
(i,SENTi->j(CkPti)) message.
Wait for ROLLBACK message from every neighbor.
B. Prabhakaran 26
Asynchronous Approach...
Algorithm ...
for every ROLLBACK(j,c) message received from a neighbor j, i
does the following:
if RCVDi<-j(CkPti) > c then /* orphans present */
find the latest event e such that RCVDi<-j(e) = c;
CkPti := e.
end for k.
Algorithm has |N| iterations.
During the kth (k != 1) iteration, Pi, based on the CkPti determined
in the (k-1)th iteration, computes SENTi->j(CkPti) for each neighbor.
This value is sent in a ROLLBACK message (in kth iteration)
At the end of each iteration, at least 1 processor will roll back to its
final recovery point.
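A minimal sketch of the iteration, assuming each logged event stores cumulative per-neighbor sent/received counts (the event logs below are a made-up two-process scenario):

# Asynchronous recovery sketch: each round compares received counts
# against the sender's counts and trims orphaned receipts.
events = {
    "X": [{"sent": {"Y": 0}, "rcvd": {"Y": 0}},
          {"sent": {"Y": 1}, "rcvd": {"Y": 1}},
          {"sent": {"Y": 1}, "rcvd": {"Y": 2}}],
    "Y": [{"sent": {"X": 1}, "rcvd": {"X": 0}},
          {"sent": {"X": 2}, "rcvd": {"X": 1}}],
}
ckpt = {"X": 2, "Y": 0}    # Y recovers at its last stable log entry

def recover(procs):
    for _ in range(len(procs)):                    # N iterations
        sent = {i: events[i][ckpt[i]]["sent"] for i in procs}
        for i in procs:
            for j in procs:
                if i == j:
                    continue
                c = sent[j].get(i, 0)              # ROLLBACK(j, c) from j
                while events[i][ckpt[i]]["rcvd"].get(j, 0) > c:
                    ckpt[i] -= 1                   # orphan receipts: roll back
    return ckpt

print(recover(["X", "Y"]))   # {'X': 1, 'Y': 0}: X discards its orphan receipt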
B. Prabhakaran 27
Asynch. Approach Example
[Figure: processor X with recovery point x1 followed by logged events ex1-ex3.]
B. Prabhakaran 29
Distributed Databases
Checkpointing objectives in distributed database systems
(DDBS):
Normal operations should be minimally interfered with, by
checkpointing.
Since a DDBS may update different objects at different sites, local
checkpointing at each site is better.
For faster recovery, checkpoints be consistent (desirable property).
Activity in a DDBS is in terms of transactions. So in a DDBS, a
consistent checkpoint should either include the updates of a
transaction completely or not include them at all.
Issues in identifying checkpoints:
How sites agree on what transactions are to be included
Taking checkpoints without interference
B. Prabhakaran 30
DDBS Checkpointing
Assumptions:
Basic unit of activity is transactions
Transactions follow some concurrency control protocol
Lamport’s logical clocks used for time-stamping transactions.
Failures detected by network protocols or timeouts
Network partitioning never occurs
Basic Idea
All sites agree on a Global Checkpoint Number (GCPN)
Transactions with timestamps <= GCPN are included in the
checkpoint. Called BCPTs: Before Checkpoint Transactions.
Timestamps of After Checkpoint Transactions (ACPTs) > GCPN.
Each site keeps multiple versions of data items being updated by ACPTs
in volatile storage -> no interference during checkpointing.
B. Prabhakaran 31
DDBS Checkpointing ...
Data Structures
LC: local clock as per Lamport’s logical clock
LCPN (local checkpoint number): determined locally for the
current checkpoint.
Algorithm: initiated by checkpoint coordinator (CC). CC
uses checkpoint subordinates (CS).
Phase 1 at the CC
CC broadcasts a Checkpoint_Request message with a local
timestamp LCcc.
LCPNcc := LCcc
CONVERTcc := false
Phase 1 at CSs
B. Prabhakaran 32
DDBS Checkpointing ...
Phase 1 at CSs
On receiving a Checkpoint_Request message, a site m, updates its
local clock as LCm := MAX(LCm, LCcc+1)
LCPNm := LCm
m informs LCPNm to the CC
CONVERTm := false
m marks all the transactions with timestamps !> LCPNm as BCPTs
and the rest as temporary-ACPTs.
All updates of temporary-ACPTs are stored in the buffers of the
ACPTs
If a temporary-ACPT commits, updates are not flushed to the
database but maintained as committed temporary versions (CTVs).
Other transactions access CTVs for reads. For writes, another
version of CTV is created.
B. Prabhakaran 33
DDBS Checkpointing ...
Phase 2 at CC
All CS’s replies received -> GCPN := Max(LCPN1, .., LCPNn)
Broadcast GCPN
Phase 2 at the CSs
On receiving GCPN, m marks all temporary-ACPTs that satisfy the
following conditions as BCPTs:
LCPNm < transaction time stamp <= GCPN
Updates of the above converted BCPTs are included in checkpoints
CONVERTm := true (i.e., GCPN & BCPTs identified)
When all BCPTs terminate and CONVERTm = true, m takes a local
checkpoint by saving the state of the data objects.
After local checkpointing, database is updated with CTVs and
CTVs are deleted.
B. Prabhakaran 34
Fault Tolerance
Recovery: bringing back the failed node in step with other
nodes in the system.
Fault Tolerance: Increase the availability of a service or the
system in the event of failures. Two ways of achieving it:
Masking failures: Continue to perform its specified function in the
event of failure.
Well defined failure behavior: System may or may not function in
the event of failure, but can facilitate actions suitable for recovery.
(e.g.,) effect of database transactions visible only if committed
to by all sites. Otherwise, transaction is undone without any
effect to other transactions.
Key approach to fault tolerance: redundancy. e.g., multiple
copies of data, multiple processes providing same service.
Topics to be discussed: commit protocols, voting protocols.
B. Prabhakaran 1
Atomic Actions
Example: Processes P1 & P2 share a data named X.
P1: ... lock(X); X:= X + Z; unlock(X); ...
P2: ... lock(X); X := X + Y; unlock(X); ...
Updating of X by P1 or P2 should be done atomically i.e.,
without any interruption.
Atomic operation if:
the process performing it is not aware of existence of any others.
the process doing it does not communicate with others during the
operation time.
No other state change in the process except the operation.
Effects on the system gives an impression of indivisible and
perhaps instantaneous operation.
B. Prabhakaran 2
Committing
A group of actions is grouped as a transaction and the group
is treated as an atomic action.
The transaction, during the course of its execution, decides
to commit or abort.
Commit: guarantee that the transaction will be completed.
Abort: guarantee not to do the transaction and erase any part
of the transaction done so far.
Global atomicity: (e.g.,) A distributed database transaction
that must be processed at every or none of the sites.
Commit protocols: are ones that enforce global atomicity.
B. Prabhakaran 3
2-phase Commit Protocol
Distributed transaction carried out by a coordinator + a set of
cohorts executing at different sites.
Phase 1:
At the coordinator:
Coordinator sends a COMMIT-REQUEST message to every cohort
requesting them to commit.
Coordinator waits for reply from all others.
At the cohorts:
On receiving the request: if the transaction execution is successful,
the cohort writes UNDO and REDO log on stable storage. Sends
AGREED message to coordinator.
Otherwise, sends an ABORT message.
Phase 2:
At the coordinator:
All cohorts agreed? : write a COMMIT record on log, send
COMMIT request to all cohorts.
B. Prabhakaran 4
2-phase Commit Protocol ...
Phase 2...:
At the coordinator...:
Otherwise, send an ABORT message
Coordinator waits for acknowledgement from each cohort.
No acknowledgement within a timeout period? : resend the
commit/abort message to that cohort.
All acknowledgements received? : write a COMPLETE record
to the log.
At the cohorts:
On COMMIT message: resources & locks for the transaction released.
Send Acknowledgement to the coordinator.
On ABORT message: undo the transaction using the UNDO log, release
resources & locks held by the transaction, send Acknowledgement.
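A minimal sketch of the protocol's decision logic, with messages collapsed into function calls and logging, acks, and retransmission elided:

# Two-phase commit sketch:
def cohort_prepare(ok):
    if ok:
        # write UNDO/REDO log on stable storage, then agree
        return "AGREED"
    return "ABORT"

def coordinator_decide(votes):
    # Phase 2: all agreed -> write COMMIT record and broadcast COMMIT
    return "COMMIT" if all(v == "AGREED" for v in votes) else "ABORT"

votes = [cohort_prepare(True), cohort_prepare(True), cohort_prepare(False)]
print(coordinator_decide(votes))   # "ABORT": one cohort could not execute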
B. Prabhakaran 5
Handling failures
2-phase commit protocol handles failures as below:
If coordinator crashes before writing the COMMIT record:
on recovery, it will send ABORT message to all others.
Cohorts who agreed to commit, will simply undo the transaction
using the UNDO log and abort.
Other cohorts will simply abort.
All cohorts are blocked till coordinator recovers.
Coordinator crashes after writing COMMIT but before writing
COMPLETE? : on recovery, broadcast COMMIT and wait for acks.
Cohort crashes in phase 1? : coordinator aborts the transaction.
Cohort crashes in phase 2? : on recovery, it will check with the
coordinator whether to abort or commit.
Drawback: blocking protocol. Cohorts blocked if coordinator
fails.
Resources and locks held unnecessarily.
B. Prabhakaran 6
2-phase commit: State Machine
Synchronous protocol: all sites proceed in rounds, i.e., a site
never leads another by more than 1 state transition.
A state transition occurs in a process participating in the 2-
phase commit protocol whenever it receives/sends
messages.
States: q (idle or querying state), w (wait), a (abort), c
(commit).
When coordinator is in state q, cohorts are in q or a.
Coordinator in w -> cohort can be in q, w, or a.
Coordinator in a/c -> cohort is in w or a/c.
A cohort in a/c: other cohorts may be in a/c or w.
A site is never in c when another site is in q as the protocol
is synchronous.
B. Prabhakaran 7
2-phase commit: State Machine...
[Figure: state machines for the 2-phase commit protocol. Coordinator: q1 -> w1
on sending Commit_Request to all cohorts; w1 -> c1 when all agree (Commit sent
to all); w1 -> a1 on one or more Abort replies (Abort sent to all). Cohort i:
qi -> wi on receiving Commit_Request and sending Agreed; qi -> ai on receiving
Commit_Request and sending Abort; wi -> ci on Commit from the coordinator;
wi -> ai on Abort from the coordinator.]
B. Prabhakaran 8
Drawback
Drawback: blocking protocol. Cohorts blocked if
coordinator fails.
Resources and locks held unnecessarily.
Conditions that cause blocking:
Assume that only one site is operational. This site cannot decide
to abort a transaction as some other site may be in commit state.
It cannot commit as some other site can be in abort state.
Hence, the site is blocked until all failed sites recover.
B. Prabhakaran 9
Nonblocking Commit
Nonblocking commit? :
Sites should agree on the outcome by examining their local states.
A failed site, upon recovery, should reach the same conclusion
regarding the outcome. Consistent with other working sites.
Independent recovery: if a recovering site can decide on the final
outcome based solely on its local state.
A nonblocking commit protocol can support independent recovery.
Notations:
Concurrency set: Let Si denote the state of the site i. The set of all
the states that may be concurrent with it is concurrency set (C(si)).
(e.g.,) Consider a system having 2 sites. If site 2’s state is w2, then
C(w2) = {c1, a1, w1}. C(q2) = {q1, w1}. a1, c1 are not in C(q2) as the
2-phase commit protocol is synchronous within 1 state transition.
Sender set: Let s be any state, M be the set of all messages
received in s. Sender set, S(s) = {i | site i sends m and m in M}
B. Prabhakaran 10
3-phase Commit
Lemma: If a protocol contains a local state of a site with
both abort and commit states in its concurrency set, then
under independent recovery conditions it is not resilient to
an arbitrary single failure.
In the previous figure, C(w2) contains both abort and commit
states in the concurrency set.
To make it a non-blocking protocol: introduce a buffer state
at both the coordinator and the cohorts.
Now, C(w1) = {q2, w2, a2} and C(w2) = {a1, p1, w1}.
B. Prabhakaran 11
3-phase commit: State Machine
[Figure: state machines for the 3-phase commit protocol. Coordinator: q1 -> w1
on sending Commit_Request to all cohorts; w1 -> p1 when all agree (Prepare sent
to all); p1 -> c1 when all cohorts ack (Commit sent to all); w1 -> a1 on one or
more Abort replies. Cohort i: qi -> wi (Agreed sent) or qi -> ai (Abort sent) on
receiving Commit_Request; wi -> pi on Prepare (Ack sent); pi -> ci on Commit
from the coordinator; wi -> ai on Abort from the coordinator.]
B. Prabhakaran 12
Failure, Timeout Transitions
A failure transition occurs at a failed site at the instant it
fails or immediately after it recovers from the failure.
Rule for failure transition: For every non-final state s (i.e., qi, wi,
pi) in the protocol, if C(s) contains a commit, then assign a failure
transition from s to a commit state in its FSA. Otherwise, assign a
failure transition from s to an abort state.
Reason: pi is the only state with a commit state in its concurrency
set. If a site fails at pi, then it can commit on recovery. Any other
state failure, safer to abort.
If site i is waiting on a message from j, i can time out. i can
determine the state of j based on the expected message.
Based on j’s state, the final state of j can be determined
using failure transition at j.
B. Prabhakaran 13
Failure, Timeout Transitions
This can be used for incorporating Timeout transitions at i.
Rule for timeout transition: For each nonfinal state s, if site j in
S(s),and site j has a failure transition from s to a commit (abort)
state, then assign a timeout transition from s to a commit (abort)
state.
Reason:
Failed site makes a transition to a commit (abort) state using failure
transition rule.
So, the operational site must make the same transition to ensure that
the final outcome is the same at all sites.
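A minimal sketch of the failure-transition rule; the concurrency sets are illustrative for a cohort (only C(w2) is given explicitly in these slides, and C(p) is an assumption):

# Failure-transition rule: a non-final state fails over to commit iff
# its concurrency set contains a commit state; otherwise to abort.
concurrency = {
    "q": {"q1", "w1"},          # from the slides
    "w": {"a1", "p1", "w1"},    # from the slides
    "p": {"p1", "c1", "w1"},    # assumed: p is concurrent with commit
}

def failure_transition(state):
    if any(s.startswith("c") for s in concurrency[state]):
        return "commit"
    return "abort"

for s in "qwp":
    print(s, "->", failure_transition(s))   # only p fails over to commit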
B. Prabhakaran 14
3-phase commit + Failure Trans.
[Figure: the 3-phase commit state machines annotated with failure (F) and
timeout (T) transitions; the q and w states move to abort, and the prepare
states fail over to commit, as described in the following slides.]
Nonblocking Commit Protocol
Phase 1:
First phase identical to that of 2-phase commit, except for failures.
Here, coordinator is in w1 and each cohort is in a or w or q,
depending on whether it has received the commit_request message
or not.
Phase 2:
Coordinator sends a Prepare message to all the cohorts (if all of
them sent Agreed message in phase 1).
Otherwise, it will send an Abort message to them.
On receiving a Prepare message, a cohort sends an
acknowledgement to the coordinator.
If the coordinator fails before sending a Prepare message, it
aborts the transaction on recovery.
Cohorts, on timing out on a Prepare message, also abort the
transaction.
B. Prabhakaran 16
Nonblocking Commit Protocol
Phase 3:
On receiving acknowledgements to Prepare messages, the coordinator
sends a Commit message to all cohorts.
Cohort commits on receiving this message.
Coordinator fails before sending Commit? : commits upon
recovery.
So cohorts, on timing out on the Commit message, commit the transaction.
Cohort failed before sending an acknowledgement? : coordinator
times out and sends an abort message to all others.
Failed cohort aborts the transaction upon recovery.
B. Prabhakaran 17
Commit Protocols Disadvantages
No protocol using the above independent recovery technique
is resilient to simultaneous failure of more than 1 site.
The above protocol is also not resilient to network
partitioning.
Alternative: Use voting protocols.
Basic idea of voting protocol:
Each replica assigned some number of votes
A majority of votes need to be collected before accessing a replica.
Voting mechanism: more fault tolerant to site failures, network
partitions, and message losses.
Types of voting schemes:
Static
Dynamic
B. Prabhakaran 18
Static Voting Scheme
System Model:
File replicas at different sites. File lock rule: either one writer + no
reader or multiple readers + no writer.
Every file is associated with a version number that gives the number
of times a file has been updated.
Version numbers are stored on stable storage. Every successful write
updates version number.
Basic Idea:
Every replica assigned a certain number of votes. This number stored
on stable storage.
A read or write operation permitted if a certain number of votes, called
read quorum or write quorum, are collected by the requesting process.
Voting Algorithm:
Let a site i issue a read or write request for a file.
Site i issues a Lock_Request to its local lock manager.
B. Prabhakaran 19
Static Voting ...
Voting Algorithm...:
When lock request is granted, i sends a Vote_Request message to
all the sites.
When a site j receives a Vote_Request message, it issues a
Lock_Request to its lock manager. If the lock request is granted,
then it returns to site i the version number of the replica (VNj) and
the number of votes assigned to the replica (Vj) at site j.
Site i decides whether it has the quorum or not, based on replies
received within a timeout period as follows.
For read requests, Vr = Sum of Vk, k in P, where P is the set of
sites from which replies were received.
For write requests, Vw = Sum of Vk, k in Q, where:
M = max{VNj : j in P}
Q = {j in P : VNj = M}
Only the votes of the current (version) replicas are counted in
deciding the write quorum.
B. Prabhakaran 20
Static Voting ...
Voting Algorithm...:
If i is not successful in getting the quorum, it issues a
Release_Lock to the lock manager & to all sites that gave their votes.
If i is successful in collecting the quorum, it checks whether its
copy of file is current (VNi = M). If not, it obtains the current
copy.
If the request is read, i reads the local copy. If write, i updates the
local copy and VN.
i sends all updates and VNi to all sites in Q, i.e., update only
current replicas. i sends a Release_Lock request to its lock
manager as well as those in P.
All sites on receiving updates, perform updates. On receiving
Release_Lock, releases lock.
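A minimal sketch of the quorum tests, assuming each reply carries the replica's votes and version number (sites and numbers invented; here the total is v = 5, with S2 not replying):

replies = {"S1": (1, 3), "S3": (2, 3), "S4": (1, 2)}   # site -> (votes, VN)

def have_quorum(op, replies, r, w):
    if op == "read":
        return sum(v for v, _ in replies.values()) >= r
    # writes: only votes of replicas at the latest version M count
    M = max(vn for _, vn in replies.values())
    Vw = sum(v for v, vn in replies.values() if vn == M)
    return Vw >= w

print(have_quorum("read", replies, r=2, w=4))   # True: 4 votes collected
print(have_quorum("write", replies, r=2, w=4))  # False: current copies have 3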
Vote Assignment:
Let v be the total number of votes assigned to all copies. Read &
write quorum, r & w, are selected such that: r + w > v; w > v/2.
B. Prabhakaran 21
Static Voting ...
Vote Assignment ...:
Above values are determined so that there is a non-null
intersection between every read and write quorum, i.e., at least 1
current copy in any reading quorum gathered.
Write quorum is high enough to disallow simultaneous writes on 2
distinct subset of replicas.
The scheme can be modified to collect write quorums from non-
current replicas. Another modification: obsolete replicas updated.
(e.g.,) System with 4 replicas at 4 sites. Votes assigned: V1 = 1,
V2 = 1, V3 = 2, & V4 = 1.
Let disk latency at S1 = 75 ms, S2 = 750 ms, S3 = 750 ms, &
S4 = 100 ms.
If r = 1 & w = 5, then the read access time is 75 ms and the write
access time is 750 ms.
B. Prabhakaran 22
Dynamic Voting
Assume in the above example, site 3 becomes unreachable
from other sites.
Sites 1, 2, & 4 can still collect a quorum, however, Site 3
cannot.
A further partition of {1,2,4} will leave every site unable to collect a quorum.
Dynamic voting: adapt the number of votes or the set of
sites that can form a quorum, to the changing state of the
system due to sites & communication failures.
Approaches:
Majority based approach: set of sites change with system state.
This set can form a majority to allow access to replicated data.
Dynamic vote reassignment: number of votes assigned to a site
changes dynamically.
B. Prabhakaran 23
Majority-based Approach
• Figure indicates the partitions and
ABCDE
(one) merger that takes place
• Assume one vote per copy.
ABD • Static voting scheme: only partitions
CE
ABCDE, ABD, & ACE allowed
access.
AB D • Majority-based approach: one partition
can collect quorums and the other cannot.
• Partitions ABCDE, ABD, AB, A, and
A B ACE can collect quorums, others cannot.
ACE
B. Prabhakaran 24
Majority Approach ...
Notations used:
Version Number, VNi: of a replica at a site i is an integer that
counts the number of successful updates to the replica at i.
Initially set to 0.
Number of replicas updated, RUi: Number of replicas participating
in the most recent update. Initially set to the total number of
replicas.
Distinguished sites list, DSi: a variable at i that stores the IDs of
one or more sites. DSi depends on RUi.
RUi is even: DSi identifies the replica that is greater (as per the
linear ordering) than all the other replicas that participated in
the most recent update at i.
RUi is odd: DSi is nil.
RUi = 3: DSi lists the 3 replicas that participated in the most
recent update from which a majority is needed to allow access
to data.
B. Prabhakaran 25
Majority Approach: Example
Example:
5 replicas of a file are stored at sites A, B, C, D, and E. The state of
the system is shown in the table. Each replica has been updated 3
times, RUi is 5 for all sites, and DSi is nil (as RUi is odd and != 3).
A B C D E
VN 3 3 3 3 3
RU 5 5 5 5 5
DS - - - - -
B receives an update request, finds it can communicate only to A
& C. B finds that RU is 5 for the last update. Since partition ABC
has 3 of the 5 copies, B decides that it belongs to a distinguished
partition. State of the system:
A B C D E
VN 4 4 4 3 3
RU 3 3 3 5 5
DS ABC ABC ABC - -
B. Prabhakaran 26
Majority Approach: Example...
Example...:
Now, C needs to do an update and finds it can communicate only
to B. Since RUc is 3, it chooses the static voting protocol and so
DS & RU are not updated. System state:
A B C D E
VN 4 5 5 3 3
RU 3 3 3 5 5
DS ABC ABC ABC - -
Next, D makes an update, finds it can communicate with B,C, & E.
Latest version in BCDE is 5 with RU = 3. A majority from DS =
ABC is sought and is available (i.e., BC). RU is now set to 4. RU
is even, DS set to B (highest lexicographical order). System state:
A B C D E
VN 4 6 6 6 6
RU 3 4 4 4 4
DS ABC B B B B
B. Prabhakaran 27
Majority Approach: Example...
Example...:
C receives an update, finds it can communicate only with B. BC
has half the sites in the previous partition and has the distinguished
site B (DS is used to break the tie in the case of even numbers).
Update can be carried out in the partition. Resulting state:
     A    B    C    D    E
VN   4    7    7    6    6
RU   3    2    2    4    4
DS  ABC   B    B    B    B
Majority-based: Protocol
Site i receives an update and executes the following protocol:
i issues a Lock_Request to its local lock manager.
Lock granted? : i issues a Vote_Request to all the sites.
Site j receives the request: j issues a Lock_Request to its local
lock manager. Lock granted? : j sends the values of VNj, RUj, and
DSj to i.
Based on the responses received, i decides whether it belongs to
the distinguished partition (see the procedure below).
i does not belong to the distinguished partition? : i issues
Release_Lock to its local lock manager and Abort to the other sites
(which then issue Release_Lock to their local lock managers).
i belongs to the distinguished partition? : i performs the update on
the local copy (a current copy is obtained before the update if the
local copy is not current). i sends a commit message to the
participating sites with the missing updates and the values of VN,
RU, and DS, and issues a Release_Lock request to its local lock
manager.
Majority-based: Protocol
Site i receives an update and executes the following protocol...:
Site j receives the commit message: j updates its replica, VN, RU,
& DS, and sends a Release_Lock request to its local lock manager.
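Condensed into code, the coordinator-side flow might look like the following sketch; lock_manager, send, local_state, apply_update, and missing_updates are assumed helpers, and timeouts and failure handling are omitted:

```python
# Sketch of the coordinator-side flow of the majority-based update
# protocol. in_distinguished_partition and updated_variables are
# sketched after the next two slides.

def handle_update(i, all_sites, update):
    if not lock_manager.lock(i):                   # local Lock_Request
        return False
    responses = {i: local_state(i)}                # i's own (VN, RU, DS)
    for j in all_sites - {i}:
        reply = send(j, "Vote_Request")            # j replies (VNj, RUj, DSj)
        if reply is not None:                      # unreachable: no reply
            responses[j] = reply

    if not in_distinguished_partition(responses):
        for j in responses.keys() - {i}:
            send(j, "Abort")                       # j releases its local lock
        lock_manager.release(i)
        return False

    apply_update(i, update)                        # refresh local copy if stale
    M = max(vn for vn, _ru, _ds in responses.values())
    vn, ru, ds = updated_variables(set(responses), M)
    for j in responses.keys() - {i}:
        send(j, "Commit", missing_updates(i, j), (vn, ru, ds))
    lock_manager.release(i)
    return True
```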
Distinguished Partition: Let P denote the set of responding
sites.
i calculates M (the most recent version in the partition), Q (the
set of sites holding version M), and N (the number of sites that
participated in the latest update):
M = max{VNj : j in P}
Q = {j in P : VNj = M}
N = RUj, for any j in Q
|Q| > N/2 ? : i is a member of the distinguished partition.
|Q| = N/2 ? : the tie needs to be broken. Select a j in Q; if DSj
is in Q, i belongs to the distinguished partition. (If RUj is even,
DSj contains the highest-ordered site.) That is, i is in the
partition containing the distinguished site.
Majority-based: Protocol
Distinguished Partition...:
If N = 3 and P contains 2 or all 3 of the sites listed in the DS
variable of the site in Q, i belongs to the distinguished partition.
Otherwise, i does not belong to a distinguished partition.
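The three cases of this membership test fit in a short sketch; the response shapes ((VN, RU, DS) triples, with DS an empty set for nil) are assumptions based on the notation above:

```python
# Sketch of the distinguished-partition test. `responses` maps each
# site in P (including the caller) to its (VN, RU, DS) triple.

def in_distinguished_partition(responses):
    P = set(responses)
    M = max(vn for vn, _ru, _ds in responses.values())     # latest version
    Q = {j for j, (vn, _ru, _ds) in responses.items() if vn == M}
    any_q = next(iter(Q))          # members of Q share RU and DS,
    _vn, N, DS = responses[any_q]  # since they saw the same last update

    if len(Q) > N / 2:             # clear majority of the last update
        return True
    if len(Q) == N / 2:            # tie: the partition holding the
        return DS <= Q             # distinguished site (DS singleton) wins
    if N == 3:                     # majority of the 3 DS sites needed
        return len(DS & P) >= 2
    return False

# D's update in the example: P = {B, C, D, E}, latest version 5 at B and
# C with RU = 3, so |Q| = 2 > 3/2 and D is in a distinguished partition.
print(in_distinguished_partition({
    "B": (5, 3, {"A", "B", "C"}), "C": (5, 3, {"A", "B", "C"}),
    "D": (3, 5, set()),           "E": (3, 5, set())}))      # True
```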
Update: invoked when a site is ready to commit. The variables
are updated as follows:
VNi = M + 1
RUi = cardinality of P (i.e., |P|)
DSi is updated when N != 3 (since the static voting protocol is
used when N = 3):
DSi = K if RUi is even, where K is the highest-ordered site
DSi = P if RUi = 3
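Correspondingly, a sketch of this commit-time Update step. Two caveats: the site ordering is assumed (`rank` is a stand-in for the protocol's linear ordering, under which the example ranks B highest among B, C, D, E), and per the example the pure static-voting path (e.g., C updating within BC) bypasses this step, advancing only VN while RU and DS stay unchanged:

```python
# Sketch of the Update step, following the rules above. P is the set of
# participating sites, M the latest version found in P.

def updated_variables(P, M, rank=lambda site: site):
    vn = M + 1                       # VNi = M + 1
    ru = len(P)                      # RUi = |P|
    if ru % 2 == 0:
        ds = {max(P, key=rank)}      # highest-ordered participating site
    elif ru == 3:
        ds = set(P)                  # the 3 participants; majority needed next
    else:
        ds = set()                   # nil (RUi odd and != 3)
    return vn, ru, ds
```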
The majority-based protocol can have deadlocks, as it uses locks;
a deadlock detection & resolution mechanism is needed along
with it.
Failure Resilience
Resilient process: one that masks failures and guarantees
progress despite a certain number of failures.
Approaches:
Backup process:
A primary process + 1 or more backup processes.
The primary process executes while the backups remain inactive.
If the primary fails, one of the backup processes takes over the
functions of the primary process.
To facilitate this takeover, the state of the primary process is
checkpointed at appropriate intervals.
The checkpointed state is stored in a place that remains available
even if the primary process fails.
Plus:
Backup processes consume few system resources, as they are
inactive.
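As a toy illustration of this scheme (pickle and a local file stand in for storage that must survive the primary's failure; all names here are assumed):

```python
# Toy illustration of primary/backup checkpointing. In practice the
# checkpoint must live on replicated or stable storage.
import pickle

CHECKPOINT = "primary.ckpt"

def checkpoint(state):
    """Primary: persist its state at appropriate intervals."""
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

def take_over():
    """Backup: after detecting the primary's failure, resume from the
    last checkpoint (work done since that checkpoint is lost/redone)."""
    with open(CHECKPOINT, "rb") as f:
        return pickle.load(f)
```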
Failure Resilience
Resilient Approaches:
Backup process...:
Minus:
Delay in takeover, as (1) the backups need to detect the primary's
failure, and (2) the latest checkpointed state must be restored
before a backup can take over.
Replicated execution:
Several processes execute simultaneously. Service continues as
long as one of them is available.
Plus:
Provides increased reliability and availability.
Failure Resilience
Resilient Approaches...:
Replicated execution...:
Minus:
More system resources are needed, since all the processes execute.