Logical Clocks
Causal Ordering
Global State Recording
Termination Detection
Lamport’s Clock
Happened before relation:
a -> b : Event a occurred before event b. Events in the same
process.
a -> b : If a is the event of sending a message m in a process
and b is the event of receipt of the same message m by another
process.
a -> b, b -> c, then a -> c. “->” is transitive.
Causally Ordered Events
a -> b : Event a “causally” affects event b
Concurrent Events
a || b: if a !-> b and b !-> a
Space-time Diagram
[Figure: space-time diagram. Process P1 with events e11–e14 and process P2 with events e21–e24; internal events and messages are shown, with space on the vertical axis and time on the horizontal axis.]
Logical Clocks
Conditions satisfied:
Ci is clock in Process Pi.
If a -> b in process Pi, Ci(a) < Ci(b)
Let a: sending message m in Pi; b : receiving message m in Pj;
then, Ci(a) < Cj(b).
Implementation Rules:
R1: Ci = Ci + d (d > 0); clock is updated between two
successive events.
R2: Cj = max(Cj, tm+ d); (d > 0); When Pj receives a message
m with a time stamp tm (tm assigned by Pi, the sender; tm =
Ci(a), a being the event of sending message m).
A reasonable value for d is 1
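As a concrete illustration, here is a minimal Python sketch of rules R1 and R2 with d = 1 (the class and method names are illustrative, not from the slides):

```python
class LamportClock:
    """Scalar logical clock; implements rules R1 and R2 with d = 1."""

    def __init__(self):
        self.c = 0

    def tick(self):
        # R1: Ci = Ci + d between two successive events
        self.c += 1
        return self.c

    def send_stamp(self):
        # time-stamp an outgoing message with the sender's clock
        return self.tick()

    def receive(self, tm):
        # R2: Cj = max(Cj, tm + d) on receiving a message stamped tm
        self.c = max(self.c, tm + 1)
        return self.c
```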
Space-time Diagram
[Figure: space-time diagram with Lamport clock values. P1's events e11–e17 carry timestamps (1)–(7).]
Limitation of Lamport’s Clock
[Figure: space-time diagram; events e11–e13 on P1 carry timestamps (1)–(3).]
Lamport clocks guarantee a -> b implies C(a) < C(b), but the converse does not hold: C(a) < C(b) does not imply a -> b, so causality cannot be inferred from the timestamps alone.
Vector Clocks Comparison
1. Equal: ta = tb iff ta[i] = tb[i], for all i
2. Not equal: ta != tb iff ta[i] != tb[i], for at least one i
3. Less than or equal: ta <= tb iff ta[i] <= tb[i], for all i
4. Less than: ta < tb iff ta <= tb and ta != tb
5. Concurrent: ta || tb iff !(ta < tb) and !(tb < ta)
6. Not less than or equal: ta !<= tb iff ta[i] > tb[i], for at least one i
7. Not less than: ta !< tb iff !(ta < tb)
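A small Python sketch of comparisons 3–5 above, with timestamps as tuples (function names are illustrative):

```python
def vc_leq(ta, tb):
    """ta <= tb iff ta[i] <= tb[i] for all i."""
    return all(a <= b for a, b in zip(ta, tb))

def vc_less(ta, tb):
    """ta < tb iff ta <= tb and ta != tb."""
    return vc_leq(ta, tb) and ta != tb

def vc_concurrent(ta, tb):
    """ta || tb iff neither ta < tb nor tb < ta."""
    return not vc_less(ta, tb) and not vc_less(tb, ta)

# e.g., (2,0,0) and (0,0,2) are concurrent; (1,0,0) < (2,0,1)
assert vc_concurrent((2, 0, 0), (0, 0, 2))
assert vc_less((1, 0, 0), (2, 0, 1))
```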
Vector Clock …
[Figure: space-time diagram with vector clocks. P1's events e11–e13 carry (1,0,0), (2,0,0), (3,4,1); P3's events e31, e32 carry (0,0,1), (0,0,2).]
Causal Ordering of Messages
[Figure: space-time diagram illustrating causal ordering: Send(M1) happens before Send(M2), so M1 (1) must be delivered before M2 (2) at P3.]
Message Ordering …
The concern here is not maintaining clocks as such, but ordering the messages sent and received among all processes in a distributed system.
(e.g.,) Send(M1) -> Send(M2), M1 should be received
ahead of M2 by all processes.
This is not guaranteed by the communication network since
M1 may be from P1 to P2 and M2 may be from P3 to P4.
Message ordering:
Deliver a message only if the preceding one has already been
delivered.
Otherwise, buffer it up.
BSS Algorithm
BSS: Birman-Schiper-Stephenson algorithm. Assumes broadcast communication; messages carry vector time stamps.
BSS Algorithm ...
1. Process Pi increments the vector time VTpi[i], time-stamps message m, and broadcasts it. VTpi[i] - 1 denotes the number of messages from Pi preceding m.
2. Pj != Pi receives m. m is delivered when:
a. VTpj[i] == VTm[i] - 1
b. VTpj[k] >= VTm[k] for all k in {1,2,..n} - {i}, n is the
total number of processes. Delayed messages are queued in sorted order.
c. Concurrent messages are ordered by time of receipt.
3. When m is delivered at Pj, VTpj is updated according to rule 2 of the vector clocks.
2(a) : Pj has received all Pi’s messages preceding m.
2(b): Pj has received all other messages received by Pi
before sending m.
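A hedged Python sketch of the delivery test in rules 2(a) and 2(b) (names are illustrative):

```python
def bss_deliverable(vt_pj, vt_m, i):
    """True if message m, broadcast by Pi with timestamp vt_m,
    can be delivered at Pj, whose vector time is vt_pj."""
    n = len(vt_m)
    if vt_pj[i] != vt_m[i] - 1:          # 2(a): all of Pi's earlier broadcasts delivered
        return False
    return all(vt_pj[k] >= vt_m[k]       # 2(b): everything Pi had seen, Pj has seen
               for k in range(n) if k != i)

# On delivery, Pj merges: vt_pj[k] = max(vt_pj[k], vt_m[k]) for all k (rule 3).
```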
BSS Algorithm …
[Figure: BSS example with four processes. A message time-stamped (0,0,1) and a later one time-stamped (0,1,1) are broadcast; P1 receives (0,1,1) first and buffers it, then delivers it from the buffer after (0,0,1) has been delivered.]
SES Algorithm
SES: Schiper-Eggli-Sandoz Algorithm. No need for
broadcast messages.
Each process maintains a vector V_P of size N - 1, N being the number of processes in the system.
V_P is a vector of tuples (P', t): P' the destination process id and t a vector timestamp.
Tm: logical time of sending message m.
Tpi: present logical time at Pi.
Initially, V_P is empty.
SES Algorithm
Sending a Message:
Send message M, time stamped tm, along with V_P1 to P2.
Insert (P2, tm) into V_P1. Overwrite the previous value of
(P2,t), if any.
(P2, tm) is not sent. Any future message carrying (P2, tm) in V_P1 cannot be delivered to P2 until tm < Tp2.
Delivering a message
If V_M (in the message) does not contain any pair (P2, t), it can
be delivered.
/* (P2, t) exists */ If t !< Tp2, buffer the message. (Don’t
deliver).
else (t < Tp2) deliver it
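A minimal Python sketch of this delivery rule, assuming V_M is a list of (process id, vector timestamp) pairs (names illustrative):

```python
def ses_deliverable(v_m, tp, dest):
    """v_m: list of (process_id, t) pairs carried by the message;
    tp: destination's current vector time Tp; dest: destination id.
    Deliver iff v_m has no entry for dest, or that entry's t < tp."""
    for pid, t in v_m:
        if pid == dest:
            # t < tp: t[i] <= tp[i] for all i, and t != tp
            return all(a <= b for a, b in zip(t, tp)) and tuple(t) != tuple(tp)
    return True  # no (dest, t) pair: deliver immediately
```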
SES Algorithm ...
What does the condition t !< Tp2 imply?
t is the message's vector time stamp in the pair (P2, t).
t !< Tp2 means t[j] > Tp2[j] for at least one j.
This implies some events occurred without P2's knowledge in other processes. So P2 decides to buffer the message.
When t < Tp2, message is delivered & Tp2 is updated
with the help of V_P2 (after the merge operation).
SES Buffering Example
[Figure: SES buffering example. M1 from P2 to P1 (Tp2 = (0,1,0), V_P2 empty); M2 from P2 to P3 (Tp2 = (0,2,0), V_P2 = {(P1, <0,1,0>)}; afterwards V_P2 also holds (P3, <0,2,0>)); M3 and M4 from P3 to P1 (Tp3 = (0,2,1), then (0,2,2); V_P3 = {(P1, <0,1,0>)}, later {(P1, <0,2,2>)}); Tp1 advances from (0,0,0) to (1,1,0) and then (2,2,2).]
SES Buffering Example...
M1 from P2 to P1: M1 + Tm (=<0,1,0>) + Empty V_P2
M2 from P2 to P3: M2 + Tm (<0, 2, 0>) + (P1, <0,1,0>)
M3 from P3 to P1: M3 + <0,2,2> + (P1, <0,1,0>)
M3 gets buffered because:
Tp1 is <0,0,0>, t in (P1, t) is <0,1,0> & so Tp1 < t
When M1 is received by P1:
Tp1 becomes <1,1,0>, by rules 1 and 2 of vector clock.
After updating Tp1, P1 checks buffered M3.
Now, Tp1 > t [t in (P1, <0,1,0>)].
So M3 is delivered.
SES Algorithm ...
On delivering the message:
Merge V_M (in the message) with V_P2 as follows:
If there is no entry for a process P in V_P2, insert (P, t) from V_M.
If an entry (P, t) is present in V_P2, update t with the component-wise maximum: t[i] := max(t[i] in V_M, t[i] in V_P2).
As before, a message cannot be delivered to P until the t it carries for P is less than P's present logical time.
Update site P2’s local, logical clock.
Check buffered messages after local, logical clock
update.
SES Algorithm …
[Figure: another SES example. P3 sends M1 at Tp3 = (0,0,1) with V_P3 empty; P2 receives it (Tp2 = (0,1,1)) and sends M2 at (0,2,1) with V_P2 empty; P1 receives M2 (Tp1 = (1,2,1), then (2,2,1)); Tp3 later advances to (0,2,2).]
Handling Multicasts
Each node can maintain n x n matrix M, n being the
number of processes.
Node i multicasts to j and k: increments Mi[i,j] and
Mi[i,k]. M sent along with the message.
When node j receives message m from i, it can be
delivered if and only if:
Mj[i,j] = Mm[i,j] - 1
Mj[k,j] >= Mm[k,j] for all k != i.
Else buffer the message
On message delivery: Mj[x,y] = max(Mj[x,y], Mm[x,y])
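A Python sketch of the matrix-based delivery test and merge, with matrices as lists of lists (illustrative names):

```python
def multicast_deliverable(M_j, M_m, i, j):
    """M_j: node j's n x n matrix; M_m: matrix carried by message m from node i.
    Deliver at j iff M_j[i][j] == M_m[i][j] - 1 and
    M_j[k][j] >= M_m[k][j] for all k != i."""
    n = len(M_m)
    if M_j[i][j] != M_m[i][j] - 1:
        return False
    return all(M_j[k][j] >= M_m[k][j] for k in range(n) if k != i)

def on_deliver(M_j, M_m):
    """Component-wise maximum on delivery."""
    n = len(M_m)
    for x in range(n):
        for y in range(n):
            M_j[x][y] = max(M_j[x][y], M_m[x][y])
```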
Handling Multicasts: Example
[Figure: multicast example with P1, P2, P3, showing each node's 3x3 matrix (initially all zeros) before and after messages M1 and M2 are multicast and delivered.]
Global State
Global State 1: A = $500, B = $200; C1: empty, C2: empty.
Global State 2: A = $450, B = $200; C1: Tx $50, C2: empty.
Global State 3: A = $450, B = $250; C1: empty, C2: empty.
Recording Global State...
(e.g.,) Global state of A is recorded in (1) and not in (2).
State of B, C1, and C2 are recorded in (2)
Extra amount of $50 will appear in global state
Reason: A’s state recorded before sending message and C1’s state
after sending message.
Inconsistent global state if n < n’, where
n is number of messages sent by A along channel before A’s state
was recorded
n’ is number of messages sent by A along the channel before
channel’s state was recorded.
Consistent global state: n = n’
Recording Global State...
Similarly, for consistency m = m’
m’: no. of messages received along channel before B’s state recording
m: no. of messages received along channel by B before channel’s state was
recorded.
Also, n’ >= m, as in no system no. of messages sent along the
channel be less than that received
Hence, n >= m
Consistent global state should satisfy the above equation.
Consistent global state:
Channel state: sequence of messages sent before recording sender’s state,
excluding the messages received before receiver’s state was recorded.
Only transit messages are recorded in the channel state.
Recording Global State
Send(Mij): message M sent from Si to Sj
rec(Mij): message M received by Sj, from Si
time(x): Time of event x
LSi: local state at Si
send(Mij) is in LSi iff (if and only if) time(send(Mij)) <
time(LSi)
rec(Mij) is in LSj iff time(rec(Mij)) < time(LSj)
transit(LSi, LSj) : set of messages sent/recorded at LSi
and NOT received/recorded at LSj
Recording Global State …
inconsistent(LSi,LSj): set of messages NOT sent/recorded
at LSi and received/recorded at LSj
Global State, GS: {LS1, LS2,…., LSn}
Consistent global state: GS = {LS1, .., LSn} AND, for all pairs i, j, inconsistent(LSi, LSj) is null.
Transitless global state: GS = {LS1, .., LSn} AND, for all pairs i, j, transit(LSi, LSj) is null.
Recording Global State ..
[Figure: S1 records LS1, S2 records LS2. M1, sent before LS1 was recorded and not received before LS2: transit. M2, sent after LS1 was recorded but received before LS2: inconsistent.]
Recording Global State...
Strongly consistent global state: consistent and transitless,
i.e., all send and the corresponding receive events are
recorded in all LSi.
[Figure: local states LS11, LS12 of S1 and LS21, LS22, LS23 of S2, illustrating which recording points form a strongly consistent global state.]
Chandy-Lamport Algorithm
Distributed algorithm to capture a consistent global state. Communication channels
assumed to be FIFO.
Uses a marker to initiate the algorithm. The marker is a kind of dummy message, with no effect on the functions of the processes.
Sending Marker by P:
P records its state.
For each outgoing channel C, P sends a marker on C before P sends further
messages along C.
Receiving Marker by Q:
If Q has NOT recorded its state: (a). Record the state of C as an empty sequence.
(b) SEND marker (use above rule).
Else (Q has recorded state before): Record the state of C as sequence of messages
received along C, after Q’s state was recorded and before Q received the marker.
FIFO channel condition + markers help in satisfying consistency
condition.
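A condensed Python sketch of the marker rules from one process's point of view (the class and the send() transport are assumptions, not from the slides):

```python
class SnapshotProcess:
    """One process of the Chandy-Lamport snapshot; channels assumed FIFO."""

    def __init__(self, pid, in_channels, out_channels):
        self.pid = pid
        self.recorded = False
        self.state = None
        self.chan_state = {c: [] for c in in_channels}   # recorded channel states
        self.marker_seen = {c: False for c in in_channels}
        self.out_channels = out_channels

    def record_and_send_markers(self, send):
        self.state = self.snapshot_local_state()
        self.recorded = True
        for c in self.out_channels:      # marker goes out before any later message
            send(c, "MARKER")

    def on_marker(self, channel, send):
        if not self.recorded:
            self.chan_state[channel] = []        # record channel state as empty
            self.record_and_send_markers(send)
        self.marker_seen[channel] = True         # stop recording on this channel

    def on_message(self, channel, msg):
        if self.recorded and not self.marker_seen[channel]:
            self.chan_state[channel].append(msg)  # in-transit message

    def snapshot_local_state(self):
        return {}   # placeholder for the application's local state
```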
Chandy-Lamport Algorithm
Initiation of marker can be done by any process, with its own unique
marker: <process id, sequence number>.
Several processes can initiate state recording by sending markers.
Concurrent sending of markers allowed.
One possible way to collect global state: all processes send the
recorded state information to the initiator of marker. Initiator process
can sum up the global state.
Chandy-Lamport Algorithm ...
Example:
[Figure: Pi, Pj, Pk each record their local state, send markers, and record channel states.]
Channel state example: M1 sent to Px at t1, M2 sent to Py at t2, ...
Cuts
Cuts: graphical representation of a global state.
Cut C = {c1, c2, .., cn}; ci: cut event at Si.
Consistent cut: every message received by a site before its cut event was sent before the cut event at the sender.
One can prove: A cut is a consistent cut iff no two cut
events are causally related, i.e., !(ci -> cj) and !(cj -> ci).
Time of a Cut
C = {c1, c2, .., cn} with vector time stamp VTci. Vector
time of the cut, VTc = sup(VTc1, VTc2, .., VTcn).
sup is the component-wise maximum, i.e., VTc[i] = max(VTc1[i], VTc2[i], .., VTcn[i]).
Now, a cut is consistent iff VTc = (VTc1[1], VTc2[2], ..,
VTcn[n]).
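In Python, the cut time and the consistency test can be sketched as follows (illustrative):

```python
def cut_time(vts):
    """VTc = sup(VTc1, .., VTcn): component-wise maximum."""
    return tuple(max(col) for col in zip(*vts))

def cut_is_consistent(vts):
    """Consistent iff VTc == (VTc1[1], VTc2[2], .., VTcn[n])."""
    return cut_time(vts) == tuple(vt[i] for i, vt in enumerate(vts))
```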
Termination Detection
Termination: completion of an algorithm's computation. (e.g.,) leader election, deadlock detection, deadlock resolution.
Use a controlling agent or a monitor process.
Initially, all processes are idle. Weight of controlling agent is 1 (0 for
others).
Start of computation: message from controller to a process. Weight: split
into half (0.5 each).
Repeat this: any time a process sends a computation message to another process, split the weight between the two processes (e.g., 0.25 each after the next split).
End of computation: process sends its weight to the controller. Add this
weight to that of controller’s. (Sending process’s weight becomes 0).
Rule: Sum of W always 1.
Termination: When weight of controller becomes 1 again.
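A small Python sketch of this weight-throwing scheme, using exact fractions to avoid floating-point loss (all names are illustrative):

```python
from fractions import Fraction

class Node:
    def __init__(self, weight=0):
        self.weight = Fraction(weight)

controller = Node(1)   # controlling agent starts with weight 1

def send_computation(sender, receiver):
    """Split the sender's weight; half travels on the computation message."""
    half = sender.weight / 2
    sender.weight -= half
    receiver.weight += half

def finish(proc):
    """An idle process returns its weight to the controller."""
    controller.weight += proc.weight
    proc.weight = Fraction(0)

def terminated():
    return controller.weight == 1

# e.g.: controller starts p1; p1 starts p2; both finish -> termination detected
p1, p2 = Node(), Node()
send_computation(controller, p1)   # controller 1/2, p1 1/2
send_computation(p1, p2)           # p1 1/4, p2 1/4
finish(p2); finish(p1)
assert terminated()
```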
Huang’s Algorithm
B. Prabhakaran 38
Huang’s Algorithm
1/4 P1 0.5 P1
P4 P5 0 P4 0 P5
1/8 1/16
B. Prabhakaran 39
Mutual Exclusion Algorithms
• Non-token based:
• A site/process can enter a critical section when an
assertion (condition) becomes true.
• Algorithm should ensure that the assertion will be true
in only one site/process.
• Token based:
• A unique token (a known, unique message) is shared
among cooperating sites/processes.
• Possessor of the token has access to critical section.
• Need to take care of conditions such as loss of token,
crash of token holder, possibility of multiple tokens, etc.
General System Model
At any instant, a site may have several requests for critical
section (CS), queued up, and serviced one at a time.
Site States: Requesting CS, executing CS, idle (neither
requesting nor executing CS).
Requesting CS: blocked until granted access, cannot
make additional requests for CS.
Executing CS: using the CS.
Idle: no CS-related activity at the site. In token-based approaches, an idle site can be holding the token.
Mutual Exclusion: Requirements
Freedom from deadlocks: two or more sites should not
endlessly wait on conditions/messages that never become
true/arrive.
Freedom from starvation: No indefinite waiting.
Fairness: Order of execution of CS follows the order of
the requests for CS. (equal priority).
Fault tolerance: recognize “faults”, reorganize, continue.
(e.g., loss of token).
Performance
Number of messages per CS invocation: should be
minimized.
Synchronization delay, i.e., time between the leaving of
CS by a site and the entry of CS by the next one: should
be minimized.
Response time: time interval between the transmission of a site's request messages and its exit from the CS.
System throughput, i.e., rate at which system executes
requests for CS: should be maximized.
If sd is synchronization delay, E the average CS execution
time: system throughput = 1 / (sd + E).
Performance metrics
[Figure: timeline of the performance metrics. The synchronization delay runs from the last site exiting the CS to the next site entering it. The response time runs from the CS request's messages being sent until the site exits the CS; E marks the CS execution time within it.]
Performance ...
Low and High Load:
Low load: No more than one request at a given point in time.
High load: Always a pending mutual exclusion request at a site.
Best and Worst Case:
Best Case (low loads): Round-trip message delay + Execution
time. 2T + E.
Worst case (high loads).
Message traffic: low at low loads, high at high loads.
Average performance: when load conditions fluctuate
widely.
Simple Solution
Control site: grants permission for CS execution.
A site sends REQUEST message to control site.
Controller grants access one by one.
Synchronization delay: 2T. A site releases the CS by sending a message to the controller, and the controller then sends permission to another site.
System throughput: 1/(2T + E). If synchronization delay
is reduced to T, throughput doubles.
Controller becomes a bottleneck, congestion can occur.
Non-token Based Algorithms
Notations:
Si: site i
Ri: Request set, containing the ids of all Sis from which
permission must be received before accessing CS.
Non-token based approaches use time stamps to order requests
for CS.
Smaller time stamps get priority over larger ones.
Lamport’s Algorithm
Ri = {S1, S2, …, Sn}, i.e., all sites.
Request queue: maintained at each Si. Ordered by time stamps.
Assumption: message delivered in FIFO.
Lamport’s Algorithm
Requesting CS:
Send REQUEST(tsi, i). (tsi, i): request time stamp. Place the REQUEST in request_queuei.
On receiving the REQUEST, Sj sends a time-stamped REPLY message to Si and places Si's request in request_queuej.
Executing CS:
Si has received a message with time stamp larger than (tsi,i) from all
other sites.
Si’s request is the top most one in request_queuei.
Releasing CS:
Exiting CS: send a time stamped RELEASE message to all sites in its
request set.
Receiving RELEASE message: Sj removes Si’s request from its queue.
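A Python sketch of one site's logic, assuming a FIFO send(site, message) transport (names and structure are illustrative, not the slides' notation):

```python
import heapq

class LamportMutex:
    """One site's view of Lamport's algorithm; channels assumed FIFO."""

    def __init__(self, site_id, n, send):
        self.id, self.n, self.send = site_id, n, send
        self.clock = 0
        self.queue = []      # heap of (timestamp, site id) request pairs
        self.last_ts = {}    # highest timestamp seen from each other site

    def others(self):
        return [s for s in range(self.n) if s != self.id]

    def request_cs(self):
        self.clock += 1
        self.my_ts = self.clock
        heapq.heappush(self.queue, (self.my_ts, self.id))
        for s in self.others():
            self.send(s, ("REQUEST", self.my_ts, self.id))

    def on_message(self, kind, ts, sender):
        self.clock = max(self.clock, ts) + 1
        self.last_ts[sender] = ts
        if kind == "REQUEST":
            heapq.heappush(self.queue, (ts, sender))
            self.clock += 1
            self.send(sender, ("REPLY", self.clock, self.id))
        elif kind == "RELEASE":
            self.queue = [(t, s) for (t, s) in self.queue if s != sender]
            heapq.heapify(self.queue)

    def can_enter_cs(self):
        # own request at the head, plus a later-stamped message from every other site
        return (self.queue and self.queue[0] == (self.my_ts, self.id) and
                all(self.last_ts.get(s, 0) > self.my_ts for s in self.others()))

    def release_cs(self):
        self.queue = [(t, s) for (t, s) in self.queue if s != self.id]
        heapq.heapify(self.queue)
        for s in self.others():
            self.clock += 1
            self.send(s, ("RELEASE", self.clock, self.id))
```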
Lamport’s Algorithm…
Performance.
3(N-1) messages per CS invocation. (N - 1) REQUEST, (N - 1)
REPLY, (N - 1) RELEASE messages.
Synchronization delay: T
Optimization
Suppress reply messages. (e.g.,) Sj receives a REQUEST
message from Si after sending its own REQUEST message with
time stamp higher than that of Si’s. Do NOT send REPLY
message.
Messages reduced to between 2(N-1) and 3(N-1).
Lamport’s Algorithm: Example
[Figure: Step 1: S1 broadcasts REQUEST (2,1); S2 broadcasts REQUEST (1,2).
Step 2: every site's request queue holds (1,2), (2,1); S2 enters the CS.]
Lamport’s: Example…
[Figure: Step 3: S2 leaves the CS and broadcasts RELEASE; queues still hold (1,2), (2,1).
Step 4: (1,2) is removed; every queue holds (2,1); S1 enters the CS.]
Ricart-Agrawala Algorithm
Requesting critical section
Si sends time stamped REQUEST message
Sj sends REPLY to Si, if
Sj is neither requesting nor executing the CS, or
Sj is requesting the CS and Si's time stamp is smaller than Sj's own request's time stamp.
Otherwise, the REPLY is deferred.
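The REPLY decision can be sketched in Python, with timestamps as (clock, site id) pairs compared lexicographically (illustrative):

```python
def should_reply(sj_state, sj_request_ts, si_request_ts):
    """Sj's decision on receiving Si's REQUEST.
    States: 'idle', 'requesting', 'executing'."""
    if sj_state == "idle":
        return True                            # neither requesting nor executing
    if sj_state == "requesting":
        return si_request_ts < sj_request_ts   # smaller timestamp wins
    return False                               # executing: defer

# deferred requests are replied to when Sj exits the CS
```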
Ricart-Agrawala: Performance
Performance:
2(N-1) messages per CS execution. (N-1) REQUEST + (N-1)
REPLY.
Synchronization delay: T.
Optimization:
When Si receives a REPLY message from Sj, Si is authorized to access the CS until Sj sends a REQUEST message and Si sends a REPLY.
Si can access the CS repeatedly until then.
A site requests permission from dynamically varying set of
sites: 0 to 2(N-1) messages.
Ricart-Agrawala: Example
[Figure: Step 1: S1 requests with time stamp (2,1); S2 requests with (1,2).
Step 2: S2 enters the CS, deferring its REPLY to the (2,1) request.]
Ricart-Agrawala: Example…
[Figure: Step 3: S2 leaves the CS and sends the deferred REPLY for (2,1); S1 enters the CS.]
Maekawa’s Algorithm
A site requests permission only from a subset of sites.
Request sets of sites Si & Sj: Ri, Rj such that Ri and Rj have at least one common site Sk. Sk mediates conflicts between Ri and Rj.
A site can send only one REPLY message at a time, i.e., a site can send a REPLY message only after receiving a RELEASE message for the previous REPLY message.
Request set rules:
Sets Ri and Rj have at least one common site.
Si is always in Ri.
Cardinality of Ri, i.e., the number of sites in Ri, is K.
Any site Si is contained in K of the Ris. With N = K(K - 1) + 1, K is approximately the square root of N.
Maekawa’s Algorithm ...
Requesting CS
Si sends REQUEST(i) to sites in Ri.
Sj sends REPLY to Si if
Sj has NOT sent a REPLY message to any site after it
received the last RELEASE message.
Otherwise, queue up Si’s request.
Request Subsets
Example k = 2; (N = 3).
R1 = {1, 2}; R3 = {1, 3}; R2 = {2, 3}
Example k = 3; N = 7.
R1 = {1, 2, 3}; R4 = {1, 4, 5}; R6 = {1, 6, 7};
R2 = {2, 4, 6}; R5 = {2, 5, 7}; R7 = {3, 4, 7};
R3 = {3, 5, 6}
Algorithm in Maekawa’s paper (uploaded in
Lecture Notes web page).
Maekawa’s Algorithm ...
Performance
Synchronization delay: 2T
Messages: 3 times square root of N (one each for REQUEST,
REPLY, RELEASE messages)
Deadlocks
Message deliveries are not ordered.
Assume Si, Sj, Sk concurrently request the CS.
Ri ∩ Rj = {Sij}, Rj ∩ Rk = {Sjk}, Rk ∩ Ri = {Ski}
Possible that: Sij is locked by Si (forcing Sj to wait at Sij), Sjk by Sj, and Ski by Sk, so the three sites wait on one another: deadlock.
Handling Deadlocks
A site yields to a request that has a smaller time stamp.
A site suspects a deadlock when it is locked by a request
with a higher time stamp (lower priority).
Deadlock handling messages:
FAILED: from Si to Sj -> Si has granted permission to higher
priority request.
INQUIRE: from Si to Sj -> Si would like to know whether Sj has succeeded in locking all the sites in Sj's request set.
YIELD: from Si to Sj -> Si is returning permission to Sj so that
Sj can yield to a higher priority request.
Handling Deadlocks
REQUEST(tsi,i) to Sj:
Sj is locked by Sk -> Sj sends FAILED to Si, if Si’s request has higher time
stamp.
Otherwise, Sj sends INQUIRE(j) to Sk.
INQUIRE(j) to Sk:
Sk sends a YIELD(k) to Sj if Sk has received a FAILED message from a site in Sk's request set, or if Sk sent a YIELD and has not received a new REPLY.
YIELD(k) to Sj:
Sj assumes it has been released by Sk, places Sk’s request in its queue
appropriately, sends a REPLY(j) to the top request in its queue.
Sites may exchange these messages even if there is no real
deadlock. Maximum number of messages per CS request: 5 times
square root of N.
Token-based Algorithms
Unique token circulates among the participating sites.
A site can enter CS if it has the token.
Token-based approaches use sequence numbers instead of
time stamps.
Request for a token contains a sequence number.
Sequence numbers of sites advance independently.
Correctness issue is trivial since only one token is present
-> only one site can enter CS.
Deadlock and starvation issues to be addressed.
Suzuki-Kasami Algorithm
If a site without a token needs to enter a CS, broadcast a REQUEST for
token message to all other sites.
Token: (a) a queue of requesting sites; (b) an array LN[1..N], where LN[j] is the sequence number of site Sj's most recently executed request.
Token holder sends token to requestor, if it is not inside CS. Otherwise,
sends after exiting CS.
Token holder can make multiple CS accesses.
Design issues:
Distinguishing outdated REQUEST messages.
Format: REQUEST(j,n) -> jth site making nth request.
Each site has RNi[1..N] -> RNi[j] is the largest sequence number of request
from j.
Determining which site has an outstanding token request.
If LN[j] = RNi[j] - 1, then Sj has an outstanding request.
Suzuki-Kasami Algorithm ...
Passing the token
After finishing CS
(assuming Si has token), LN[i] := RNi[i]
Token consists of Q and LN. Q is a queue of requesting sites.
Token holder checks if RNi[j] = LN[j] + 1. If so, place j in Q.
Send token to the site at head of Q.
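A Python sketch of the token-passing step, with RN and LN as lists and the token queue as a list of site ids (illustrative):

```python
def on_finish_cs(i, token_LN, token_Q, RN):
    """Suzuki-Kasami token passing after Si exits the CS.
    token_LN[j]: sequence number of Sj's most recently executed request;
    RN: Si's local RNi vector; token_Q: queue of requesting site ids."""
    token_LN[i] = RN[i]                     # Si's latest request is now served
    for j in range(len(RN)):
        if j != i and j not in token_Q and RN[j] == token_LN[j] + 1:
            token_Q.append(j)               # Sj has an outstanding request
    if token_Q:
        next_site = token_Q.pop(0)
        return next_site                    # send (token_Q, token_LN) to this site
    return None                             # no requests: keep the token, idle
```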
Performance
0 to N messages per CS invocation.
Synchronization delay is 0 (if the token holder repeats CS) or T.
Suzuki-Kasami: Example
Step 1: S1 has token, S3 is in queue
  Site  RN           Token LN     Token queue
  S1    (10, 15, 9)  (10, 15, 8)  <3>
  S2    (10, 16, 9)
  S3    (10, 15, 9)
Step 2: S3 gets token, S2 in queue
  Site  RN           Token LN     Token queue
  S1    (10, 16, 9)
  S2    (10, 16, 9)
  S3    (10, 16, 9)  (10, 15, 9)  <2>
Step 3: S2 gets token, queue empty
  Site  RN           Token LN     Token queue
  S1    (10, 16, 9)
  S2    (10, 16, 9)  (10, 16, 9)  <empty>
  S3    (10, 16, 9)
Singhal’s Heuristic Algorithm
Instead of broadcast: each site maintains information on other sites, guess the
sites likely to have the token.
Data Structures:
Si maintains SVi[1..M] and SNi[1..M] for storing information on other sites: state
and highest sequence number.
Token contains 2 arrays: TSV[1..M] and TSN[1..M].
States of a site
R : requesting CS
E : executing CS
H : Holding token, idle
N : None of the above
Initialization:
SVi[j] := N, for j = i .. M; SVi[j] := R, for j = 1 .. i-1; SNi[j] := 0, for j = 1 .. M. S1 (site 1) is in state H.
Token: TSV[j] := N & TSN[j] := 0, for j = 1 .. M.
Singhal’s Heuristic Algorithm …
Requesting CS
If Si has no token and requests CS:
SVi[i] := R. SNi[i] := SNi[i] + 1.
Send REQUEST(i,sn) to sites Sj for which SVi[j] = R. (sn: sequence
number, updated value of SNi[i]).
Receiving REQUEST(i,sn): if sn <= SNj[i], ignore. Otherwise, update
SNj[i] and do:
SVj[j] = N -> SVj[i] := R.
SVj[j] = R -> If SVj[i] != R, set it to R & send REQUEST(j,SNj[j]) to
Si. Else do nothing.
SVj[j] = E -> SVj[i] := R.
SVj[j] = H -> SVj[i] := R, TSV[i] := R, TSN[i] := sn, SVj[j] = N.
Send token to Si.
Executing CS: after getting token. Set SVi[i] := E.
Singhal’s Heuristic Algorithm …
Releasing CS
SVi[i] := N, TSV[i] := N. Then, do:
For other Sj: if (SNi[j] > TSN[j]), then {TSV[j] := SVi[j];
TSN[j] := SNi[j]}
else {SVi[j] := TSV[j]; SNi[j] := TSN[j]}
If SVi[j] = N, for all j, then set SVi[i] := H. Else send token to a
site Sj provided SVi[j] = R.
Fairness of algorithm will depend on choice of Si, since no
queue is maintained in token.
Arbitration rules to ensure fairness used.
Performance
Low to moderate loads: average of N/2 messages.
High loads: N messages (all sites request CS).
Synchronization delay: T.
Singhal: Example
[Figure: staircase pattern of the state vectors SVi for sites S1 .. Sn.
(a) Initial pattern: S1 = (H, N, N, N, .., N), S2 = (R, N, N, N, .., N), S3 = (R, R, N, N, .., N), S4 = (R, R, R, N, .., N), .., Sn = (R, R, R, R, .., N).
(b) Pattern after S3 gets the token from S1.]
Each row in the matrix has an increasing number of Rs: site Si has i - 1 entries set to R, which identifies the staircase pattern. The order of occurrence of the Rs in a row does not matter.
Singhal: Example…
• Assume there are 3 sites in the system. Initially:
Site 1: SV1[1] = H, SV1[2] = N, SV1[3] = N. SN1[1], SN1[2], SN1[3] are 0.
Site 2: SV2[1] = R, SV2[2] = N, SV2[3] = N. SNs are 0.
Site 3: SV3[1] = R, SV3[2] = R, SV3[3] = N. SNs are 0.
Token: TSVs are N. TSNs are 0.
• Assume site 2 is requesting token.
S2 sets SV2[2] = R, SN2[2] = 1.
S2 sends REQUEST(2,1) to S1 (since only S1 is set to R in SV2)
• S1 receives the REQUEST. Accepts the REQUEST since SN1[2] is smaller than
the message sequence number.
Since SV1[1] is H: SV1[2] = R, TSV[2] = R, TSN[2] = 1, SV1[1] = N.
Send token to S2
• S2 receives the token. SV2[2] = E. After exiting the CS, SV2[2] = TSV[2] = N.
Updates SN, SV, TSN, TSV. Since nobody is REQUESTing, SV2[2] = H.
• Assume S3 makes a REQUEST now. It will be sent to both S1 and S2. Only S2
responds since only SV2[2] is H (SV1[1] is N now).
Raymond’s Algorithm
Sites are arranged in a logical directed tree. Root: token holder. Edges:
directed towards root.
Every site has a variable holder that points to an immediate neighbor node on the directed path towards the root. (The root's holder points to itself.)
Requesting CS
If Si does not hold token and request CS, sends REQUEST upwards
provided its request_q is empty. It then adds its request to request_q.
Non-empty request_q -> REQUEST message for top entry in q (if not
done before).
Site on path to root receiving REQUEST -> propagate it up, if its
request_q is empty. Add request to request_q.
Root on receiving REQUEST -> send token to the site that forwarded the
message. Set holder to that forwarding site.
Any Si receiving token -> delete top entry from request_q, send token to
that site, set holder to point to it. If request_q is non-empty now, send
REQUEST message to the holder site.
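A Python sketch of one node's behavior, assuming a send(node, message) transport (names illustrative):

```python
class RaymondNode:
    """One node in Raymond's tree algorithm (sketch)."""

    def __init__(self, nid, holder, send):
        self.id, self.holder, self.send = nid, holder, send
        self.request_q = []

    def request_cs(self):
        # if holder == self.id, the site already holds the token and may enter
        if self.holder != self.id:
            if not self.request_q:
                self.send(self.holder, ("REQUEST", self.id))
            self.request_q.append(self.id)

    def on_request(self, frm):
        if self.holder == self.id and not self.request_q:
            self.holder = frm                  # root passes the token down
            self.send(frm, ("TOKEN",))
        else:
            if not self.request_q:
                self.send(self.holder, ("REQUEST", self.id))
            self.request_q.append(frm)

    def on_token(self):
        top = self.request_q.pop(0)
        if top == self.id:
            self.holder = self.id              # enter the CS
        else:
            self.holder = top
            self.send(top, ("TOKEN",))
        if self.request_q and self.holder != self.id:
            self.send(self.holder, ("REQUEST", self.id))
```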
Raymond’s Algorithm …
Executing CS: a site enters the CS when it receives the token and is itself at the top of its request_q; it deletes the top entry and enters the CS.
Releasing CS
If request_q is non-empty, delete top entry from q, send token to
that site, set holder to that site.
If request_q is non-empty now, send REQUEST message to the
holder site.
Performance
Average messages: O(log N) as average distance between 2 nodes
in the tree is O(log N).
Synchronization delay: (T log N) / 2, as average distance between
2 sites to successively execute CS is (log N) / 2.
Greedy approach: Intermediate site getting the token may enter CS
instead of forwarding it down. Affects fairness, may cause
starvation.
Raymond’s Algorithm: Example
[Figure: Step 1: tree of sites S1–S7 with token holder S1 at the root; a request originating at S4 propagates upward.
Step 2: the token moves down to S2.]
Raymond’s Algm.: Example…
[Figure: Step 3: the token reaches S4, which becomes the new token holder.]
Comparison
Algorithm       Resp. time (ll)   Sync. delay    Messages (ll)   Messages (hl)
Suzuki-Kasami   2T + E            T              N               N
Singhal         2T + E            T              N/2             N
Raymond         T log(N) + E      T log(N) / 2   log(N)          4
(ll: low load; hl: high load)
Distributed Deadlock Detection
• Assumptions:
• System has only reusable resources
• Only exclusive access to resources
• Only one copy of each resource
• States of a process: running or blocked
• Running state: process has all the resources
• Blocked state: waiting on one or more resource
Deadlocks
• Resource Deadlocks
• A process needs multiple resources for an activity.
• Deadlock occurs if each process in a set requests resources held by another process in the same set, and it must receive all the requested resources to move further.
• Communication Deadlocks
• Processes wait to communicate with other processes in a set.
• Each process in the set is waiting on another process’s
message, and no process in the set initiates a message
until it receives a message for which it is waiting.
Graph Models
Nodes of the graph are processes. Edges of the graph represent pending requests for, or assignments of, resources.
Wait-for Graphs (WFG): P1 -> P2 implies P1 is waiting
for a resource from P2.
Transaction-wait-for Graphs (TWF): WFG in databases.
Deadlock: directed cycle in the graph.
Cycle example: P1 -> P2 -> P1.
Graph Models
Wait-for Graphs (WFG): P1 -> P2 implies P1 is waiting
for a resource from P2.
[Figure: P1 waits for resource R1 held by P2; P2 waits for R2 held by P1.]
AND, OR Models
AND Model
A process/transaction can simultaneously request for multiple
resources.
Remains blocked until it is granted all of the requested
resources.
OR Model
A process/transaction can simultaneously request for multiple
resources.
Remains blocked till any one of the requested resource is
granted.
Sufficient Condition
Deadlock?
[Figure: a WFG over processes P1–P6; does it represent a deadlock?]
AND, OR Models
AND Model
Presence of a cycle.
[Figure: a WFG containing a cycle, which suffices for deadlock in the AND model.]
AND, OR Models
OR Model
Presence of a knot.
Knot: Subset of a graph such that starting from any
node in the subset, it is impossible to leave the knot
by following the edges of the graph.
[Figure: a WFG over processes P1–P6 containing a knot.]
Deadlock Handling Strategies
Deadlock Prevention: difficult
Deadlock Avoidance: before allocation, check for
possible deadlocks.
Difficult as it needs global state info in each site (that handles
resources).
Deadlock Detection: Find cycles. Focus of discussion.
Deadlock detection algorithms must satisfy 2 conditions:
No undetected deadlocks.
No false deadlocks.
Distributed Deadlocks
Centralized Control
A control site constructs wait-for graphs (WFGs) and checks
for directed cycles.
WFG can be maintained continuously (or) built on-demand by
requesting WFGs from individual sites.
Distributed Control
WFG is spread over different sites. Any site can initiate the deadlock detection process.
Hierarchical Control
Sites are arranged in a hierarchy.
A site checks for cycles only in descendents.
Centralized Algorithms
Ho-Ramamoorthy 2-phase Algorithm
Each site maintains a status table of all processes initiated at that
site: includes all resources locked & all resources being waited on.
Controller requests (periodically) the status table from each site.
Controller then constructs WFG from these tables, searches for
cycle(s).
If no cycles, no deadlocks.
Otherwise, (cycle exists): Request for state tables again.
Construct WFG based only on common transactions in the 2 tables.
If the same cycle is detected again, system is in deadlock.
Later proved: cycles in 2 consecutive reports need not result in a
deadlock. Hence, this algorithm detects false deadlocks.
Centralized Algorithms...
Ho-Ramamoorthy 1-phase Algorithm
Each site maintains 2 status tables: resource status table and
process status table.
Resource table: transactions that have locked or are waiting for
resources.
Process table: resources locked by or waited on by transactions.
Controller periodically collects these tables from each site.
Constructs a WFG from transactions common to both the
tables.
No cycle, no deadlocks.
A cycle means a deadlock.
Distributed Algorithms
Path-pushing: resource dependency information
disseminated through designated paths (in the graph).
Edge-chasing: special messages or probes circulated
along edges of WFG. Deadlock exists if the probe is
received back by the initiator.
Diffusion computation: queries on status sent to process
in WFG.
Global state detection: get a snapshot of the distributed
system. Not discussed further in class.
Edge-Chasing Algorithm
Chandy-Misra-Haas’s Algorithm:
A probe(i, j, k) is used by a deadlock detection process Pi. This
probe is sent by the home site of Pj to Pk.
This probe message is circulated via the edges of the graph.
Probe returning to Pi implies deadlock detection.
Terms used:
Pj is dependent on Pk if a sequence of blocked processes Pj, Pi1, .., Pim, Pk exists, in which each process waits on the next.
Pj is locally dependent on Pk, if above condition + Pj,Pk on
same site.
Each process maintains an array dependenti: dependenti(j) is
true if Pi knows that Pj is dependent on it. (initially set to
false for all i & j).
Chandy-Misra-Haas’s Algorithm
Sending the probe:
if Pi is locally dependent on itself then deadlock.
else for all Pj and Pk such that
(a) Pi is locally dependent upon Pj, and
(b) Pj is waiting on Pk, and
(c ) Pj and Pk are on different sites, send probe(i,j,k) to the home
site of Pk.
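A Python sketch of how a probe might be handled at the receiving process (the state tables are assumptions for illustration, not the slides' notation):

```python
def on_probe(probe, blocked, dependent, waits_for, home_site, send):
    """Pk's handling of probe (i, j, k).
    blocked[k]: is Pk blocked; dependent[k][i]: does Pk know Pi depends on it;
    waits_for[k]: processes Pk waits on; home_site[p]: site of process p."""
    i, j, k = probe
    if k == i:
        print(f"deadlock detected by P{i}")   # probe returned to its initiator
        return
    if not blocked[k] or dependent[k][i]:
        return                                # not blocked, or probe already seen
    dependent[k][i] = True
    for m in waits_for[k]:                    # propagate along outgoing WFG edges
        send(home_site[m], (i, k, m))
```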
Performance:
For a deadlock that spans m processes over n sites, m(n-1)/2 messages
are needed.
Size of the message 3 words.
Delay in deadlock detection O(n).
C-M-H Algorithm: Example
[Figure: WFG over processes P0–P7 spanning several sites; probe(1,3,4) and probe(1,7,1) travel along the edges, and probe(1,7,1) returning to P1 signals a deadlock.]
Diffusion-based Algorithm
Initiation by a blocked process Pi:
send query(i,i,j) to all processes Pj in the dependent set DSi of Pi;
num(i) := |DSi|; waiti(i) := true;
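The slides show only the initiation step; the sketch below fills in the usual query/reply handling of the diffusion computation for the OR model, as a hedged reconstruction under assumed state tables (all names illustrative, not from the slides):

```python
def on_query(i, j, k, state, send):
    """Process Pk receives query(i, j, k) for initiator Pi."""
    p = state[k]
    if not p["blocked"]:
        return
    if not p["waiting"].get(i):              # engaging query: first seen for Pi
        p["waiting"][i] = True
        p["engager"][i] = j
        p["num"][i] = len(p["deps"])         # expect one reply per dependent
        for m in p["deps"]:
            send("query", (i, k, m))
    else:                                    # non-engaging query: reply at once
        send("reply", (i, k, j))

def on_reply(i, j, k, state, send):
    """Process Pk receives reply(i, j, k)."""
    p = state[k]
    if not p["waiting"].get(i):
        return
    p["num"][i] -= 1
    if p["num"][i] == 0:
        if k == i:
            print(f"P{i}: deadlock (all queries answered)")
        else:
            send("reply", (i, k, p["engager"][i]))
```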
Diffusion Algorithm: Example
[Figure: diffusion computation over P1–P7: queries query(1,3,4) and query(1,7,1) propagate along WFG edges; replies reply(1,6,2) and reply(1,1,7) flow back toward P1.]
Engaging Query
How to distinguish an engaging query?
query(i,j,k) from the initiator contains a unique sequence
number for the query apart from the tuple (i,j,k).
This sequence number is used to identify subsequent queries.
(e.g.,) when query(1,7,1) is received by P1 from P7, P1 checks
the sequence number along with the tuple.
P1 understands that the query was initiated by itself and it is not
an engaging query.
Hence, P1 sends a reply back to P7 instead of forwarding the
query on all its outgoing links.
AND, OR Models
AND Model
A process/transaction can simultaneously request for multiple
resources.
Remains blocked until it is granted all of the requested resources.
Edge-chasing algorithm can be applied here.
OR Model
A process/transaction can simultaneously request for multiple
resources.
Remains blocked till any one of the requested resource is granted.
Diffusion based algorithm can be applied here.
Hierarchical Deadlock Detection
• Follows Ho-Ramamoorthy’s 1-phase algorithm. More than 1 control site
organized in hierarchical manner.
• Each control site applies 1-phase algorithm to detect (intracluster) deadlocks.
• Central site collects info from control sites, applies 1-phase algorithm to detect intercluster deadlocks.
[Figure: a central site above several control sites arranged hierarchically.]
Persistence & Resolution
Deadlock persistence:
Average time a deadlock exists before it is resolved.
Implication of persistence:
Resources unavailable for this period: affects utilization
Processes wait for this period unproductively: affects response time.
Deadlock resolution:
Aborting at least one process/request involved in the deadlock.
Efficient resolution of deadlock requires knowledge of all processes
and resources.
If every process detects a deadlock and tries to resolve it
independently -> highly inefficient ! Several processes might be
aborted.
Deadlock Resolution
Priorities for processes/transactions can be useful for
resolution.
Consider priorities introduced in Obermarck’s algorithm.
Highest priority process initiates and detects deadlock (initiations
by lower priority ones are suppressed).
When deadlock is detected, lowest priority process(es) can be
aborted to resolve the deadlock.
After identifying the processes/requests to be aborted,
All resources held by the victims must be released. State of
released resources restored to previous states. Released resources
granted to deadlocked processes.
All deadlock detection information concerning the victims must be
removed at all the sites.
Distributed File System
File system spread over multiple, autonomous computers.
A distributed file system should provide:
Network transparency: hide the details of where a file
is located.
High availability: ease of accessibility irrespective of
the physical location of the file.
This objective is difficult to achieve because the distributed
file system is vulnerable to problems in underlying networks
as well as crashes of systems that are the “file sources”.
Replication / mirroring can be used to alleviate the above
problem.
However, replication/mirroring introduces additional issues
such as consistency.
DFS: Architecture
In general, files in a DFS can be located in “any” system.
We call the “source(s)” of files to be servers and those
accessing them to be clients.
Potentially, a server for a file can become a client for
another file.
However, most distributed systems distinguish between
clients and servers in more strict way:
Clients simply access files and do not have/share local files.
Even if clients have disks, they (disks) are used for swapping,
caching, loading the OS, etc.
Servers are the actual sources of files.
In most cases, servers are more powerful machines (in terms of
CPU, physical memory, disk bandwidth, ..)
DFS: Architecture …
[Figure: several servers and clients connected by a computer network.]
DFS Data Access
[Figure: data access flow. A request to access data first checks the client cache; if the data is present, it is returned to the client. If not, the local disk (if any) is checked; on a further miss the request goes to the server, which checks the server cache and, if needed, issues a disk read, loads the server cache, and returns the data, which is then loaded into the client cache.]
Name Space Hierarchy
[Figure: name space hierarchy. Server X holds the root (/) with entries a, b, c; mount points lead to Server Y (directories d, e, f) and Server Z (directories g, h, i).]
Caching
Performance of distributed file system, in terms of response
time, depends on the ability to “get” the files to the user.
When files are in different servers, caching might be needed
to improve the response time.
A copy of data (in files) is brought to the client (when
referenced). Subsequent data accesses are made on the client
cache.
Client cache can be on disk or main memory.
Data cached may include future blocks that may be
referenced too.
Caching implies DFS needs to guarantee consistency of data.
Hints
Hints can be used when cached data need not be completely accurate.
Example: Mapping of the name of a file/directory to the
actual physical device. The address/name of device can be
stored as a hint.
If this address fails to access the requested file, the cached
data can be purged.
The file server can refer to a name server, determine the
actual location of file/directory, and update the cache.
In hints, a cache is neither updated nor invalidated when a
change occurs to the content.
Design Issues
Naming: Locating the file/directory in a DFS based on
name.
Location of cache: disk, main memory, both.
Writing policy: Updating original data source when cache
content gets modified.
Cache consistency: Modifying cache when data source
gets modified.
Availability: More copies of files/resources.
Scalability: Ability to handle more clients/users.
Semantics: Meaning of different operations (read, write,
…)
Naming
Name space: (e.g.,) /home/students/jack, /home/staff/jill.
Name space is a collection of names.
Location transparency: file names do not indicate their
physical locations.
Name resolution: mapping name space to an
object/device/file/directory.
Naming approaches:
Simple Concatenation: add hostname to file names.
Guarantees unique names.
Naming: Approaches ...
Naming approaches (continued):
Mounting: mount remote directories to local ones. Location
transparent after mounting. (followed in Sun NFS).
Example: /students is mounted at /home.
Naming: Context
Context: identifying the name space within which name
resolution is to be done.
Example: context using ~ (tilde).
~jill/t: /home/staff/jill/t
~john/t: /home/students/john/t
~name: represents the directory structure associated with a
person or a project.
Whenever file “t” is accessed, it is interpreted with reference
to ~’s environment.
~ helps when different clients mount in different ways while still sharing the same set of users and their home directories.
(e.g.,) ~john may be mapped to /home/students/john in client
1 and to /usr/students/john in client 2.
Name Resolution
Done by name servers that map file names to actual files.
Centralized name server: send names to the server and get
the path of servers+devices that lead to the requested file.
Name server becomes a bottleneck.
Distributed name server: (e.g.,) consider access to a file
/a/b/c/d/e
Local name server identifies the remote server that handles the
part /b/c/d/e
This procedure may be recursively done till ../e is resolved.
Caching
In main memory:
Faster than disks.
Diskless workstations can also cache.
Server-cache is in main memory -> same design can be used in
clients also.
Disadvantage: clients need main memory for virtual memory
management too.
In disks:
Large files can be cached.
Virtual memory management is straight forward.
After caching the necessary files, the client can get
disconnected from network (if needed, for instance, to help its
mobility).
Writing Policy
When should a modified cache content be transferred to
the server?
Write-through policy:
Immediate writing at server when cache content is modified.
Advantage: reliability, crash of cache (client) does not mean loss
of data.
Disadvantage: Several writes for each small change.
Delayed writing policy:
Write at the server, after a delay.
Advantage: small/frequent changes do not increase network
traffic.
Disadvantage: less reliable, susceptible to client crashes.
Write at the time of file closing.
Cache Consistency
When should a modified source content be transferred to the
cache?
Server-initiated policy:
Server cache manager informs client cache managers that can then
retrieve the data.
Client-initiated policy:
Client cache manager checks the freshness of data before delivering to
users. Overhead for every data access.
Concurrent-write sharing policy:
Multiple clients open the file, at least one client is writing.
File server asks other clients to purge/remove the cached data for the
file, to maintain consistency.
Sequential-write sharing policy: a client opens a file that was
recently closed after writing.
Cache Consistency ...
Sequential-write sharing policy: a client opens a file that
was recently closed after writing.
This client may have outdated cache blocks of the file (since the
other client might have modified the file contents).
Use time stamps for both the cache and the files. Compare the time stamps to decide whether the cached blocks are up to date.
Availability
Intention: overcome the failure of servers or network
links.
Solution: replication, i.e., maintain copies of files at
different servers.
Issues:
Maintaining consistency
Detecting inconsistencies, if they happen despite best efforts.
Possible reasons for such inconsistencies:
Replica is not updated due to a server failure or a broken
network link.
Inconsistency problems and their recovery may reduce the
benefit of replication.
Availability: Replication
Unit of replication: is mostly a file.
Replicas of a file in a directory may be handled by different
servers, requiring extra name resolutions to locate the replicas.
Replication unit: group of files:
Advantage: process of name resolution, etc., to locate replicas
can be done for a set of files and not for individual files.
Disadvantage: wasteful of disk space if only a few of this group of files are needed by users often.
Replica Management
Two-phase commit protocols can be used to update all
replicas.
Other schemes:
Weighted votes:
A certain number of votes r or w is to be obtained before
reading or writing.
Current synchronization site (CSS):
Designate a process/site to control the modifications.
Scalability
Ease of adding more servers and clients with respect to the
problems / design issues discussed before such as caching,
replication management, etc.
Server-initiated cache invalidation scales up better.
Using the clients' caches:
A server serves only X clients.
New clients (after the first X) are informed of the X clients from
whom they can get the data (sort of chaining/hierarchy).
Cache misses & invalidations are propagated up and down this
hierarchy, i.e., each node serves as a mini-file server for its
children.
Structure of a server:
I/O operations through threads (light weight processes) can help in
handling more clients.
Semantics
What is the effect / meaning of an operation?
(e.g.,) read returns the data due to latest write operation.
Guaranteeing the above semantics in the presence of
caching can be difficult.
We saw techniques for these under caching.
Case Study: Sun NFS
Major goal: keep the distributed file system independent of
underlying hardware and operating system.
NFS (Network File System): uses the Remote Procedure
Call (RPC) for remote file operations.
Virtual file system (VFS) interface: provides uniform,
virtual file operations that are mapped to the actual file
system. (e.g.,) VFS can be mapped to DOS, so NFS can
work with PCs.
VFS uses a structure called vnode (virtual node) that is
unique in a NFS.
Each vnode has a mount table that provides a pointer to its
parent file system and to the system over which it is
mounted.
Sun NFS...
A vnode can be a mount point.
Using mount tables, VFS interface can distinguish
between local and remote file systems.
Requests to remote files are routed to the NFS by the VFS
interface.
RPCs are used to reach remote VFS interface.
Remote VFS invokes appropriate local file operation.
Sun NFS Architecture
[Figure: NFS architecture. The client's kernel exposes an OS interface and communicates over the network with the server.]
NFS: Naming & Location
Each client can configure its file system independent of
others. i.e., different clients can see different name spaces.
Name resolution example:
Look up for a/b/c. a corresponds to vnode1 (assume).
Look up on vnode1/b returns vnode2 that might say the object is
on server X.
Look up on vnode2/c is sent to X. X returns a file handle (if the
file exists, permission matches, etc).
File handle is used for subsequent file operations.
Name resolution in NFS is an iterative process (slow).
Name space information is not maintained at each server as
the servers in NFS are stateless (to be discussed later).
NFS: Caching
NFS Client Cache:
File blocks: cached on demand.
Employs read-ahead. Large block sizes (8 Kbytes) are used for data transfer, and the transferred blocks are cached.
Cached blocks are valid for a certain period, after which validation is required.
NFS: Caching
NFS Client Cache: ....
Attributes of files & directories:
Attribute inquiries form 90% of calls made to servers.
NFS: Stateless Server
NFS servers are stateless to help crash recovery.
Stateless: no record of past requests (e.g., whether file is open,
position of file pointer, etc.,).
Client requests contain all the needed information. No
response, client simply re-sends the request.
After a crash, a stateless server simply restarts. No need to:
Restore previous transaction records.
Update clients or negotiate with clients on file status.
Disadvantages:
Client message sizes are larger.
Server cache management difficult since server has no idea on which
files have been opened/closed.
Server can provide little information for file sharing.
Un/mounting in NFS
Mounting of files in Unix is done by using a mount table
stored in a file: /etc/mnttab.
mnttab is read by programs using procedures such as
getmntent.
mount command adds an entry in mnttab, i.e., every time a
file system is mounted in the system.
umount command removes an entry in mnttab, i.e., every
time a file system is unmounted from the system.
Un/Mounting
First entry in mnttab: file system that was mounted first.
Usually, file systems get mounted at boot time.
Mount: the term presumably derives from mounting tapes onto systems.
Each entry is a line of fields separated by spaces in the form:
<special> <mount_point> <fstype> <options> <time>
<special>: The name of the resource to be mounted.
<mount_point> : pathname of the directory on which the filesystem is
mounted.
<fstype> : file system type of the mounted file system.
<options> : mount options.
<time> : time at which the file system was mounted.
Entries for <special>: path-name of a block-special device (e.g.,
/dev/fd0), the name of a remote filesystem (casa:/export/home, i.e.,
host:pathname), or the name of a swap file.
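For instance, an NFS entry in mnttab might look like this (hypothetical values):
  casa:/export/home /home nfs rw,suid 865923337
Here <special> is casa:/export/home, /home is the mount point, nfs the fstype, rw,suid the options, and 865923337 the mount time (in seconds since the epoch).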
Sharing Filesystems
In SunOS, share command is used to specify the file systems that can be
mounted by other systems.
(e.g.), share [ -F FSType ] [ -o specific_options ] [-d
description ] [ pathname ]
Share command makes a resource available to remote system, through a
file system of FSType.
<specific_options> : control access of the shared resource.
rw: pathname is shared read/write to all clients. This is also the default behavior.
rw=client[:client]...: pathname is shared read/write only to the listed clients. No other systems can access pathname.
ro: pathname is shared read-only to all clients.
ro=client[:client]...: pathname is shared read-only only to the listed clients. No other systems can access pathname.
Sharing Filesystems…
<-d description>: -d flag may be used to provide a description of the
resource being shared.
Example : To share the /disk file system read-only at boot time.
share -F nfs -o ro /disk
share -F nfs -o rw=usera:userb /somefs
Multiple share commands on same file system? : Last command
supersedes.
Try:
/etc/dfs/dfstab: list of share commands to be executed at boot time
/etc/dfs/fstypes: list of file system types, NFS by default
/etc/dfs/sharetab: system record of shared file systems.
Automounting
mount a remote file system only when it is accessed, perhaps for a
guessed duration of time.
automount utility: installs autofs mount points and associates an
automount map with each mount point.
autofs file system monitors attempts to access directories within it and
notifies the automountd daemon.
automountd uses the map to locate a file system. Then mounts at the
point of reference within the autofs file system.
A map can be assigned to an autofs mount using an entry in the
/etc/auto_master map or a direct map.
If a file system is not accessed within an appropriate interval (10 minutes by default), the automountd daemon unmounts the file system.
Cluster File System
System Model: a set of storage devices that can be
accessed by a set of workstations.
[Figure: systems 1 to n accessing a shared set of storage devices.]
Storage Virtualization
Storage virtualization means a logical representation of the physical resources: storage devices & workstations.
Virtualization specifies details such as which devices are
meant for which host, how they can be shared, etc.
Possible places for virtualization: (each choice has its own
advantages and disadvantages)
Workstations or hosts: volume managers (software) are run on the hosts.
Storage Virtualization...
Possible places for virtualization (continued):
In the storage subsystem: associated with large-scale RAID subsystems.
Appliance example: NAS (Network Attached Storage).
Veritas Volume Manager
Works on both Unix and Windows
Builds a diskgroup spanning multiple devices.
Dynamic diskgroups management
Striping of data on multiple RAIDs.
Striping distributes data on multiple disks and hence
increases the disk bandwidth for retrieval. Suitable for
multimedia data.
Cluster Volume Manager:
Allows a volume to be simultaneously mounted for use across
multiple servers for both reads and writes.
Veritas Cluster Server
Cluster server handles up to 32 systems.
It monitors, controls, and restarts applications in response to a variety of events.
(e.g.,) application A1 may be started on system n if system 1 fails. Disk group D1 will be automatically reassigned to system n.
(e.g.,) disk group D2 may be assigned to system 1 if D1 fails, and application A1 will continue.
[Figure: systems S1 to Sn with disk groups D1 and D2.]
Service Groups
A set of resources working together to provide application
services to clients.
Service group example:
Disk groups having data
Volume built using disk group
File system (directories) using the volume
Servers/systems providing the application
Application program + libraries
Types of Service Groups:
Failover Groups: runs on 1 system in a cluster at a time. Used for
applications that are not designed to maintain data consistency on
multiple copies.
Cluster server monitors the heart beat of the system. If it fails,
the backup is brought on-line.
Service Groups...
Types of Service Groups...:
Parallel groups: run concurrently on more than 1 system.
Time-to-recovery:
On a failure, an application service is moved to another server in
the cluster.
Disk groups are de-imported from the crashed server and
imported by the back-up server.
Volume manager helps to manage the disk group ownership and
accelerate recovery process of the cluster.
New ownership properties are broadcast to the cluster to ensure
data security.
Time-to-recovery: the time taken to bring the back-up online.
Disaster Tolerance
More than 1 cluster connected by very high speed networks
over a wide area network.
Cluster 1 and 2 geographically distributed.
[Figure: Cluster 1 and Cluster 2, geographically distributed, connected over a wide-area network.]
Veritas Volume Replicator
Redundant copy of application in another cluster must be
kept up-to-date.
Volume Replicator allows a disk group to be replicated at 1
or more remote clusters.
Initialization of replication: entire disk group is replicated.
Runtime: only modifications to data are communicated.
Conserves network bandwidth.
Disk groups at the remote cluster are not usually active.
Identical instance of application is run on the remote cluster
in idle mode.
Disaster is identified by volume replicator using heart beats.
Puts remote cluster on-line for the applications.
Time-to-recovery: less than 1 minute.
Distributed Shared Memory
DSM provides a virtual address space that is shared among
all nodes in the distributed system.
Programs access DSM just as they do locally.
An object/data can be owned by a node in DSM. Initial
owner can be the creator of the object/data.
Ownership can change when data moves to other nodes.
A process accessing a shared object gets in touch with a
mapping manager who maps the shared memory address to
the physical memory.
Mapping manager: a layer of software, perhaps bundled
with the OS or as a runtime library routine.
Distributed Shared Memory...
[Figure: nodes 1 to n sharing a common shared-memory address space.]
DSM Advantages
Parallel algorithms can be written in a transparent manner
using DSM. Using message passing (e.g., send, receive),
the parallel programs might become even more complex.
Difficult to pass complex data structures with message
passing primitives.
Entire block/page of memory along with the reference
data /object can be moved. This can help in easier
referencing of associated data.
DSM cheaper to build compared to tightly coupled
multiprocessor systems.
DSM Advantages…
Fast processors and high speed networks can help in
realizing large sized DSMs.
Programs using large DSMs may not need as many disk swaps as
in the case of local memory usage.
This can offset the overhead due to communication delay in
DSMs.
Tightly coupled multiprocessor systems access main
memory via a common bus. So the number of processors
limited to a few tens. No such restriction in DSM systems.
Programs that work on multiprocessor systems can be
ported or directly work on DSM systems.
DSM Algorithms
Issues:
Keeping track of remote data locations
Overcoming/reducing communication delays and protocol
overheads when accessing remote data.
Making shared data concurrently available to improve
performance.
Types of algorithms:
Central-server
Data migration
Read-replication
Full-replication
Central Server Algorithm
[Figure: central-server algorithm. Clients send data access requests to a central server that holds the data.]
Migration Algorithm
[Figure: migration algorithm. Node i sends a data access request to node j, and the data migrates from node j to node i.]
Migration Algorithm …
Migration algorithm can be combined with virtual memory.
(e.g.,) if a page fault occurs, check memory map table.
If map table points to a remote page, migrate the page
before mapping it to the requesting process’s address space.
Several processes can share a page at a node.
Locating remote page:
Use a server that tracks the page locations.
Use hints maintained at nodes. Hints can direct the search for a
page toward the node holding the page.
Broadcast a query to locate a page.
Read-replication Algorithm
• Extend migration algorithm: replicate data at multiple nodes for read
access.
• Write operation:
• invalidate all copies of shared data at various nodes.
• (or) update with modified value
[Figure: write operation in the read-replication algorithm. Node i sends a data access request to node j; the data is replicated at node i, and the other copies are invalidated.]
• Read cost low, write cost higher.
Full-replication Algorithm
• Extension of read-replication algorithm: allows multiple sites to have both
read and write access to shared data blocks.
• One mechanism for maintaining consistency: gap-free sequencer.
[Figure: write operation in the full-replication algorithm; clients send write requests to the sequencer, which multicasts sequenced updates to all copies.]
B. Prabhakaran 10
Full-replication Sequencer
Nodes modifying data send request (containing
modifications) to sequencer.
Sequencer assigns a sequence number to the request and
multicasts it to all nodes that have a copy of the shared data.
Receiving nodes process requests in order of sequence
numbers.
Gap in sequence numbers? : modification requests might be
missing.
Missing requests will be reported to sequencer for
retransmission.
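A minimal sketch of the gap-detection logic at a replica, assuming the sequencer stamps updates with consecutive numbers starting at 0 (names are illustrative):

# Sequencer sketch: updates are applied strictly in sequence-number order;
# a gap means a modification request is missing and must be re-requested.
class Replica:
    def __init__(self):
        self.next_seq = 0
        self.pending = {}                  # out-of-order requests, keyed by seq

    def receive(self, seq, update):
        self.pending[seq] = update
        while self.next_seq in self.pending:        # apply the in-order prefix
            self.apply(self.pending.pop(self.next_seq))
            self.next_seq += 1
        missing = [s for s in range(self.next_seq, max(self.pending, default=-1))
                   if s not in self.pending]
        if missing:
            print("gap detected, ask sequencer to retransmit", missing)

    def apply(self, update):
        print("applied", update)

r = Replica()
r.receive(0, "w0")
r.receive(2, "w2")    # gap: seq 1 missing, retransmission requested
r.receive(1, "w1")    # fills the gap; w1 and w2 are both applied in order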
B. Prabhakaran 11
Memory Coherence
Coherence: value returned by a read operation is the one
expected by a programmer (e.g., the value of the latest write
operation).
Strict Consistency: a read returns the most recently written
value.
Requires total ordering of requests which implies significant
overhead for mechanisms such as synchronization.
Sometimes strict consistency may not be needed.
Sequential Consistency: result of any execution of the
operations of all processors is the same as if they were
executed in a sequential order + the operations of each
processor appear in this sequence, in the order specified by
the program.
B. Prabhakaran 12
Memory Coherence…
General Consistency: all copies of a memory location have
the same data after all writes of every processor are over.
Processor Consistency: Writes issued by a processor
observed in the same order in which they were issued.
(ordering among any 2 processors may be different).
Weak Consistency: Synchronization operations are
guaranteed to be sequentially consistent.
i.e., treat shared data as critical sections. Use synchronization/
mutual exclusion techniques to access shared data.
Maintaining consistency: programmer’s responsibility.
Release Consistency: synchronization accesses are only
processor consistent with respect to each other.
B. Prabhakaran 13
Coherence Protocols
Write-invalidate Protocol: invalidate all copies except the
one being modified before the write can proceed.
Once invalidated, data copies cannot be used.
Disadvantages:
invalidation sent to all nodes regardless of whether the nodes
will be using the data copies.
Inefficient if many nodes frequently refer to the data: after
invalidation message, there will be many requests to copy the
updated data.
Used in several systems including those that provide strict
consistency.
Write-update Protocol: causes all copies of shared data to be
updated. More difficult to implement, and guaranteeing
consistency may be harder (reads may happen in
between write-updates).
B. Prabhakaran 14
Granularity & Replacement
Granularity: size of the shared memory unit.
For better integration of DSM and local memory management:
DSM page size can be multiple of the local page size.
Integration with local memory management provides built-in
protection mechanisms to detect faults, and to prevent and recover
from inappropriate references.
Larger page size:
More locality of references.
Less overhead for page transfers.
Disadvantage: more contention for page accesses.
Smaller page size:
Less contention. Reduces false sharing, which occurs when 2
processors access 2 different, unshared data items and contend
only because the items lie on the same page.
B. Prabhakaran 15
Granularity & Replacement…
Page replacement needed as physical/main memory is
limited.
Data may be used in many modes: shared, private, read-
only, writable,…
Least Recently Used (LRU) replacement policy cannot be
directly used in DSMs supporting data movement. Modified
policies more effective:
Private pages may be removed ahead of shared ones as shared
pages have to be moved across the network
Read-only pages can be deleted as owners will have a copy
A page to be replaced should not be lost forever.
Swap it onto local disk.
Send it to the owner.
Use reserved memory in each node for swapping.
B. Prabhakaran 16
Case Studies
Cache coherence protocol:
PLUS System
Munin System
General Distributed Shared Memory:
IVY (Integrated Shared Virtual Memory at Yale)
B. Prabhakaran 17
PLUS System
PLUS system: write-update protocol and supports general
consistency.
Memory Coherence Manager (MCM) manages cache.
Unit of replication: a page (4 Kbytes); unit of memory
access and coherence maintenance: one 32-bit word.
A virtual page corresponds to a list of replicas of a page.
One of the replicas is designated as the master copy.
A distributed linked list (the copy-list) identifies the replicas of a
page. The copy-list has 2 pointers:
Master pointer
Next-copy pointer
B. Prabhakaran 18
PLUS: RW Operations
Read fault?:
If address points to local memory, read it. Otherwise, local MCM
sends a read request to its counterpart at the specified remote
node.
Data returned by remote MCM passed back to the requesting
processor.
Write operation:
Always on master copy then propagated to copies linked by the
copy-list.
Write fault: the memory location being written into is not local.
On write fault: update request sent to the remote node pointed to
by MCM.
If the remote node does not have the master copy, update request
sent to the node with master copy and for further propagation.
B. Prabhakaran 19
PLUS Write-update Protocol
[Figure: PLUS write-update on word X of page p, replicated on nodes 1-4 with the
master copy on node 2. Recovered steps: 1. MCM sends the write request to node 2.
2. Update message to the master node. 3. MCM updates X. 4. Update message to the
next copy on the copy-list. 5. MCM updates X. 6. Update message to the next copy.
7. Update X. 8. MCM sends ack: update complete.]
B. Prabhakaran 20
PLUS: Protocol
Node issuing write is not blocked on write operation.
However, a read on that location (being written into) gets
blocked till the whole update is completed. (i.e., remember
pending writes).
Strong ordering within a single processor independent of
replication (in the absence of concurrent writes by other
processors), but not with respect to another processor.
write-fence operation: strong ordering with synchronization
among processors. MCM waits for previous writes to
complete.
B. Prabhakaran 21
Munin System
Use application-specific semantic information to classify shared
objects. Use class-specific handlers.
Shared object classes:
Write-once objects: written at the start, read many times after that.
Replicated on-demand, accessed locally at each site. Large object?
Portions can be replicated instead of whole object.
Private objects: accessed by a single thread. Not managed by coherence
manager unless accessed by a remote thread.
Write-many objects: modified by multiple threads between
synchronization points. Munin employs delayed updates. Updates are
propagated only when thread synchronizes. Weak consistency.
Result objects: Assumption is concurrent updates to different parts of a
result object will not conflict and object is not read until all parts are
updated -> delayed update can be efficient.
Synchronization objects: (e.g.,) distributed locks for giving exclusive
access to data objects.
B. Prabhakaran 22
Munin System…
Migratory objects: accessed in phases, where each phase is a series
of accesses by a single thread: lock + movement, i.e., the object
migrates to the node requesting the lock.
Producer-consumer objects: written by 1 thread, read by another.
Strategy: move the object to the reading thread in advance.
Read-mostly object: i.e., writes are infrequent. Use broadcasts to
update cached objects.
General read-write objects: does not fall into any of the above
categories: Use Berkeley ownership protocol supporting strict
consistency. Objects can be in states such as:
Invalid: no useful data.
Unowned: has valid data. Other nodes have copies of the object and the
object cannot be updated without first acquiring ownership.
Owned exclusively: Can be updated locally. Can be replicated on-
demand.
Owned non-exclusively: Cannot be updated before invalidating other
copies.
B. Prabhakaran 23
IVY
IVY: Integrated Shared Virtual Memory at Yale. On Apollo
DOMAIN environment on a token ring network.
Granularity of access: a page of 1 KByte.
Address space of a processor: shared virtual memory +
private space.
On a page fault: if the page is not local, it is acquired
through remote memory request and made available to other
processes in the node as well.
Coherence protocol: multiple readers-single writer
semantics. Using a write-invalidation protocol, all read-only
copies are invalidated before writing.
B. Prabhakaran 24
Write-invalidation Protocol
Processor i has a write-fault on page p:
i finds owner of p
p’s owner sends the page + copyset (i.e., the processors having
read-only copies). Owner marks its page table entry of p as nil.
i sends invalidation messages to all processors in the copyset.
Processor i has a read fault to a page p:
i finds the owner of p.
p’s owner sends p to i and adds i to the copyset of p.
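A minimal sketch of the two fault handlers above, assuming pages carry an owner and a copyset (the invalidation message is just a print here):

# IVY fault handling sketch (single writer, multiple readers per page):
class Page:
    def __init__(self, owner):
        self.owner = owner                 # processor with the writable copy
        self.copyset = set()               # processors with read-only copies

def write_fault(page, i):
    for j in page.copyset - {i}:
        print(f"invalidate read-only copy at {j}")
    page.copyset = set()
    page.owner = i                         # old owner marks its entry nil

def read_fault(page, i):
    page.copyset.add(i)                    # owner adds i to the copyset
    print(f"{page.owner} sends read-only copy to {i}")

p = Page(owner="P1")
read_fault(p, "P2"); read_fault(p, "P3")
write_fault(p, "P2")                       # P3's copy invalidated, P2 owns p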
Implementation schemes:
Centralized manager scheme
Fixed distributed manager scheme
Dynamic distributed manager scheme
Above schemes differ on how the owner of a page is
identified
B. Prabhakaran 25
Centralized Manager
Central manager on a node maintains all page ownership
information.
Page faulting processor contacts the central manager and
requests a page copy.
Central manager forwards the request to the owner. For
write operations, the central manager updates the page
ownership information, making the requestor the new owner.
Page owner sends a page copy to the faulting processor.
For reads, the faulting processor is added to copyset.
For writes, owner is faulting processor now.
Central manager requires 2 messages to locate the page
owner.
B. Prabhakaran 26
Fixed Distributed Manager
Every processor keeps track of a pre-determined set of pages
(determined by a hashing/mapping function H).
Processor i faults on p: i contacts processor H(p) for a copy
of the page.
Rest of the steps proceeds as in centralized manager.
In both the above schemes, concurrent access requests to a
page are serialized at the site of a manager.
B. Prabhakaran 27
Dynamic Distributed Manager
Every host keeps track of the page ownership in its local
page table.
Page table has a column probowner (probable owner) whose
value can either be the true owner or a probable one (i.e., it
is used as a hint). Initially set to a default value.
On a page fault, suppose the request is sent to node i. If i is the
owner, steps proceed as in the case of the central manager.
If not, the request is forwarded to the probowner of p at i. This is
done until the actual owner is reached.
probowner is updated on receipt of:
Invalidation requests, ownership relinquishing messages.
Receiving or forwarding of a page:
For writes, the receiver is the owner. Forwarder updates the owner as
the receiver.
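A minimal sketch of owner location via probowner chains; refreshing the hints of visited nodes (path compression) is one reasonable variant, not necessarily the exact update rule above:

# Dynamic distributed manager sketch: follow probowner hints until the
# true owner (a node whose hint points to itself) is reached.
probowner = {"P1": "P2", "P2": "P3", "P3": "P3"}   # P3 is the true owner

def find_owner(start):
    path, node = [], start
    while probowner[node] != node:         # forward the request along hints
        path.append(node)
        node = probowner[node]
    for visited in path:                   # refresh stale hints to the owner
        probowner[visited] = node
    return node

print(find_owner("P1"))   # "P3"; at most N-1 forwardings in the worst case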
B. Prabhakaran 28
Dynamic Distributed Manager…
[Figure: page requests being forwarded along probowner hints. Example page
tables (page : probowner) at each processor:]
Page   Proc. 1   Proc. 2   Proc. 3   Proc. M
1      1         1         2         2
2      3         3         3         3
3      2         3         3         2
...
n      k         k         k         k
B. Prabhakaran 29
Dynamic Distributed Manager …
At most (N-1) messages needed to locate the owner.
As hints are updated as a side effect, the average number of
messages should be lower.
Double fault:
Consider a processor p doing a read first and then a write.
p needs to get the page twice.
One solution:
Use sequence numbers along with page transfers.
p can send its page sequence number to the owner. The owner can
compare the numbers and decide whether a transfer is needed or
not.
Still checking with owner is needed, only transfer of whole page
is avoided.
B. Prabhakaran 30
Memory Allocation
Centralized scheme for memory allocation. A central
manager allocates and deallocates memory.
A 2-level procedure may be more efficient:
Central manager allocates a large chunk to local processors.
Local manager handles local allocations.
B. Prabhakaran 31
Process Synchronization
Needed to serialize concurrent accesses to a page.
IVY uses eventcounts and provides 4 primitives:
Init(ec): initialize an event count.
Read(ec): returns the value of an event count.
Await(ec, value): calling process waits until ec reaches (i.e., is at
least) the specified value.
Advance(ec): increments ec by one and wakes up waiting
processes.
Primitives implemented on shared virtual memory:
Any process can use eventcount (after initialization) without
knowing its location.
When the page with eventcount is received by a processor,
eventcount operations are local to that processor and any number
of processes can use it.
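A minimal local sketch of the four primitives using a condition variable; Init corresponds to the constructor, and the page-based distribution of the eventcount is not modeled:

import threading

class EventCount:
    def __init__(self):                    # Init(ec)
        self.value = 0
        self.cond = threading.Condition()

    def read(self):                        # Read(ec)
        with self.cond:
            return self.value

    def await_(self, value):               # Await(ec, value)
        with self.cond:
            while self.value < value:      # wait until ec reaches value
                self.cond.wait()

    def advance(self):                     # Advance(ec)
        with self.cond:
            self.value += 1                # atomic increment
            self.cond.notify_all()         # wake up waiting processes

ec = EventCount()
threading.Thread(target=lambda: (ec.await_(1), print("woke up"))).start()
ec.advance()                               # waiter resumes once ec reaches 1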
B. Prabhakaran 32
Process Synchronization...
Eventcount operations are atomic.
using test-and-set instructions
disallowing transfer of memory pages with eventcounts when
primitives are being executed.
B. Prabhakaran 33
Distributed Scheduling
Motivation: A distributed system may have a mix of
heavily and lightly loaded systems. Hence, migrating a task
to share or balance load can help.
Let P be the probability that the system is in a state in
which at least 1 task is waiting for service and at least 1
server is idle.
Let ρ be the utilization of each server.
We can estimate P using probabilistic analysis and plot a
graph against system utilization.
For moderate system utilization, value of P is high, i.e., at
least 1 node is idle.
Hence, performance can be improved by sharing of tasks.
B. Prabhakaran 1
Distributed Scheduling ...
[Figure: probability P versus server utilization ρ for N = 5, 10, and 20 servers; P stays high (near 1.0) over a broad range of moderate utilizations.]
B. Prabhakaran 2
What is Load?
Load on a system/node can correspond to the queue length
of tasks/ processes that need to be processed.
Queue length of waiting tasks: proportional to task response
time, hence a good indicator of system load.
Distributing load: transfer tasks/processes among nodes.
If a task transfer (from another node) takes a long time, the
node may accept more tasks during the transfer time.
Causes the node to be highly loaded. Affects performance.
Solution: artificially increment the queue length when a task
is accepted for transfer from a remote node (to account for the
anticipated increase in load).
Task transfer can fail? : use timeouts.
B. Prabhakaran 3
Types of Algorithms
Static load distribution algorithms: Decisions are hard-
coded into an algorithm with a priori knowledge of system.
Dynamic load distribution: use system state information
such as task queue length, processor utilization.
Adaptive load distribution: adapt the approach based on
system state.
(e.g.,) Dynamic distribution algorithms collect load information
from nodes even at very high system loads.
Load information collection itself can add load on the system as
messages need to be exchanged.
Adaptive distribution algorithms may stop collecting state
information at high loads.
B. Prabhakaran 4
Balancing vs. Sharing
Load balancing: Equalize load on the participating nodes.
Transfer tasks even if a node is not heavily loaded so that queue
lengths on all nodes are approximately equal.
More task transfers, which might degrade performance.
Load sharing: Reduce burden of an overloaded node.
Transfer tasks only when the queue length exceeds a certain
threshold.
Fewer task transfers.
Anticipatory task transfers: transfer from overloaded nodes
to ones that are likely to become idle/lightly loaded.
More like load balancing, but with possibly fewer transfers.
B. Prabhakaran 5
Types of Task Transfers
Preemptive task transfers: transfer tasks that are partially
executed.
Expensive as it involves collection of task states.
Task state: virtual memory image, process control block, IO
buffers, file pointers, timers, ...
Non-preemptive task transfers: transfer tasks that have not
begun execution.
Do not require transfer of task states.
Can be considered as task placements. Suitable for load sharing
not for load balancing.
Both transfers involve information on user’s current
working directory, task privileges/priority.
B. Prabhakaran 6
Algorithm Components
Transfer policy: to decide whether a node needs to transfer
tasks.
Thresholds, perhaps in terms of number of tasks, are generally used.
(Another threshold can be processor utilization).
When a load on a node exceeds a threshold T, the node becomes a
sender. When it falls below a threshold, it becomes a receiver.
Selection Policy: to decide which task is to be transferred.
Criteria: task transfer should lead to reduced response time, i.e.,
transfer overhead should be worth incurring.
Simplest approach: select newly originated tasks. Transfer costs lower
as no state information is to be transferred. Non-preemptive transfers.
Other factors for selection: smaller tasks have less overhead.
Location-dependent system calls minimal (else, messages need to be
exchanged to perform system calls at the original node).
B. Prabhakaran 7
Algorithm Components...
Location Policy: to decide the receiving node for a task.
Polling is generally used. A node polls/checks whether another is
suitable and willing.
Polling can be done serially or in parallel (using multicast).
Alternative: broadcasting a query, sort of invitation to share load.
Information policy: for collecting system state information.
The collected information is used by transfer, selection, and
location.
Demand-driven Collection: Only when a node is highly or lightly
loaded, i.e., when a node becomes a potential sender or receiver.
Can be sender-initiated, receiver-initiated, or both (symmetric).
B. Prabhakaran 8
System Stability
Unstable system: long term arrival rate of work to a system
is greater than the CPU power.
Load sharing/balancing algorithm may add to the system
load making it unstable. (e.g.,) load information collection
at high system loads.
Effectiveness of algorithm: Effective if it improves the
performance relative to that of a system not using it.
(Effective algorithm cannot be unstable).
Load balancing algorithms should avoid fruitless actions.
(e.g.,) processor thrashing: task transfer makes the receiver
highly loaded, so the task gets transferred again, perhaps
repeatedly.
B. Prabhakaran 9
Load Distributing Algorithms
Sender-initiated: distribution initiated by an overloaded
node.
Receiver-initiated: distribution initiated by lightly loaded
nodes.
Symmetric: initiated by both senders and receivers. Has
advantages and disadvantages of both the approaches.
Adaptive: sensitive to state of the system.
B. Prabhakaran 10
Sender-initiated
Transfer Policy: Use thresholds.
Sender if queue length exceeds T.
Receiver if accepting a task will not make queue length exceed T.
Selection Policy: Only newly arrived tasks.
Location Policy:
Random: Use no remote state information. Task transferred to a node
at random.
No need for state collection. Unnecessary task transfers (processor
thrashing) may result.
Shortest: Poll a set of nodes. Select the receiver with the shortest task
queue length.
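A minimal sketch of the sender's decision, combining the threshold transfer policy with the "shortest" location policy (thresholds, node names, and queue lengths are made up):

import random

T, POLL_LIMIT = 3, 4
queue_len = {"A": 5, "B": 1, "C": 2, "D": 4}       # current task queue lengths

def try_transfer(sender, nodes):
    if queue_len[sender] <= T:                     # transfer policy: a sender?
        return None
    polled = random.sample([n for n in nodes if n != sender],
                           min(POLL_LIMIT, len(nodes) - 1))
    best = min(polled, key=lambda n: queue_len[n]) # shortest queue wins
    if queue_len[best] + 1 <= T:                   # receiver must stay below T
        queue_len[sender] -= 1
        queue_len[best] += 1                       # newly arrived task moves
        return best
    return None

print(try_transfer("A", list(queue_len)))          # transfers to "B"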
B. Prabhakaran 11
Sender-initiated
[Figure: flowchart of the sender-initiated algorithm.]
B. Prabhakaran 12
Sender-initiated
Information Policy: demand-driven.
Stability: can become unstable at high loads.
At high loads, it may become difficult for senders to find
receivers.
Also, the number of senders increases at high system loads,
thereby increasing the polling activity.
Polling activity may make the system unstable at high loads.
B. Prabhakaran 13
Receiver-initiated
Transfer Policy: uses thresholds. Queue lengths below T
identify receivers and those above T identify senders.
Selection Policy: as before.
Location Policy: Polling.
A random node is polled to check if a task transfer would place its
queue length below a threshold.
If not, the polled node transfers a task.
Otherwise, poll another node till a static PollLimit is reached.
If all polls fail, wait until another task is completed before starting
polling operation.
Information policy: demand-driven.
Stability: Not unstable since there are lightly loaded systems
that have initiated the algorithm.
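A matching sketch of the receiver side, assuming an underloaded node polls up to PollLimit random nodes when its queue drops below T (all names and numbers illustrative):

import random

T, POLL_LIMIT = 2, 3
queue_len = {"A": 0, "B": 4, "C": 1, "D": 5}

def try_acquire(receiver, nodes):
    if queue_len[receiver] >= T:                   # not a receiver
        return None
    for node in random.sample([n for n in nodes if n != receiver],
                              min(POLL_LIMIT, len(nodes) - 1)):
        # polled node hands over a task only if that keeps it at/above T
        if queue_len[node] - 1 >= T:
            queue_len[node] -= 1
            queue_len[receiver] += 1
            return node
    return None      # all polls failed: wait for a task completion, retry

print(try_acquire("A", list(queue_len)))           # "B" or "D", by poll order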
B. Prabhakaran 14
Receiver-initiated
[Figure: flowchart of the receiver-initiated algorithm (wait for some time and retry when all polls fail).]
B. Prabhakaran 15
Receiver-initiated
Drawback:
Polling initiated by receiver implies that it is difficult to find
senders with new tasks.
Reason: systems try to schedule tasks as and when they arrive.
Effect: receiver-initiated approach might result in preemptive
transfers. Hence transfer costs are more.
Sender-initiated: transfer costs are low as new jobs are
transferred and so no need for transferring task states.
B. Prabhakaran 16
Symmetric
Senders search for receivers and vice-versa.
Low loads: senders can find receivers easily. High loads:
receivers can find senders easily.
May have disadvantages of both: polling at high loads can
make the system unstable. Receiver-initiated task transfers
can be preemptive and so expensive.
Simple algorithm: combine previous two approaches.
Above-average algorithm:
Transfer Policy: Two adaptive thresholds instead of one. If a
node’s estimated average load is A, a higher threshold TooHigh >
A and a lower threshold TooLow < A are used.
Load < TooLow -> receiver. Load > TooHigh -> sender.
B. Prabhakaran 17
Above-average Algorithm
Location policy:
Sender Component
Node with TooHigh load, broadcasts a TooHigh message, sets TooHigh
timer, and listens for an Accept message.
A receiver that gets the (TooHigh) message sends an Accept message,
increases its load, and sets AwaitingTask timer.
If the AwaitingTask timer expires, load is decremented.
On receiving the Accept message: if the node is still a sender, it
chooses the best task to transfer and transfers it to the node.
When sender is waiting for Accept, it may receive a TooLow message
(receiver initiated). Sender sends TooHigh to that receiver. Do step 2 &
3.
On expiration of TooHigh timer, if no Accept message is received,
system is highly loaded. Sender broadcasts a ChangeAverage message.
B. Prabhakaran 18
Above-average Algorithm...
Receiver Component
Node with TooLow load, broadcasts a TooLow message, sets a
TooLow timer, and listens for TooHigh message.
If TooHigh message is received, do step 2 & 3 in Sender
Component.
If TooLow timer expires before receiving any TooHigh message,
receiver broadcasts a ChangeAverage message to decrease the load
estimate at other nodes.
Selection Policy: as discussed before.
Information policy: demand driven. The average load estimate is
modified based on system load. At high loads, the number of
senders progressively decreases.
Average system load is determined individually. There is a range of
acceptable load before trying to be a sender or a receiver.
B. Prabhakaran 19
Adaptive Algorithms
Limit Sender’s polling actions at high load to avoid instability.
Utilize the collected state information during previous polling operations
to classify nodes as: Sender/overloaded, receiver/underloaded, OK (in
acceptable load range).
Maintained as separate lists for each class.
Initially, each node assumes that all others are receivers.
Location policy at sender:
Sender polls the head of the receiver list.
Polled node puts the sender at the head of its sender list. It informs the
sender whether it is a receiver, a sender, or an OK node.
If the polled node is still a receiver, the new task is transferred.
Else the sender updates the polled node’s status, polls the next potential
receiver.
If this polling process fails to identify a receiver, the task can still be
transferred during a receiver-initiated dialogue.
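A minimal sketch of the sender's location policy with node lists; the reply handling and reclassification details are simplified, not the exact protocol:

# Adaptive algorithm sketch: the sender polls the head of its receiver
# list and reclassifies nodes from their replies.
lists = {"receivers": ["B", "C"], "senders": [], "ok": []}

def classify(node, status):
    for l in lists.values():                       # remove from old class
        if node in l:
            l.remove(node)
    lists[status].append(node)                     # file under reported class

def sender_poll(poll_status):
    # poll_status: what each polled node reports about itself
    while lists["receivers"]:
        node = lists["receivers"][0]               # head of the receiver list
        status = poll_status[node]
        classify(node, status)
        if status == "receivers":
            return node                            # still a receiver: transfer
    return None          # rely on a later receiver-initiated dialogue

print(sender_poll({"B": "senders", "C": "receivers"}))   # -> "C"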
B. Prabhakaran 20
Adaptive Algorithms…
Location policy at receiver
Receivers obtain tasks from potential senders. Lists are scanned in
the following order.
Head to tail in senders list (most up-to-date info used), tail to head in
OK list (least up-to-date used), tail to head in receiver list.
Least up-to-date used in the hope that status might have changed.
Receiver polls the selected node. If the node is a sender, a task is
transferred.
If the node is not a sender, both the polled node and receiver update
each other’s status.
Polling process stops if a sender is found or a static PollLimit is
reached.
B. Prabhakaran 21
Adaptive Algorithms…
At high loads, sender-initiated polling gradually reduces as
nodes get removed from receiver list (and become senders).
Whereas at low loads, sender will generally find some receiver.
At high loads, receiver-initiated works and can find a sender.
At low loads, receiver may not find senders, but that does not affect
the performance.
Algorithm dynamically becomes sender-initiated at low
loads and receiver-initiated at high loads.
Hence, algorithm is stable and can use non-preemptive
transfers at low loads (sender initiated).
B. Prabhakaran 22
Selecting an Algorithm
If a system never gets highly loaded, sender-initiated
algorithms work better.
Stable, receiver-initiated algorithms better for high loads.
Widely fluctuating loads: stable, symmetric algorithms.
Widely fluctuating loads + high migration cost for
preemptive transfers: stable, sender-initiated algorithms.
Heterogeneous work arrival: stable, adaptive algorithms.
B. Prabhakaran 23
Performance Comparison
[Figure: mean response time versus offered system load for sender-initiated (SEND), receiver-initiated (RECV), and symmetric (SYM) algorithms.]
B. Prabhakaran 24
Implementation Issues
Task placement
A task that is yet to begin is transferred to a remote machine, and
starts its execution there.
Task migration
State Transfer:
State includes contents of registers, task stack, task status
Unfreeze:
Task is installed at the new machine, unfrozen, and is put in the
ready queue.
B. Prabhakaran 25
State Transfer
Issues to be considered:
Cost to support remote execution including delays due to freezing
the task.
Duration of freezing the task should be small. Otherwise, it can
affect the task's interactions with other processes (e.g., cause timeouts).
B. Prabhakaran 26
State Transfer...
Residual dependencies: portions of the task's state remain on the
former host after migration. Disadvantages of residual dependencies:
Affects reliability as it becomes dependent on former host
Affects performance since memory accesses can become slow as
pages need to be transferred
Affects complexity as task states are distributed on several hosts.
Location transparency:
Migration should hide the location of tasks
Message passing, process naming, file handling, and other activities
should be independent of the task's actual location.
Task names and their locations can be maintained as hints. If the
hints fail, they can be updated by a broadcast query or through a
name server.
B. Prabhakaran 27
Recovery
Failure of a site/node in a distributed system causes
inconsistencies in the state of the system.
Recovery: bringing back the failed node in step with other
nodes in the system.
Failures:
Process failure:
Deadlocks, protection violation, erroneous user input, etc.
System failure:
Failure of processor/system. System failure can have full/partial
amnesia.
It can be a pause failure (system restarts at the same state it was
in before the crash) or a complete halt.
Secondary storage failure: data inaccessible.
Communication failure: network inaccessible.
B. Prabhakaran 1
Fault-to-Recovery
[Figure: a fault (manufacturing, design, external, or fatigue) produces an erroneous system state, which leads to system failure.]
B. Prabhakaran 2
Backward & Forward Recovery
Forward Recovery:
Assess damages that could be caused by faults, remove those
damages (errors), and help processes continue.
Difficult to do forward assessment. Generally tough.
Backward Recovery:
When forward assessment not possible. Restore processes to
previous error-free state.
Expensive to roll back states (cost of fault + recovery).
Unrecoverable actions: print outs, cash dispensed at ATMs.
B. Prabhakaran 3
Recovery System Model
For Backward Recovery
A single system with secondary and stable storage
Stable storage does not lose information on failures
Stable storage used for logs and recovery points
Stable storage assumed to be more secure than secondary
storage.
Data on secondary storage assumed to be archived
periodically.
B. Prabhakaran 4
Approaches
Operation-based Approach
Maintaining logs: all modifications to the state of a process are
recorded in sufficient detail so that a previous state can be restored
by reversing all changes made to the state.
(e.g.,) Commit in database transactions: if a transaction is
committed by all nodes, then its changes are permanent. If it does
not commit, the effects of the transaction are undone.
Updating-in-place: every write (update) results in a log of (1) object
name, (2) old object state, (3) new object state. Operations:
A do operation updates the object & writes the log;
an undo operation uses the log to restore the old state;
a redo operation uses the log to install the new state.
B. Prabhakaran 6
Recovery in Concurrent Systems
Distributed system state involves message exchanges.
In distributed systems, rolling back one process can cause
the roll back of other processes.
Orphan messages & the Domino effect: Assume Y fails
after sending m.
X has record of m at x3 but Y has no record. m -> orphan
message.
Y rolls back to y2 -> X should go to x2.
If Z rolls back, X and Y has to go to x1 and y1 -> Domino effect,
roll back of one process causes one or more processes to roll back.
[Figure: space-time diagram for the domino effect; X has checkpoints x1-x3, Y has y1-y2 and sends m to X, and Z has z1-z2.]
B. Prabhakaran 7
Lost Messages
If Y fails after receiving m, it will rollback to y1.
X will rollback to x1
m will be a lost message as X has recorded it as sent and Y
has no record of receiving it.
[Figure: Y fails after receiving m and rolls back to y1; X rolls back to x1, and m becomes a lost message.]
B. Prabhakaran 8
Livelocks
[Figure: livelock scenario. Y's failure and rollback to y1 orphan m1, forcing X back to x1; message n1 (and later n2, m2), still in transit, triggers a second rollback, and the pattern can repeat indefinitely.]
B. Prabhakaran 9
Consistent Checkpoints
Overcoming domino effect and livelocks: checkpoints
should not have messages in transit.
Consistent checkpoints: no message exchange between any
pair of processes in the set as well as outside the set during
the interval spanned by checkpoints.
{x1,y1,z1} is a strongly consistent checkpoint.
[Figure: the earlier space-time diagram; {x1, y1, z1} is a strongly consistent set of checkpoints.]
B. Prabhakaran 10
Synchronous Approach
Checkpointing:
First phase:
An initiating process, Pi, takes a tentative checkpoint.
Pi requests all other processes to take tentative checkpoints.
Every process informs Pi whether it was able to take a checkpoint.
A process can fail to take a checkpoint due to the nature of
application (e.g.,) lack of log space, unrecoverable transactions.
Second phase:
If all processes took checkpoints, Pi decides to make the
checkpoint permanent.
Otherwise, checkpoints are to be discarded.
Pi conveys this decision to all the processes as to whether
checkpoints are to be made permanent or to be discarded.
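A minimal sketch of the two-phase structure above, with message passing collapsed into function calls and failure handling omitted:

# Two-phase checkpointing sketch: tentative checkpoints become permanent
# only if every process succeeded.
def take_tentative(process):
    ok = process["can_checkpoint"]         # may fail, e.g., no log space
    if ok:
        process["tentative"] = process["state"]
    return ok

def coordinate(initiator, others):
    procs = [initiator] + others
    if all(take_tentative(p) for p in procs):      # phase 1
        for p in procs:                            # phase 2: make permanent
            p["permanent"] = p.pop("tentative")
        return "permanent"
    for p in procs:                                # phase 2: discard
        p.pop("tentative", None)
    return "discarded"

pi = {"state": "s0", "can_checkpoint": True}
pj = {"state": "s1", "can_checkpoint": False}
print(coordinate(pi, [pj]))    # "discarded": one process could not comply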
B. Prabhakaran 11
Assumptions: Synchronous Appr.
Processes communicate by exchanging messages through
communication channels
Channels are FIFO in nature.
End-to-end protocols (e.g. TCP) are assumed to cope with
message loss due to rollback recovery and communication
failure.
Communication failures do not partition the network.
A process is not allowed to send messages between phase 1
and 2.
B. Prabhakaran 12
Synchronous Approach...
Optimization:
Taking a checkpoint is expensive and the algorithm discussed may
take unnecessary checkpoints.
[Figure: X initiates checkpointing at x3; without the optimization, Y, Z, and W all take (possibly unnecessary) checkpoints y3, z3, w3.]
B. Prabhakaran 13
Synchronous Approach...
Optimization:
Taking a checkpoint is expensive and the algorithm discussed may
take unnecessary checkpoints.
[Figure: the same scenario with the optimization; only the processes whose messages precede the checkpoint request take new checkpoints.]
B. Prabhakaran 14
Checkpointing Optimization
Each process uses monotonically increasing labels in its
outgoing messages.
Notations:
L: largest label. S: smallest label
Let m be the last message X received from Y after X’s last permanent
checkpoint. last_label_recdx[Y] = m.l, if m exists. Otherwise, it is set to
S.
Let m be the first message X sent to Y after checkpointing at X
(permanent or temporary). first_label_sentx[Y] = m.l, if exists. Otherwise,
set to L.
For a checkpointing request to Y, X sends last_label_recdx[Y].
Y takes a temporary checkpoint iff last_label_recdx[Y] >=
first_label_senty[X]. i.e., X has received 1 or more messages after
checkpointing by Y and hence Y should take checkpoint.
ckpt_cohortx = {Y | last_label_recdx[Y] > S}, i.e., the set of all processes
from which X has received messages after its checkpoint.
B. Prabhakaran 15
Checkpointing Optimization
Initial state at all processes p:
first_label_sentp[q] := L.
OK_to_take_ckptp := “yes” if p is willing; “no” otherwise.
At initiator Pi:
for all p in ckpt_cohortpi do send Take_a_tentative_ckpt
(Pi, last_label_recdpi[p]) message.
if all processes replied “yes”, then for all p in ckpt_cohortpi do
send Make_tentative_ckpt_permanent.
else send Undo_tentative_ckpt.
At all processes p:
Upon receiving Take_a_tentative_ckpt message from q do
if OK_to_take_ckptp = “yes” AND last_label_recdq[p] >=
first_label_sentp[q] then
take a tentative checkpoint.
B. Prabhakaran 16
Checkpointing Optimization...
At all processes p (contd.):
take a tentative checkpoint.
for all processes r in ckpt_cohortp do send Take_a_tentative_ckpt
(p, last_label_recdp[r]) message.
if all processes r replied “yes” then OK_to_take_ckptp := “yes”
else OK_to_take_ckptp := “no”.
send (p, OK_to_take_ckptp) to q.
Upon receiving Make_tentative_ckpt_permanent message do
make the tentative checkpoint permanent.
for all processes r in ckpt_cohortp do send
Make_tentative_ckpt_permanent message.
Upon receiving Undo_tentative_ckpt message do
undo the tentative checkpoint.
for all processes r in ckpt_cohortp do send Undo_tentative_ckpt
message.
B. Prabhakaran 17
Synchronous Rollback
Rolling back:
First phase:
Pi initiates a rollback asking if all processes are willing to rollback
to the previous checkpoint.
Any process may say no, if it is involved in another recovery
process.
Second phase:
Pi conveys the decision on agreement to all others.
[Figure: X fails after checkpoint x2; on recovery, the processes roll back to the checkpoints {x2, y2, z2}.]
B. Prabhakaran 18
Rollback Optimization
Additional Notation:
Let m be the last message X sent to Y (as per the state X restarts from).
last_label_sentx[Y] = m.l, if m exists. Otherwise, set to S.
When X requests Y to restart from the permanent checkpoint, it sends
last_label_sentx[Y] along with its request. Y will restart from its permanent
checkpoint only if: last_label_recdy[X] > last_label_sentx[Y]
roll_cohortx = {Y | X can send messages to Y}
Algorithm:
Initial State at all processes p:
resume_executionp := true;
for all processes q do last_label_recdp[q] := S;
willing_to_rollp = “yes” if p is willing to roll back. “no” otherwise.
At initiator process Pi:
for all p in roll_cohortpi do send Prepare_to_rollback (Pi,
last_label_sentPi[p]) message.
B. Prabhakaran 19
Rollback Optimization...
At initiator process Pi...
if all processes reply “yes”, then for all p in roll_cohortpi do send
Roll_back message.
else for all p in roll_cohortpi do send Donot_roll_back message.
At all processes p:
Upon receiving Prepare_to_rollback (q, last_label_sentq[p])
message from q do
if willing_to_rollp AND last_label_recdp[q] >
last_label_sentq[p] AND resume_executionp then
resume_executionp := false;
for all r in roll_cohortp do send Prepare_to_rollback (p,
last_label_sentp[r]) message;
if all r in roll_cohortp replied “yes” then willing_to_rollp :=
“yes”
else willing_to_rollp := “no”
B. Prabhakaran 21
Rollback Optimization...
[Figure: rollback example with message labels (in parentheses); checkpoints x1, y1, z1 together with the labels of the last messages sent/received determine which processes must restart.]
B. Prabhakaran 22
Rollback Optimization...
[Figure: a variation of the same example with different message labels, under which fewer processes need to roll back.]
B. Prabhakaran 23
Asynchronous Approach
Disadvantages of Synchronous Approach:
Additional message exchanges for taking checkpoints
Delays normal executions as messages cannot be exchanged
during checkpointing.
Unnecessary overhead if no failures occur between checkpoints.
Asynchronous approach: independent checkpoints at each
processor. Identify a consistent set of checkpoints if needed,
for roll backs.
E.g., {x3,y3,z2} not consistent; {x2,y2,z2} consistent. Used for
rollback
[Figure: independent checkpoints x1-x3 on X, y1-y3 on Y, and z1-z2 on Z, with messages exchanged between the processes.]
B. Prabhakaran 24
Asynchronous Approach...
Assumption: 2 types of logging.
Volatile logging: takes less time but contents lost on failure. Periodically
flushed to stable logs.
Stable log: may take more time but contents not lost.
Logging: tuple {s, m, msgs_sent}. s process state, m message received,
msgs_sent the set of messages sent during the event.
Event logging initiated on message receipt.
Notations & data structures:
RCVDi<-j (CkPti): Number of messages received by processor Pi from
Pj as per checkpoint CkPti.
SENTi->j(CkPti): Number of messages sent by processor Pi to Pj as per
checkpoint CkPti.
Basic Idea:
Each processor keeps track of the number of messages sent/ received to/
from other processors.
B. Prabhakaran 25
Asynchronous Approach...
Basic Idea ....
Existence of orphan messages identified by comparing the number
of messages sent and received.
If the number of received messages > sent messages -> presence of
orphans -> the receiving process needs to roll back.
Algorithm:
A recovering processor broadcasts a message to all processors.
if Pi is the recovering processor, CkPti := latest stable log.
else CkPti := latest event that took place in i.
for k := 1 to N do (N the total number of processors in the system)
for each neighboring processor j do send ROLLBACK
(i,SENTi->j(CkPti)) message.
Wait for ROLLBACK message from every neighbor.
B. Prabhakaran 26
Asynchronous Approach...
Algorithm ...
for every ROLLBACK(j,c) message received from a neighbor j, i
does the following:
if RCVDi<-j(CkPti) > c then /* orphans present */
find the latest event e such that RCVDi<-j(e) = c;
CkPti := e.
end for k.
Algorithm has |N| iterations.
During the kth (k != 1) iteration, Pi, based on the CkPti determined
in the (k-1)th iteration, computes SENTi->j(CkPti) for each neighbor.
This value is sent in a ROLLBACK message (in kth iteration)
At the end of each iteration, at least 1 processor will roll back to its
final recovery point.
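A minimal sketch of the iteration, assuming each logged event stores cumulative per-neighbor sent/received counts (the event logs below are a made-up two-process scenario):

# Asynchronous recovery sketch: each round compares received counts
# against the sender's counts and trims orphaned receipts.
events = {
    "X": [{"sent": {"Y": 0}, "rcvd": {"Y": 0}},
          {"sent": {"Y": 1}, "rcvd": {"Y": 1}},
          {"sent": {"Y": 1}, "rcvd": {"Y": 2}}],
    "Y": [{"sent": {"X": 1}, "rcvd": {"X": 0}},
          {"sent": {"X": 2}, "rcvd": {"X": 1}}],
}
ckpt = {"X": 2, "Y": 0}    # Y recovers at its last stable log entry

def recover(procs):
    for _ in range(len(procs)):                    # N iterations
        sent = {i: events[i][ckpt[i]]["sent"] for i in procs}
        for i in procs:
            for j in procs:
                if i == j:
                    continue
                c = sent[j].get(i, 0)              # ROLLBACK(j, c) from j
                while events[i][ckpt[i]]["rcvd"].get(j, 0) > c:
                    ckpt[i] -= 1                   # orphan receipts: roll back
    return ckpt

print(recover(["X", "Y"]))   # {'X': 1, 'Y': 0}: X discards its orphan receipt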
B. Prabhakaran 27
Asynch. Approach Example
[Figure: processor X with recovery point x1 followed by logged events ex1-ex3.]
B. Prabhakaran 29
Distributed Databases
Checkpointing objectives in distributed database systems
(DDBS):
Normal operations should be minimally interfered with, by
checkpointing.
Since a DDBS may update different objects at different sites, local
checkpointing at each site is better.
For faster recovery, checkpoints be consistent (desirable property).
Activity in a DDBS is in terms of transactions. So in a DDBS, a
consistent checkpoint should either include the updates of a
transaction completely or not include them at all.
Issues in identifying checkpoints:
How sites agree on what transactions are to be included
Taking checkpoints without interference
B. Prabhakaran 30
DDBS Checkpointing
Assumptions:
Basic unit of activity is transactions
Transactions follow some concurrency control protocol
Lamport’s logical clocks used for time-stamping transactions.
Failures detected by network protocols or timeouts
Network partitioning never occurs
Basic Idea
All sites agree on a Global Checkpoint Number (GCPN)
Transactions with timestamps <= GCPN are included in the
checkpoint. Called BCPTs: Before Checkpoint Transactions.
Timestamps of After Checkpoint Transactions (ACPTs) > GCPN.
Each site keeps multiple versions of data items being updated by ACPTs
in volatile storage -> no interference during checkpointing.
B. Prabhakaran 31
DDBS Checkpointing ...
Data Structures
LC: local clock as per Lamport’s logical clock
LCPN (local checkpoint number): determined locally for the
current checkpoint.
Algorithm: initiated by checkpoint coordinator (CC). CC
uses checkpoint subordinates (CS).
Phase 1 at the CC
CC broadcasts a Checkpoint_Request message with a local
timestamp LCcc.
LCPNcc := LCcc
CONVERTcc := false
Phase 1 at CSs
B. Prabhakaran 32
DDBS Checkpointing ...
Phase 1 at CSs
On receiving a Checkpoint_Request message, a site m, updates its
local clock as LCm := MAX(LCm, LCcc+1)
LCPNm := LCm
m informs LCPNm to the CC
CONVERTm := false
m marks all the transactions with timestamps !> LCPNm as BCPTs
and the rest as temporary-ACPTs.
All updates of temporary-ACPTs are stored in the buffers of the
ACPTs
If a temporary-ACPT commits, updates are not flushed to the
database but maintained as committed temporary versions (CTVs).
Other transactions access CTVs for reads. For writes, another
version of CTV is created.
B. Prabhakaran 33
DDBS Checkpointing ...
Phase 2 at CC
All CS’s replies received -> GCPN := Max(LCPN1, .., LCPNn)
Broadcast GCPN
Phase 2 at the CSs
On receiving GCPN, m marks all temporary-ACPTs that satisfy the
following conditions as BCPTs:
LCPNm < transaction time stamp <= GCPN
Updates of the above converted BCPTs are included in checkpoints
CONVERTm := true (i.e., GCPN & BCPTs identified)
When all BCPTs terminate and CONVERTm = true, m takes a local
checkpoint by saving the state of the data objects.
After local checkpointing, database is updated with CTVs and
CTVs are deleted.
B. Prabhakaran 34
Fault Tolerance
Recovery: bringing back the failed node in step with other
nodes in the system.
Fault Tolerance: Increase the availability of a service or the
system in the event of failures. Two ways of achieving it:
Masking failures: Continue to perform its specified function in the
event of failure.
Well defined failure behavior: System may or may not function in
the event of failure, but can facilitate actions suitable for recovery.
(e.g.,) effect of database transactions visible only if committed
to by all sites. Otherwise, transaction is undone without any
effect to other transactions.
Key approach to fault tolerance: redundancy. e.g., multiple
copies of data, multiple processes providing same service.
Topics to be discussed: commit protocols, voting protocols.
B. Prabhakaran 1
Atomic Actions
Example: Processes P1 & P2 share a data named X.
P1: ... lock(X); X:= X + Z; unlock(X); ...
P2: ... lock(X); X := X + Y; unlock(X); ...
Updating of X by P1 or P2 should be done atomically i.e.,
without any interruption.
Atomic operation if:
the process performing it is not aware of existence of any others.
the process doing it does not communicate with others during the
operation time.
No other state change in the process except the operation.
Effects on the system gives an impression of indivisible and
perhaps instantaneous operation.
B. Prabhakaran 2
Committing
A group of actions is grouped as a transaction and the group
is treated as an atomic action.
The transaction, during the course of its execution, decides
to commit or abort.
Commit: guarantee that the transaction will be completed.
Abort: guarantee not to do the transaction and erase any part
of the transaction done so far.
Global atomicity: (e.g.,) A distributed database transaction
that must be processed at every or none of the sites.
Commit protocols: are ones that enforce global atomicity.
B. Prabhakaran 3
2-phase Commit Protocol
Distributed transaction carried out by a coordinator + a set of
cohorts executing at different sites.
Phase 1:
At the coordinator:
Coordinator sends a COMMIT-REQUEST message to every cohort
requesting them to commit.
Coordinator waits for reply from all others.
At the cohorts:
On receiving the request: if the transaction execution is successful,
the cohort writes UNDO and REDO log on stable storage. Sends
AGREED message to coordinator.
Otherwise, sends an ABORT message.
Phase 2:
At the coordinator:
All cohorts agreed? : write a COMMIT record on log, send
COMMIT request to all cohorts.
B. Prabhakaran 4
2-phase Commit Protocol ...
Phase 2...:
At the coordinator...:
Otherwise, send an ABORT message
Coordinator waits for acknowledgement from each cohort.
No acknowledgement within a timeout period? : resend the
commit/abort message to that cohort.
All acknowledgements received? : write a COMPLETE record
to the log.
At the cohorts:
On COMMIT message: resources & locks for the transaction released.
Send Acknowledgement to the coordinator.
On ABORT message: undo the transaction using the UNDO log, release
resources & locks held by the transaction, send Acknowledgement.
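A minimal sketch of the protocol's decision logic, with messages collapsed into function calls and logging, acks, and retransmission elided:

# Two-phase commit sketch:
def cohort_prepare(ok):
    if ok:
        # write UNDO/REDO log on stable storage, then agree
        return "AGREED"
    return "ABORT"

def coordinator_decide(votes):
    # Phase 2: all agreed -> write COMMIT record and broadcast COMMIT
    return "COMMIT" if all(v == "AGREED" for v in votes) else "ABORT"

votes = [cohort_prepare(True), cohort_prepare(True), cohort_prepare(False)]
print(coordinator_decide(votes))   # "ABORT": one cohort could not execute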
B. Prabhakaran 5
Handling failures
2-phase commit protocol handles failures as below:
If coordinator crashes before writing the COMMIT record:
on recovery, it will send ABORT message to all others.
Cohorts who agreed to commit, will simply undo the transaction
using the UNDO log and abort.
Other cohorts will simply abort.
All cohorts are blocked till coordinator recovers.
Coordinator crashes after writing COMMIT but before writing
COMPLETE? : on recovery, broadcast COMMIT and wait for acks.
Cohort crashes in phase 1? : coordinator aborts the transaction.
Cohort crashes in phase 2? : on recovery, it will check with the
coordinator whether to abort or commit.
Drawback: blocking protocol. Cohorts blocked if coordinator
fails.
Resources and locks held unnecessarily.
B. Prabhakaran 6
2-phase commit: State Machine
Synchronous protocol: all sites proceed in rounds, i.e., a site
never leads another by more than 1 state transition.
A state transition occurs in a process participating in the 2-
phase commit protocol whenever it receives/sends
messages.
States: q (idle or querying state), w (wait), a (abort), c
(commit).
When coordinator is in state q, cohorts are in q or a.
Coordinator in w -> cohort can be in q, w, or a.
Coordinator in a/c -> cohort is in w or a/c.
A cohort in a/c: other cohorts may be in a/c or w.
A site is never in c when another site is in q as the protocol
is synchronous.
B. Prabhakaran 7
2-phase commit: State Machine...
[Figure: state machines for the 2-phase commit protocol. Coordinator: q1 -> w1
on sending Commit_Request to all cohorts; w1 -> c1 when all agree (Commit sent
to all); w1 -> a1 on one or more Abort replies (Abort sent to all). Cohort i:
qi -> wi on receiving Commit_Request and sending Agreed; qi -> ai on receiving
Commit_Request and sending Abort; wi -> ci on Commit from the coordinator;
wi -> ai on Abort from the coordinator.]
B. Prabhakaran 8
Drawback
Drawback: blocking protocol. Cohorts blocked if
coordinator fails.
Resources and locks held unnecessarily.
Conditions that cause blocking:
Assume that only one site is operational. This site cannot decide
to abort a transaction as some other site may be in commit state.
It cannot commit as some other site can be in abort state.
Hence, the site is blocked until all failed sites recover.
B. Prabhakaran 9
Nonblocking Commit
Nonblocking commit? :
Sites should agree on the outcome by examining their local states.
A failed site, upon recovery, should reach the same conclusion
regarding the outcome. Consistent with other working sites.
Independent recovery: if a recovering site can decide on the final
outcome based solely on its local state.
A nonblocking commit protocol can support independent recovery.
Notations:
Concurrency set: Let Si denote the state of the site i. The set of all
the states that may be concurrent with it is concurrency set (C(si)).
(e.g.,) Consider a system having 2 sites. If site 2’s state is w2, then
C(w2) = {c1, a1, w1}. C(q2) = {q1, w1}. a1, c1 are not in C(q2) as the
2-phase commit protocol is synchronous within 1 state transition.
Sender set: Let s be any state, M be the set of all messages
received in s. Sender set, S(s) = {i | site i sends m and m in M}
B. Prabhakaran 10
3-phase Commit
Lemma: If a protocol contains a local state of a site with
both abort and commit states in its concurrency set, then
under independent recovery conditions it is not resilient to
an arbitrary single failure.
In the previous figure, C(w2) contains both abort and commit
states in the concurrency set.
To make it a non-blocking protocol: introduce a buffer state
at both the coordinator and the cohorts.
Now, C(w1) = {q2, w2, a2} and C(w2) = {a1, p1, w1}.
B. Prabhakaran 11
3-phase commit: State Machine
[Figure: state machines for the 3-phase commit protocol. Coordinator: q1 -> w1
on sending Commit_Request to all cohorts; w1 -> p1 when all agree (Prepare sent
to all); p1 -> c1 when all cohorts ack (Commit sent to all); w1 -> a1 on one or
more Abort replies. Cohort i: qi -> wi (Agreed sent) or qi -> ai (Abort sent) on
receiving Commit_Request; wi -> pi on Prepare (Ack sent); pi -> ci on Commit
from the coordinator; wi -> ai on Abort from the coordinator.]
B. Prabhakaran 12
Failure, Timeout Transitions
A failure transition occurs at a failed site at the instant it
fails or immediately after it recovers from the failure.
Rule for failure transition: For every non-final state s (i.e., qi, wi,
pi) in the protocol, if C(s) contains a commit, then assign a failure
transition from s to a commit state in its FSA. Otherwise, assign a
failure transition from s to an abort state.
Reason: pi is the only state with a commit state in its concurrency
set. If a site fails at pi, then it can commit on recovery. Any other
state failure, safer to abort.
If site i is waiting on a message from j, i can time out. i can
determine the state of j based on the expected message.
Based on j’s state, the final state of j can be determined
using failure transition at j.
B. Prabhakaran 13
Failure, Timeout Transitions
This can be used for incorporating Timeout transitions at i.
Rule for timeout transition: For each nonfinal state s, if site j in
S(s),and site j has a failure transition from s to a commit (abort)
state, then assign a timeout transition from s to a commit (abort)
state.
Reason:
Failed site makes a transition to a commit (abort) state using failure
transition rule.
So, the operational site must make the same transition to ensure that
the final outcome is the same at all sites.
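A minimal sketch of the failure-transition rule; the concurrency sets are illustrative for a cohort (only C(w2) is given explicitly in these slides, and C(p) is an assumption):

# Failure-transition rule: a non-final state fails over to commit iff
# its concurrency set contains a commit state; otherwise to abort.
concurrency = {
    "q": {"q1", "w1"},          # from the slides
    "w": {"a1", "p1", "w1"},    # from the slides
    "p": {"p1", "c1", "w1"},    # assumed: p is concurrent with commit
}

def failure_transition(state):
    if any(s.startswith("c") for s in concurrency[state]):
        return "commit"
    return "abort"

for s in "qwp":
    print(s, "->", failure_transition(s))   # only p fails over to commit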
B. Prabhakaran 14
3-phase commit + Failure Trans.
[Figure: the 3-phase commit state machines annotated with failure (F) and
timeout (T) transitions; the q and w states move to abort, and the prepare
states fail over to commit, as described in the following slides.]
Nonblocking Commit Protocol
Phase 1:
First phase identical to that of 2-phase commit, except for failures.
Here, coordinator is in w1 and each cohort is in a or w or q,
depending on whether it has received the commit_request message
or not.
Phase 2:
Coordinator sends a Prepare message to all the cohorts (if all of
them sent Agreed message in phase 1).
Otherwise, it will send an Abort message to them.
On receiving a Prepare message, a cohort sends an
acknowledgement to the coordinator.
If the coordinator fails before sending a Prepare message, it
aborts the transaction on recovery.
Cohorts, on timing out on a Prepare message, also abort the
transaction.
B. Prabhakaran 16
Nonblocking Commit Protocol
Phase 3:
On receiving acknowledgements to Prepare messages, the coordinator
sends a Commit message to all cohorts.
Cohort commits on receiving this message.
Coordinator fails before sending Commit? : commits upon
recovery.
So cohorts, on timing out on the Commit message, commit the transaction.
Cohort failed before sending an acknowledgement? : coordinator
times out and sends an abort message to all others.
Failed cohort aborts the transaction upon recovery.
B. Prabhakaran 17
Commit Protocols Disadvantages
No protocol using the above independent recovery technique
is resilient to simultaneous failure of more than 1 site.
The above protocol is also not resilient to network
partitioning.
Alternative: Use voting protocols.
Basic idea of voting protocol:
Each replica assigned some number of votes
A majority of votes need to be collected before accessing a replica.
Voting mechanism: more fault tolerant to site failures, network
partitions, and message losses.
Types of voting schemes:
Static
Dynamic
B. Prabhakaran 18
Static Voting Scheme
System Model:
File replicas at different sites. File lock rule: either one writer + no
reader or multiple readers + no writer.
Every file is associated with a version number that gives the number
of times a file has been updated.
Version numbers are stored on stable storage. Every successful write
updates version number.
Basic Idea:
Every replica assigned a certain number of votes. This number stored
on stable storage.
A read or write operation permitted if a certain number of votes, called
read quorum or write quorum, are collected by the requesting process.
Voting Algorithm:
Let a site i issue a read or write request for a file.
Site i issues a Lock_Request to its local lock manager.
B. Prabhakaran 19
Static Voting ...
Voting Algorithm...:
When lock request is granted, i sends a Vote_Request message to
all the sites.
When a site j receives a Vote_Request message, it issues a
Lock_Request to its lock manager. If the lock request is granted,
then it returns to site i the version number of the replica (VNj) and
the number of votes assigned to the replica (Vj) at site j.
Site i decides whether it has the quorum or not, based on replies
received within a timeout period as follows.
For read requests, Vr = Sum of Vk, k in P, where P is the set of
sites from which replies were received.
For write requests, Vw = Sum of Vk, k in Q, where:
M = max{VNj : j in P}
Q = {j in P : VNj = M}
Only the votes of the current (version) replicas are counted in
deciding the write quorum.
B. Prabhakaran 20
Static Voting ...
Voting Algorithm...:
If i is not successful in getting the quorum, it issues a
Release_Lock to the lock manager & to all sites that gave their votes.
If i is successful in collecting the quorum, it checks whether its
copy of file is current (VNi = M). If not, it obtains the current
copy.
If the request is read, i reads the local copy. If write, i updates the
local copy and VN.
i sends all updates and VNi to all sites in Q, i.e., update only
current replicas. i sends a Release_Lock request to its lock
manager as well as those in P.
All sites on receiving updates, perform updates. On receiving
Release_Lock, releases lock.
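A minimal sketch of the quorum tests, assuming each reply carries the replica's votes and version number (sites and numbers invented; here the total is v = 5, with S2 not replying):

replies = {"S1": (1, 3), "S3": (2, 3), "S4": (1, 2)}   # site -> (votes, VN)

def have_quorum(op, replies, r, w):
    if op == "read":
        return sum(v for v, _ in replies.values()) >= r
    # writes: only votes of replicas at the latest version M count
    M = max(vn for _, vn in replies.values())
    Vw = sum(v for v, vn in replies.values() if vn == M)
    return Vw >= w

print(have_quorum("read", replies, r=2, w=4))   # True: 4 votes collected
print(have_quorum("write", replies, r=2, w=4))  # False: current copies have 3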
Vote Assignment:
Let v be the total number of votes assigned to all copies. Read &
write quorum, r & w, are selected such that: r + w > v; w > v/2.
B. Prabhakaran 21
Static Voting ...
Vote Assignment ...:
Above values are determined so that there is a non-null
intersection between every read and write quorum, i.e., at least 1
current copy in any reading quorum gathered.
Write quorum is high enough to disallow simultaneous writes on 2
distinct subset of replicas.
The scheme can be modified to collect write quorums from non-
current replicas. Another modification: obsolete replicas updated.
(e.g.,) System with 4 replicas at 4 sites. Votes assigned: V1 = 1,
V2 = 1, V3 = 2, & V4 = 1.
Let disk latency at S1 = 75 ms, S2 = 750 ms, S3 = 750 ms, &
S4 = 100 ms.
If r = 1 & w = 5, then the read access time is 75 ms and the write
access time is 750 ms.
B. Prabhakaran 22
Dynamic Voting
Assume in the above example, site 3 becomes unreachable
from other sites.
Sites 1, 2, & 4 can still collect a quorum, however, Site 3
cannot.
A further partition of {1,2,4} will leave every site unable to collect a quorum.
Dynamic voting: adapt the number of votes or the set of
sites that can form a quorum, to the changing state of the
system due to sites & communication failures.
Approaches:
Majority based approach: set of sites change with system state.
This set can form a majority to allow access to replicated data.
Dynamic vote reassignment: number of votes assigned to a site
changes dynamically.
B. Prabhakaran 23
Majority-based Approach
• Figure indicates the partitions and
ABCDE
(one) merger that takes place
• Assume one vote per copy.
ABD • Static voting scheme: only partitions
CE
ABCDE, ABD, & ACE allowed
access.
AB D • Majority-based approach: one partition
can collect quorums and the other cannot.
• Partitions ABCDE, ABD, AB, A, and
A B ACE can collect quorums, others cannot.
ACE
B. Prabhakaran 24
Majority Approach ...
Notations used:
Version Number, VNi: of a replica at a site i is an integer that
counts the number of successful updates to the replica at i.
Initially set to 0.
Number of replicas updated, RUi: Number of replicas participating
in the most recent update. Initially set to the total number of
replicas.
Distinguished sites list, DSi: a variable at i that stores the IDs of
one or more sites. DSi depends on RUi.
RUi is even: DSi identifies the replica that is greater (as per the
linear ordering) than all the other replicas that participated in
the most recent update at i.
RUi is odd: DSi is nil.
RUi = 3: DSi lists the 3 replicas that participated in the most
recent update from which a majority is needed to allow access
to data.
B. Prabhakaran 25
Majority Approach: Example
Example:
5 replicas of a file are stored at sites A, B, C, D, and E. The state of
the system is shown in the table. Each replica has been updated 3
times, RUi is 5 for all sites, and DSi is nil (as RUi is odd and != 3).
A B C D E
VN 3 3 3 3 3
RU 5 5 5 5 5
DS - - - - -
B receives an update request, finds it can communicate only to A
& C. B finds that RU is 5 for the last update. Since partition ABC
has 3 of the 5 copies, B decides that it belongs to a distinguished
partition. State of the system:
A B C D E
VN 4 4 4 3 3
RU 3 3 3 5 5
DS ABC ABC ABC - -
B. Prabhakaran 26
Majority Approach: Example...
Example...:
Now, C needs to do an update and finds it can communicate only
to B. Since RUc is 3, it chooses the static voting protocol and so
DS & RU are not updated. System state:
A B C D E
VN 4 5 5 3 3
RU 3 3 3 5 5
DS ABC ABC ABC - -
Next, D makes an update, finds it can communicate with B,C, & E.
Latest version in BCDE is 5 with RU = 3. A majority from DS =
ABC is sought and is available (i.e., BC). RU is now set to 4. RU
is even, DS set to B (highest lexicographical order). System state:
A B C D E
VN 4 6 6 6 6
RU 3 4 4 4 4
DS ABC B B B B
B. Prabhakaran 27
Majority Approach: Example...
Example...:
C receives an update, finds it can communicate only with B. BC
has half the sites in the previous partition and has the distinguished
site B (DS is used to break the tie in the case of even numbers).
Update can be carried out in the partition. Resulting state:
     A    B    C    D    E
VN   4    7    7    6    6
RU   3    2    2    4    4
DS  ABC   B    B    B    B
Majority-based: Protocol
Site i receives an update and executes the following protocol:
i issues a Lock_Request to its local lock manager.
Lock granted? : i issues a Vote_Request to all the sites.
Site j receives the request: j issues a Lock_Request to its local
lock manager. Lock granted? : j sends the values of VNj, RUj, and
DSj to i.
Based on the responses received, i decides whether it belongs to
the distinguished partition (see the procedure below).
i does not belong to the distinguished partition? : i issues
Release_Lock to its local lock manager and Abort to the other sites
(which then issue Release_Lock to their local lock managers).
i belongs to the distinguished partition? : i performs the update on
the local copy (a current copy is obtained before the update if the
local copy is not current). i sends a commit message to the
participating sites with the missing updates and the values of VN,
RU, and DS, and issues a Release_Lock request to its local lock
manager.
Majority-based: Protocol
Site i receives an update and executes the following protocol...:
Site j receives the commit message: j updates its replica, VN, RU,
& DS, and sends a Release_Lock request to its local lock manager.
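Condensed into code, the coordinator-side flow might look like the following sketch; lock_manager, send, local_state, apply_update, and missing_updates are assumed helpers, and timeouts and failure handling are omitted:

```python
# Sketch of the coordinator-side flow of the majority-based update
# protocol. in_distinguished_partition and updated_variables are
# sketched after the next two slides.

def handle_update(i, all_sites, update):
    if not lock_manager.lock(i):                   # local Lock_Request
        return False
    responses = {i: local_state(i)}                # i's own (VN, RU, DS)
    for j in all_sites - {i}:
        reply = send(j, "Vote_Request")            # j replies (VNj, RUj, DSj)
        if reply is not None:                      # unreachable: no reply
            responses[j] = reply

    if not in_distinguished_partition(responses):
        for j in responses.keys() - {i}:
            send(j, "Abort")                       # j releases its local lock
        lock_manager.release(i)
        return False

    apply_update(i, update)                        # refresh local copy if stale
    M = max(vn for vn, _ru, _ds in responses.values())
    vn, ru, ds = updated_variables(set(responses), M)
    for j in responses.keys() - {i}:
        send(j, "Commit", missing_updates(i, j), (vn, ru, ds))
    lock_manager.release(i)
    return True
```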
Distinguished Partition: Let P denote the set of responding
sites.
i calculates M (the most recent version in the partition), Q (the
set of sites holding version M), and N (the number of sites that
participated in the latest update):
M = max{VNj : j in P}
Q = {j in P : VNj = M}
N = RUj, for any j in Q
|Q| > N/2 ? : i is a member of the distinguished partition.
|Q| = N/2 ? : the tie needs to be broken. Select a j in Q; if DSj
is in Q, i belongs to the distinguished partition. (If RUj is even,
DSj contains the highest-ordered site.) That is, i is in the
partition containing the distinguished site.
Majority-based: Protocol
Distinguished Partition...:
If N = 3 and P contains 2 or all 3 of the sites listed in the DS
variable of the site in Q, i belongs to the distinguished partition.
Otherwise, i does not belong to a distinguished partition.
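The three cases of this membership test fit in a short sketch; the response shapes ((VN, RU, DS) triples, with DS an empty set for nil) are assumptions based on the notation above:

```python
# Sketch of the distinguished-partition test. `responses` maps each
# site in P (including the caller) to its (VN, RU, DS) triple.

def in_distinguished_partition(responses):
    P = set(responses)
    M = max(vn for vn, _ru, _ds in responses.values())     # latest version
    Q = {j for j, (vn, _ru, _ds) in responses.items() if vn == M}
    any_q = next(iter(Q))          # members of Q share RU and DS,
    _vn, N, DS = responses[any_q]  # since they saw the same last update

    if len(Q) > N / 2:             # clear majority of the last update
        return True
    if len(Q) == N / 2:            # tie: the partition holding the
        return DS <= Q             # distinguished site (DS singleton) wins
    if N == 3:                     # majority of the 3 DS sites needed
        return len(DS & P) >= 2
    return False

# D's update in the example: P = {B, C, D, E}, latest version 5 at B and
# C with RU = 3, so |Q| = 2 > 3/2 and D is in a distinguished partition.
print(in_distinguished_partition({
    "B": (5, 3, {"A", "B", "C"}), "C": (5, 3, {"A", "B", "C"}),
    "D": (3, 5, set()),           "E": (3, 5, set())}))      # True
```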
Update: invoked when a site is ready to commit. The variables
are updated as follows:
VNi = M + 1
RUi = cardinality of P (i.e., |P|)
DSi is updated when N != 3 (since the static voting protocol is
used when N = 3):
DSi = K if RUi is even, where K is the highest-ordered site
DSi = P if RUi = 3
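Correspondingly, a sketch of this commit-time Update step. Two caveats: the site ordering is assumed (`rank` is a stand-in for the protocol's linear ordering, under which the example ranks B highest among B, C, D, E), and per the example the pure static-voting path (e.g., C updating within BC) bypasses this step, advancing only VN while RU and DS stay unchanged:

```python
# Sketch of the Update step, following the rules above. P is the set of
# participating sites, M the latest version found in P.

def updated_variables(P, M, rank=lambda site: site):
    vn = M + 1                       # VNi = M + 1
    ru = len(P)                      # RUi = |P|
    if ru % 2 == 0:
        ds = {max(P, key=rank)}      # highest-ordered participating site
    elif ru == 3:
        ds = set(P)                  # the 3 participants; majority needed next
    else:
        ds = set()                   # nil (RUi odd and != 3)
    return vn, ru, ds
```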
The majority-based protocol can have deadlocks, as it uses locks;
a deadlock detection & resolution mechanism is needed along
with it.
Failure Resilience
Resilient process: one that masks failures and guarantees
progress despite a certain number of failures.
Approaches:
Backup process:
A primary process + 1 or more backup processes.
The primary process executes while the backups remain inactive.
If the primary fails, one of the backup processes takes over the
functions of the primary process.
To facilitate this takeover, the state of the primary process is
checkpointed at appropriate intervals.
The checkpointed state is stored in a place that remains available
even if the primary process fails.
Plus:
Backup processes consume few system resources, as they are
inactive.
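As a toy illustration of this scheme (pickle and a local file stand in for storage that must survive the primary's failure; all names here are assumed):

```python
# Toy illustration of primary/backup checkpointing. In practice the
# checkpoint must live on replicated or stable storage.
import pickle

CHECKPOINT = "primary.ckpt"

def checkpoint(state):
    """Primary: persist its state at appropriate intervals."""
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

def take_over():
    """Backup: after detecting the primary's failure, resume from the
    last checkpoint (work done since that checkpoint is lost/redone)."""
    with open(CHECKPOINT, "rb") as f:
        return pickle.load(f)
```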
Failure Resilience
Resilient Approaches:
Backup process...:
Minus:
Delay in takeover, as (1) the backups need to detect the primary's
failure, and (2) the latest checkpointed state must be restored
before a backup can take over.
Replicated execution:
Several processes execute simultaneously. Service continues as
long as one of them is available.
Plus:
Provides increased reliability and availability.
Failure Resilience
Resilient Approaches...:
Replicated execution...:
Minus:
More system resources are needed, since all the processes execute.