Chapter 12: Distributed Shared Memory
A. Kshemkalyani and M. Singhal, Distributed Computing (CUP 2008)
Locking is too restrictive; concurrent access is needed. With replica management, the problem of consistency arises, so weaker consistency models (weaker than the von Neumann model) are required.
[Figure: processes issue invocations to, and receive responses from, their local memory managers, which together implement the shared memory abstraction]
Advantages/Disadvantages of DSM
Advantages:
- Shields the programmer from Send/Receive primitives.
- Single address space; simplifies passing-by-reference and passing complex data structures.
- Exploits locality-of-reference when a block is moved.
- DSM uses simpler software interfaces and cheaper off-the-shelf hardware, and is hence cheaper than dedicated multiprocessor systems.
- No memory access bottleneck, as there is no single bus.
- Large virtual memory space.
- DSM programs are portable, as they use a common DSM programming interface.
Disadvantages:
- Programmers need to understand consistency models to write correct programs.
- DSM implementations use asynchronous message-passing, and hence cannot be more efficient than message-passing implementations.
- By yielding control to the DSM manager software, programmers cannot use their own message-passing solutions.
- Semantics for concurrent access must be clearly specified.
- Semantics: replication? partial? full? read-only? write-only?
- Locations for replication (for optimization).
- If not fully replicated, determine the location of the nearest data for each access.
- Reduce delays and the number of messages needed to implement the semantics of concurrent access.
- Data is replicated or cached.
- Remote access by hardware or software.
- Caching/replication controlled by hardware or software.
- DSM controlled by memory management software, the OS, or the language run-time system.
Type of DSM                 Examples            Caching
single-bus multiprocessor   Firefly, Sequent    hardware control
switched multiprocessor     Alewife, Dash       hardware control
NUMA system                 Butterfly, CM*      software control
Page-based DSM              Ivy, Mirage         software control
Shared variable DSM         Midway, Munin       software control
Shared object DSM           Linda, Orca         software control
Memory Coherence
If process Pi executes si memory operations, there are (s1 + s2 + ... + sn)! / (s1! s2! ... sn!) possible interleavings.
The memory coherence model defines which interleavings are permitted.
Traditionally, a Read returns the value written by the "most recent" Write; but "most recent" is ambiguous in the presence of replicas and concurrent accesses.
The DSM consistency model is a contract between the DSM system and the application programmer.
[Figure: a process issues operations op1, op2, op3, ..., opk to the shared memory, each as an invocation followed by a response]
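To get a feel for how fast this count grows, here is a minimal Python sketch (the function name num_interleavings is mine) that evaluates the multinomial formula above:

```python
from math import factorial

def num_interleavings(ops_per_process):
    """Count interleavings of s_1, ..., s_n program-ordered operation
    sequences: (s_1 + ... + s_n)! / (s_1! * s_2! * ... * s_n!)."""
    total = factorial(sum(ops_per_process))
    for s in ops_per_process:
        total //= factorial(s)
    return total

print(num_interleavings([2, 2]))     # 6
print(num_interleavings([3, 3, 3]))  # 1680
```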
Strict consistency
1. A Read should return the most recent value written, per a global time axis.
For operations that overlap per the global time axis, the following must hold:
2. All operations appear to be atomic and sequentially executed.
3. All processors see the same order of events, equivalent to the global-time ordering of the non-overlapping events.
Linearizability: Implementation
Simulating a global time axis is expensive. Assume full replication and total order broadcast support.
(shared var)
int: x;

(1) When the Memory Manager receives a Read or Write from the application:
(1a) total_order_broadcast the Read or Write request to all processors;
(1b) await own request that was broadcast;
(1c) perform pending response to the application as follows:
(1d)   case Read: return value from local replica;
(1e)   case Write: write to local replica and return ack to application.
(2) When the Memory Manager receives a total_order_broadcast(Write, x, val) from the network:
(2a) write val to local replica of x.
(3) When the Memory Manager receives a total_order_broadcast(Read, x) from the network:
(3a) no operation.
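A minimal Python sketch of this structure is shown below. It is not the book's implementation: the class names are mine, total order broadcast is faked with one global lock that fixes a single delivery order, and one application thread per memory manager is assumed.

```python
import threading
from collections import defaultdict
from queue import Queue

class TotalOrderBroadcast:
    """Toy total-order broadcast: one global lock fixes a single delivery
    order, and every registered inbox receives messages in that order."""
    def __init__(self):
        self.members, self.lock = [], threading.Lock()

    def register(self, inbox):
        self.members.append(inbox)

    def broadcast(self, msg):
        with self.lock:                       # one common order for all
            for inbox in self.members:
                inbox.put(msg)

class LinMemoryManager:
    """Every Read and Write is totally-order broadcast; the caller blocks
    until its own request is delivered back, then uses the local replica."""
    def __init__(self, pid, tob):
        self.pid = pid
        self.replica = defaultdict(int)
        self.done = {}                        # per-request completion events
        self.inbox = Queue()
        self.tob = tob
        tob.register(self.inbox)
        threading.Thread(target=self._deliver, daemon=True).start()

    def _deliver(self):
        while True:
            op, x, val, sender, req = self.inbox.get()
            if op == 'write':
                self.replica[x] = val         # (2a) apply every Write
            # (3a) a Read from another process is a no-op
            if sender == self.pid:
                self.done.pop(req).set()      # own request delivered back

    def _issue(self, op, x, val=None):
        ev, req = threading.Event(), object()
        self.done[req] = ev
        self.tob.broadcast((op, x, val, self.pid, req))
        ev.wait()                             # (1b) await own broadcast

    def write(self, x, val):
        self._issue('write', x, val)          # (1e) ack after delivery

    def read(self, x):
        self._issue('read', x)
        return self.replica[x]                # (1d) value from local replica

tob = TotalOrderBroadcast()
m1, m2 = LinMemoryManager(1, tob), LinMemoryManager(2, tob)
m1.write('x', 5)
print(m2.read('x'))                           # 5
```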
When a Read is simulated at the other processes, it is a no-op. Why, then, do Reads participate in the total order broadcast? Because Reads need to be serialized with respect to other Reads and to all Write operations. See the counter-example in which Reads do not participate in the total order broadcast.
[Figure: counter-example execution in which a Read(x,0) returns the old value because it bypassed the total order broadcast]
Sequential Consistency
The result of any execution is the same as if all operations of the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in its local program order.
- Any interleaving of the operations from the different processors is possible, but all processors must see the same interleaving.
- Even if two operations from different processors (on the same or different variables) do not overlap on a global time scale, they may appear in reverse order in the common sequential order seen by all.
- See the examples used for linearizability.
Sequential Consistency
Only Writes participate in total order broadcasts. Reads do not, because:
- all consecutive operations by the same processor are ordered in that same order (no pipelining), and
- Read operations by different processors are independent of each other, and need to be ordered only with respect to the Write operations.
A direct simplification of the linearizability (LIN) algorithm. Reads are executed atomically; not so for Writes. Suitable for Read-intensive programs.
(1) When the Memory Manager at Pi receives a Read or Write from the application:
(1a) case Read: return value from local replica;
(1b) case Write(x, val): total_order_broadcast_i(Write(x, val)) to all processors, including itself.

(2) When the Memory Manager at Pi receives a total_order_broadcast_j(Write, x, val) from the network:
(2a) write val to local replica of x;
(2b) if i = j then return ack to application.
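Below is a minimal Python sketch of this local-Read variant (again my own class name, with the same toy single-lock stand-in for total order broadcast as in the linearizability sketch above; it assumes all managers are created before any operation is issued):

```python
import threading
from collections import defaultdict
from queue import Queue

class SCMemoryManager:
    """Sequential-consistency sketch: Reads are served locally with no
    broadcast; only Writes go through the (toy) total-order broadcast,
    and the writer blocks until its own Write is delivered back."""
    managers = []                              # all managers
    order_lock = threading.Lock()              # fixes one delivery order

    def __init__(self, pid):
        self.pid = pid
        self.replica = defaultdict(int)
        self.inbox = Queue()
        self.ack = threading.Event()
        SCMemoryManager.managers.append(self)
        threading.Thread(target=self._deliver, daemon=True).start()

    def _broadcast(self, msg):
        with SCMemoryManager.order_lock:       # one common delivery order
            for m in SCMemoryManager.managers:
                m.inbox.put(msg)

    def _deliver(self):
        while True:
            x, val, sender = self.inbox.get()
            self.replica[x] = val              # (2a) apply the Write
            if sender == self.pid:
                self.ack.set()                 # (2b) ack own Write

    def write(self, x, val):
        self.ack.clear()
        self._broadcast((x, val, self.pid))    # (1b) broadcast Writes only
        self.ack.wait()

    def read(self, x):
        return self.replica[x]                 # (1a) purely local Read
```

Note that a Read issued right after another process's Write may still return the old value; that is permitted under sequential consistency, though not under linearizability.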
(3) When the Memory Manager at Pi receives a total_order_broadcast_j(Write, x, val) from the network:
(3a) write val to local replica of x;
(3b) if i = j then
(3c)   counter ← counter − 1;
(3d)   if (counter = 0 and any Reads are pending) then
(3e)     perform pending responses for the Reads to the application.

Locally issued Writes get acked immediately. Local Reads are delayed until the locally preceding Writes have been acked. All locally issued Writes are pipelined.
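The slide shows only part (3) of this local-Write (pipelined) variant; the sketch below reconstructs parts (1)-(2) from the accompanying prose, so that reconstruction is an assumption on my part. A Write is acknowledged immediately and counted in `counter`; a Read blocks until the process's own outstanding Writes have been delivered:

```python
import threading
from collections import defaultdict
from queue import Queue

class LocalWriteSCManager:
    """Sequential consistency with pipelined local Writes: a Write is
    acked at once, `counter` tracks the process's undelivered Writes,
    and a Read is served only when counter has dropped back to 0."""
    managers = []
    order_lock = threading.Lock()              # toy total-order broadcast

    def __init__(self, pid):
        self.pid = pid
        self.replica = defaultdict(int)
        self.counter = 0                       # locally unacked Writes
        self.cv = threading.Condition()
        self.inbox = Queue()
        LocalWriteSCManager.managers.append(self)
        threading.Thread(target=self._deliver, daemon=True).start()

    def _broadcast(self, msg):
        with LocalWriteSCManager.order_lock:
            for m in LocalWriteSCManager.managers:
                m.inbox.put(msg)

    def _deliver(self):
        while True:
            x, val, sender = self.inbox.get()
            with self.cv:
                self.replica[x] = val          # (3a)
                if sender == self.pid:
                    self.counter -= 1          # (3c)
                    if self.counter == 0:
                        self.cv.notify_all()   # (3d)-(3e) release pending Reads

    def write(self, x, val):
        with self.cv:
            self.counter += 1                  # pipelined: ack immediately
        self._broadcast((x, val, self.pid))

    def read(self, x):
        with self.cv:
            while self.counter > 0:            # wait for own earlier Writes
                self.cv.wait()
            return self.replica[x]
```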
Causal Consistency
In SC, all Write ops should be seen in common order. For causal consistency, only causally related Writes should be seen in common order.
[Figure: three example executions at P1-P4 using operations W(x,.) and R(x,.):
(a) sequentially consistent and causally consistent;
(b) causally consistent but not sequentially consistent;
(c) not causally consistent but PRAM consistent]

Total order broadcasts (used for SC) also provide causal order in shared memory systems.
Slow Memory
Only Write operations issued by the same processor and to the same memory location must be seen by others in that order.
[Figure: example executions at P1 and P2 using operations W(x,.), W(y,.), R(x,.), R(y,.); panel (a) shows an execution that is slow memory consistent but not PRAM consistent]
[Figure: hierarchy of consistency models, from most to least strict: Linearizability / atomic consistency / strict consistency, sequential consistency, causal consistency, pipelined RAM (PRAM), slow memory, no consistency model]
Weak consistency:
- All Writes are propagated to other processes, and all Writes done elsewhere are brought locally, at a synchronization instruction.
- Accesses to synchronization variables are sequentially consistent.
- Access to a synchronization variable is not permitted unless all Writes elsewhere have completed.
- No data access is allowed until all previous synchronization variable accesses have been performed.
- Drawback: cannot tell whether the process is beginning access to shared variables (entering the CS) or has finished access to shared variables (exiting the CS).
Release Consistency
- Acquire indicates that a CS is about to be entered. Hence, all Writes from other processors should be locally reflected at this instruction.
- Release indicates that access to the CS is being completed. Hence, all updates made locally should be propagated to the replicas at other processors.
- Acquire and Release can be defined on a subset of the variables.
- If no CS semantics are used, then Acquire and Release act as barrier synchronization variables.
- Lazy release consistency: propagate updates on demand, not in the PRAM way.
Entry Consistency
- Each ordinary shared variable is associated with a synchronization variable (e.g., a lock or barrier).
- On an Acquire/Release of a synchronization variable, the consistency actions are performed only on those ordinary variables guarded by that synchronization variable.
Mutual exclusion
Role of line (1e)? Wait for the other processes' timestamp choices to stabilize.
Role of line (1f)? Wait for any higher-priority process (lexicographically lower timestamp) to enter the CS first.
- Bounded waiting: Pi can be overtaken by other processes at most once (each).
- Progress: the lexicographic order is a total order; the process with the lowest timestamp in lines (1d)-(1g) enters the CS.
- Space complexity: lower bound of n registers.
- Time complexity: O(n) time for the Bakery algorithm.
- Lamport's fast mutual exclusion algorithm takes O(1) time in the absence of contention; however, it compromises on bounded waiting. It uses the sequence W(x) R(y) W(y) R(x), which is necessary and sufficient to check for contention and safely enter the CS.
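For reference, here is a minimal Python sketch of Lamport's Bakery algorithm that the discussion above refers to (N, the variable names, and the worker loop are my own choices; CPython's list accesses here stand in for the shared registers the algorithm assumes):

```python
import threading

N = 3                                   # number of processes (assumed)
choosing = [False] * N                  # P_i is picking its token
number = [0] * N                        # 0 means "not competing"

def lock(i):
    """Entry section of the Bakery algorithm for process i."""
    choosing[i] = True
    number[i] = 1 + max(number)         # take a token larger than any seen
    choosing[i] = False
    for j in range(N):
        while choosing[j]:              # (1e)-style wait: let P_j's choice stabilize
            pass
        while number[j] != 0 and (number[j], j) < (number[i], i):
            pass                        # (1f)-style wait: defer to higher-priority P_j

def unlock(i):
    number[i] = 0                       # exit section

counter = 0
def worker(i):
    global counter
    for _ in range(200):
        lock(i)
        counter += 1                    # critical section
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                          # 600: no update was lost
```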
Examine all possible race conditions in algorithm code to analyze the algorithm.
repeat
(1) Pi executes the following for the entry section:
(1a)   blocked ← true;
(1b)   repeat
(1c)     Swap(Reg, blocked);
(1d)   until blocked = false;
(2) Pi executes the critical section (CS) after the entry section.
(3) Pi executes the following exit section after the CS:
(3a)   Reg ← false;
(4) Pi executes the remainder section after the exit section.
until false;
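A minimal Python rendering of the same loop is sketched below; Python has no hardware Swap instruction, so the HardwareRegister class (my name) fakes one with a tiny internal lock:

```python
import threading

class HardwareRegister:
    """Emulates a register with an atomic Swap instruction (in real
    hardware this is one instruction; here a small lock fakes atomicity)."""
    def __init__(self, value=False):
        self._value = value
        self._guard = threading.Lock()

    def swap(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old

Reg = HardwareRegister(False)          # shared lock register

def entry_section():
    blocked = True                     # (1a)
    while blocked:                     # (1b)-(1d) spin until Swap returns False
        blocked = Reg.swap(True)       # (1c) atomically exchange

def exit_section():
    Reg.swap(False)                    # (3a) release: Reg <- false
```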
repeat
(1) Pi executes the following for the entry section:
(1a)   waiting[i] ← true;
(1b)   blocked ← true;
(1c)   while waiting[i] and blocked do
(1d)     blocked ← Test&Set(Reg);
(1e)   waiting[i] ← false;
(2) Pi executes the critical section (CS) after the entry section.
(3) Pi executes the following exit section after the CS:
(3a)   next ← (i + 1) mod n;
(3b)   while next ≠ i and waiting[next] = false do
(3c)     next ← (next + 1) mod n;
(3d)   if next = i then
(3e)     Reg ← false;
(3f)   else waiting[next] ← false;
(4) Pi executes the remainder section after the exit section.
until false;
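And a corresponding Python sketch of the Test&Set variant with bounded waiting (the register class again only emulates the hardware primitive, and N processes is an assumed parameter):

```python
import threading

N = 4                                        # number of processes (assumed)

class TestAndSetRegister:
    """Emulates an atomic Test&Set: return the old value and set True."""
    def __init__(self):
        self._value = False
        self._guard = threading.Lock()

    def test_and_set(self):
        with self._guard:
            old, self._value = self._value, True
            return old

    def clear(self):
        self._value = False

Reg = TestAndSetRegister()
waiting = [False] * N

def entry_section(i):
    waiting[i] = True                        # (1a)
    blocked = True                           # (1b)
    while waiting[i] and blocked:            # (1c) spin until lock or baton
        blocked = Reg.test_and_set()         # (1d)
    waiting[i] = False                       # (1e)

def exit_section(i):
    nxt = (i + 1) % N                        # (3a)
    while nxt != i and not waiting[nxt]:     # (3b) find the next waiting process
        nxt = (nxt + 1) % N                  # (3c)
    if nxt == i:
        Reg.clear()                          # (3e) nobody waiting: free the lock
    else:
        waiting[nxt] = False                 # (3f) hand the CS directly to P_nxt
```

Handing the CS to the next waiting process in round-robin order is what provides bounded waiting, which a bare Test&Set spin lock does not.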
Wait-freedom
- Synchronizing asynchronous processes using busy-wait, locking, critical sections, semaphores, conditional waits, etc. means that the crash or delay of one process can prevent others from progressing.
- Wait-freedom guarantees that any process can complete any synchronization operation in a finite number of low-level steps, irrespective of the execution speed of other processes.
- A wait-free implementation of a concurrent object is one in which any process can complete an operation on it in a finite number of steps, irrespective of whether other processes crash or are slow.
- Not all synchronization problems have wait-free solutions, e.g., the producer-consumer problem.
- An (n − 1)-resilient system is wait-free.
Safe register
A Read that does not overlap with a Write returns the most recent value written to that register. A Read that overlaps with a Write returns any one of the possible values that the register could ever contain.
[Figure: example execution at P1, P2, P3 with Writes of x = 4 and x = 6 overlapping concurrent Reads, illustrating the register semantics]
Regular register
Safe register + if a Read overlaps with a Write, the value returned is either the value before the Write operation or the value written by the Write.
Atomic register
Regular register + linearizable to a sequential register
[Figure: constructing a register R from component registers R1 ... Rq: a Write to R becomes Writes to the individual Ri, and a Read from R is served by reading the Ri]
q = m, the largest integer to be stored. The integer is stored in unary notation. P0 is the writer; P1 to Pn are readers, each of which can read all m registers. Readers scan from left to right looking for the first 1; the writer writes 1 into Rval and then zeroes out the lower entries from right to left. Complexity: m binary registers, O(m) time.
Construction 5: Algorithm
(shared variables)
boolean MRSW regular registers R1 ... R_{m−1} ← 0; R_m ← 1; // R_i readable by all, writable by P0
(local variables)
integer: count;

(1) Write(R, val) executed by writer P0:
(1a) R_val ← 1;
(1b) for count = val − 1 down to 1 do
(1c)   R_count ← 0.

(2) Read_i(R, val) executed by P_i, 1 ≤ i ≤ n:
(2a) count ← 1;
(2b) while R_count = 0 do
(2c)   count ← count + 1;
(2d) val ← count;
(2e) return(val).
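A minimal Python sketch of the same construction (function names are mine; a plain list stands in for the m boolean MRSW regular registers, and m = 8 is an arbitrary assumed bound):

```python
m = 8                                   # largest value representable (assumed)

# Unary representation: one boolean register per possible value, 1-indexed.
R = [0] * (m + 1)                       # R[1..m]
R[m] = 1                                # initial value of the big register is m

def write(val):
    """Writer P0: set R[val] first, then zero out lower entries right to left."""
    R[val] = 1                          # (1a)
    for count in range(val - 1, 0, -1): # (1b)-(1c)
        R[count] = 0

def read():
    """Any reader: return the index of the first 1, scanning left to right."""
    count = 1                           # (2a)
    while R[count] == 0:                # (2b)-(2c)
        count += 1
    return count                        # (2d)-(2e)
```

Writing the 1 before clearing the lower entries is what keeps the construction regular: a concurrent reader always finds some 1, and it corresponds either to the old or to the new value.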
[Figure: registers R1 ... Rm with R_val set to 1 and the lower entries zeroed out.
Construction 5 (bool MRSW regular to int MRSW regular): Read(R) scans for the first "1" and returns its index.
Construction 6 (bool MRSW atomic to int MRSW atomic): Read(R) scans for the first "1", then scans backwards and updates the result to the lowest-ranked register containing a "1".]
[Figure: example execution: writer Pa performs Write1_a(R,2), i.e., Write(R2,1), followed by Write2_a(R,3), while reader Pb observes Read(R3,1), Read(R1,0), Read(R2,1)]
Construction 6: Algorithm
(shared variables)
boolean MRSW regular registers R1 ... R_{m−1} ← 0; R_m ← 1. // R_i readable by all; writable by P0
(local variables)
integer: count, temp;

(1) Write(R, val) executed by P0:
(1a) R_val ← 1;
(1b) for count = val − 1 down to 1 do
(1c)   R_count ← 0.

(2) Read_i(R, val) executed by P_i, 1 ≤ i ≤ n:
(2a) count ← 1;
(2b) while R_count = 0 do
(2c)   count ← count + 1;
(2d) val ← count;
(2e) for temp = count down to 1 do
(2f)   if R_temp = 1 then
(2g)     val ← temp;
(2h) return(val).
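The Python sketch below mirrors Construction 6 (the names and m = 8 are again my assumptions); the only difference from Construction 5 is the backward scan in the Read, which re-checks the lower-ranked registers so that overlapping Reads cannot observe the writer's values in an inverted order:

```python
m = 8                                   # largest value representable (assumed)
R = [0] * (m + 1)                       # 1-indexed boolean registers R[1..m]
R[m] = 1                                # initial value is m

def write(val):
    R[val] = 1                          # (1a)
    for count in range(val - 1, 0, -1): # (1b)-(1c) zero out right to left
        R[count] = 0

def read():
    count = 1                           # (2a)
    while R[count] == 0:                # (2b)-(2c) forward scan for the first 1
        count += 1
    val = count                         # (2d)
    for temp in range(count, 0, -1):    # (2e) backward scan toward R[1]
        if R[temp] == 1:                # (2f)
            val = temp                  # (2g) keep the lowest-ranked 1 seen
    return val                          # (2h)
```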
[Figure: data layout for Construction 8: SRSW registers R1 ... Rn, one per reader, plus the n x n array Last_Read_Values[1..n, 1..n] of SRSW registers, whose entry (i, j) carries the (data, seq_no) pair that reader Pi last returned, for reader Pj to consult]
Construction 8: Algorithm
(shared variables)
SRSW atomic registers of type (data, seq_no), where data, seq_no are integers: R1 ... Rn ← (0, 0);
SRSW atomic register array of type (data, seq_no), where data, seq_no are integers: Last_Read_Values[1..n, 1..n] ← (0, 0);
(local variables)
array of (data, seq_no): Last_Read[0..n];
integer: seq, count;

(1) Write(R, val) executed by writer P0:
(1a) seq ← seq + 1;
(1b) for count = 1 to n do
(1c)   R_count ← (val, seq). // write to each SRSW register

(2) Read_i(R, val) executed by P_i, 1 ≤ i ≤ n:
(2a) (Last_Read[0].data, Last_Read[0].seq_no) ← R_i; // Last_Read[0] stores the value of R_i
(2b) for count = 1 to n do // read into Last_Read[count] the latest value stored for P_i by P_count
(2c)   (Last_Read[count].data, Last_Read[count].seq_no) ← (Last_Read_Values[count, i].data, Last_Read_Values[count, i].seq_no);
(2d) identify j such that for all k ≠ j, Last_Read[j].seq_no ≥ Last_Read[k].seq_no;
(2e) for count = 1 to n do
(2f)   (Last_Read_Values[i, count].data, Last_Read_Values[i, count].seq_no) ← (Last_Read[j].data, Last_Read[j].seq_no);
(2g) val ← Last_Read[j].data;
(2h) return(val).
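A compact Python sketch of Construction 8 follows (n = 3 and all names are my assumptions; plain Python cells stand in for the SRSW atomic registers, so this only illustrates the bookkeeping, not a real wait-free implementation):

```python
n = 3                                             # number of readers (assumed)

# SRSW atomic registers, modelled as cells holding (data, seq_no) pairs.
R = [(0, 0)] * (n + 1)                            # R[i]: written by P0, read by P_i
Last_Read_Values = [[(0, 0)] * (n + 1) for _ in range(n + 1)]
seq = 0                                           # writer's sequence number

def write(val):
    """Writer P0: stamp the value and push it to every reader's register."""
    global seq
    seq += 1                                      # (1a)
    for count in range(1, n + 1):                 # (1b)-(1c)
        R[count] = (val, seq)

def read(i):
    """Reader P_i: take the freshest of its own register and what the
    other readers last reported, then report that choice to everyone."""
    last_read = [None] * (n + 1)
    last_read[0] = R[i]                           # (2a) own register
    for count in range(1, n + 1):                 # (2b)-(2c) other readers' reports
        last_read[count] = Last_Read_Values[count][i]
    j = max(range(n + 1), key=lambda k: last_read[k][1])   # (2d) highest seq_no
    for count in range(1, n + 1):                 # (2e)-(2f) report the choice
        Last_Read_Values[i][count] = last_read[j]
    return last_read[j][0]                        # (2g)-(2h)
```

Reporting the returned value through Last_Read_Values is what prevents a later Read by another reader from returning an older value than an earlier Read, which is the new-old inversion that would otherwise break atomicity.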
[Figure: the case changed[j] = 2 in Pi's SCAN: Pj writes during this period]

(b) Pj's DoubleCollect is nested within Pi's SCAN. Either that DoubleCollect is successful, or Pj borrowed its snapshot from Pk's DoubleCollect nested within Pj's SCAN; and so on recursively, up to n times.