Bus-Based Multiprocessor, a.k.a. Snoopy-Bus Architecture
[Figure: processors (P), each with a private cache ($), attached to memory through a shared bus]
Shared Cache Multiprocessor
[Figure: processors (P) sharing one interleaved cache ($), which sits in front of memory]
Dance Hall Multiprocessor
[Figure: processors (P) with private caches ($) on one side, memory modules on the other]
• Large-scale machines
• E.g., NYU Ultracomputer, IBM RP3
Distributed Shared Memory (DSM)
[Figure: processors (P), each with a cache ($) and a local memory]
Cache Coherence Problem
[Figure: timeline for P1 and P2, each with a cache ($); X=0 initially. Operations in time order: Load X, Load X, Store 4, X, Load X]
• What value of X is in P1 and P2’s caches?
• What value of X is in memory?
Cache-coherence problem
Proc0: (1) Ld X; (5) St X, 1 ...
Proc1: (3) Ld X; (7) Ld X
Cache0: (2) X=0; (6) X=1
Cache1: (4) X=0; (8) X=0
Memory: X=0
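The numbered steps can be replayed with a minimal sketch. It assumes two private write-back caches with no coherence protocol at all; the names `load`, `store`, `cache0`, and `cache1` are illustrative and not part of the slides.

```python
# Two private write-back caches with NO coherence protocol (illustrative only).
memory = {"X": 0}
cache0, cache1 = {}, {}

def load(cache, addr):
    """Return the cached value, fetching from memory on a miss."""
    if addr not in cache:
        cache[addr] = memory[addr]
    return cache[addr]

def store(cache, addr, value):
    """Write-back store: update only the local cache, not memory."""
    cache[addr] = value

load(cache0, "X")            # steps 1-2: Proc0 loads X, Cache0 holds X=0
load(cache1, "X")            # steps 3-4: Proc1 loads X, Cache1 holds X=0
store(cache0, "X", 1)        # steps 5-6: Proc0 stores 1, Cache0 holds X=1
stale = load(cache1, "X")    # steps 7-8: Proc1 hits its old copy

print("Cache0:", cache0["X"], "Cache1:", stale, "Memory:", memory["X"])
```

Proc1's second load hits in its own cache, so it never observes Proc0's store, and memory is not updated either.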
Write-through Coherence Protocol
Proc0: (1) Ld X; (5) St X, 1
Proc1: (3) Ld X; (7) Ld X
Cache0: (2) X=0; (6) X=1
Cache1: (4) X=0; (5c) X=0; (8) X=1
Bus: (5a) Write X
Memory: (0) X=0; (5b) X=1
Write-through State Transition Diagram
[Figure: state transition diagram with edge labels PrRd/—, PrWr/BusWr, PrRd/BusRd, BusWr/—]
• PrRd, PrWr
• BusRd (go to memory get data)
• BusWr (write data to memory/invalidate cached copies)
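A minimal sketch of a write-through invalidation protocol for a single block X follows. It assumes two states, valid (V) and invalid (I), and assumes that a write in state I goes to the bus without allocating the block; the class and method names are illustrative.

```python
class Bus:
    """Shared bus plus memory for one block X."""
    def __init__(self):
        self.memory, self.caches, self.log = 0, [], []

    def bus_rd(self):
        self.log.append("BusRd")
        return self.memory

    def bus_wr(self, writer, value):
        self.log.append("BusWr")
        self.memory = value                  # memory is always kept up to date
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_bus_wr()

class WriteThroughCache:
    """One cache holding block X in state V (valid) or I (invalid)."""
    def __init__(self, bus):
        self.state, self.value, self.bus = "I", None, bus
        bus.caches.append(self)

    def pr_rd(self):
        if self.state == "I":                # PrRd/BusRd: fetch from memory
            self.value = self.bus.bus_rd()
            self.state = "V"
        return self.value                    # PrRd/-: hit in V

    def pr_wr(self, value):
        if self.state == "V":
            self.value = value               # keep the local copy current on a hit
        self.bus.bus_wr(self, value)         # PrWr/BusWr: every write uses the bus

    def snoop_bus_wr(self):
        self.state = "I"                     # BusWr/-: drop our copy

bus = Bus()
c0, c1 = WriteThroughCache(bus), WriteThroughCache(bus)
c0.pr_rd(); c1.pr_rd()       # both caches load X=0
c0.pr_wr(1)                  # write goes to memory, the other copy is invalidated
result = c1.pr_rd()          # the second cache misses and reloads from memory
print(result, bus.memory, bus.log)
```

Because every store appears on the bus, the bus order is the write order, at the cost of one bus transaction per store.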
Problem with Write-Through
High bandwidth requirements
• Every write from every processor goes to shared bus and memory
• Consider a 200 MHz, 1 CPI processor where 15% of instructions are 8-byte stores
• Each processor generates 30M stores, or 240 MB of data, per second
• A 1 GB/s bus can support only about 4 processors without saturating
• Write-through especially unpopular for SMPs
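The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 GB/s = 1000 MB/s) and counts only store data, ignoring address and command overhead; the variable names are illustrative.

```python
# Write-through bus traffic estimate, using the numbers from the slide.
clock_hz = 200e6          # 200 MHz processor
cpi = 1.0                 # one cycle per instruction
store_fraction = 0.15     # 15% of instructions are stores
store_bytes = 8           # each store writes 8 bytes
bus_bytes_per_s = 1e9     # 1 GB/s shared bus (decimal units assumed)

instr_per_s = clock_hz / cpi
stores_per_s = instr_per_s * store_fraction          # about 30 million stores/s
traffic_per_cpu = stores_per_s * store_bytes         # about 240 MB/s per processor
max_cpus = int(bus_bytes_per_s // traffic_per_cpu)   # whole processors that fit

print(f"{stores_per_s / 1e6:.0f}M stores/s, {traffic_per_cpu / 1e6:.0f} MB/s per processor")
print(f"bus saturates beyond {max_cpus} processors")
```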
Write-back caches absorb most writes as cache hits
• Write hits don’t go on bus
• But now how do we ensure write propagation and serialization?
• Need more sophisticated protocols: large design space
Solution?
Write-back-based protocols
Design Space for Snooping Protocols
No need to change processor, main memory, cache …
• Extend cache controller and exploit bus (provides serialization)
Focus on protocols for write-back caches
Design space
• Invalidation versus Update-based protocols
– On write invalidate or update other copies
• Set of states
– Block OWNER:
• thus far data comes only from memory which is always updated
• owner is the one that is responsible for supplying data
Invalidate versus Update
Basic question of program behavior
• Is a block written by one processor read by others before it is
rewritten?
Invalidation:
• Yes => readers will take a miss
• No => multiple writes without additional traffic
– and clears out copies that won’t be used again
Update:
• Yes => readers will not miss if they had a copy previously
– single bus transaction to update all copies
• No => multiple useless updates, even to dead copies
Need to look at program behavior and hardware complexity
Invalidation protocols much more popular (more later)
• Some systems provide both, or even hybrid
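The trade-off can be made concrete by counting bus transactions for one block under each policy. This is a deliberately simplified model, not any real protocol: a write by the sole holder is assumed to be silent under invalidation, and an update is assumed to cost exactly one transaction. The function name and trace format are illustrative.

```python
def bus_transactions(trace, policy):
    """Count bus transactions for one block under a simplified policy."""
    holders, count = set(), 0
    for proc, op in trace:
        if op == "R":
            if proc not in holders:          # read miss
                count += 1
                holders.add(proc)
        elif policy == "invalidate":
            if holders != {proc}:            # must gain exclusive ownership
                count += 1
                holders = {proc}             # other copies are invalidated
        else:                                # policy == "update"
            if proc not in holders:          # write miss fetches the block
                count += 1
                holders.add(proc)
            if len(holders) > 1:             # broadcast new value to sharers
                count += 1
    return count

# "Yes" case: each value written by P0 is read by P1 before the next write.
producer_consumer = [(0, "W"), (1, "R")] * 4
# "No" case: P1 reads once, then P0 writes repeatedly with no readers.
repeated_writes = [(1, "R")] + [(0, "W")] * 4

for name, trace in [("producer-consumer", producer_consumer),
                    ("repeated writes", repeated_writes)]:
    print(name,
          "invalidate:", bus_transactions(trace, "invalidate"),
          "update:", bus_transactions(trace, "update"))
```

In this model the update policy wins on the producer-consumer trace and loses on the repeated-writes trace, which matches the Yes/No cases listed above.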
Basic MSI Writeback Inval. Protocol
States
• Invalid (I)
• Shared (S): one or more
• Dirty or Modified (M): one only
Processor Events:
• PrRd (read)
• PrWr (write)
Bus Transactions
• BusRd: asks for copy with no intent to modify
• BusRdX: asks for copy with intent to modify
• BusWB (shown as Flush later on): updates memory
Actions
• Update state, perform bus transaction, flush value onto bus
MSI: Behavior
(1) Read hit: use local copy; no state change (states S or M)
(2) Read miss:
- if M copy exists, it is flushed onto the bus and memory;
all copies set to S
- otherwise, access memory; state set to S
(3) Write hit:
- if local copy is S; request exclusive copy; other copies
are invalidated; local copy set to M
- if local copy M; just write locally; no state changes
(4) Write miss:
- generate read excl. request; all other copies are
invalidated; if M copy exists it is flushed; set local state to M
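The four cases can be written as a transition table. The sketch below tracks one block across several caches; the table layout and the `access` helper are illustrative, and a flush is only recorded as an action rather than modelled as a data transfer.

```python
# (state, event) -> (next state, bus action); "-" means no action.
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", "-"),
    ("S", "PrWr"):   ("M", "BusRdX"),
    ("S", "BusRd"):  ("S", "-"),
    ("S", "BusRdX"): ("I", "-"),
    ("M", "PrRd"):   ("M", "-"),
    ("M", "PrWr"):   ("M", "-"),
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}

def access(states, proc, event):
    """Apply a processor event at cache `proc`; other caches snoop the bus."""
    states[proc], action = MSI[(states[proc], event)]
    actions = [action]
    if action in ("BusRd", "BusRdX"):
        for other in range(len(states)):
            if other != proc and states[other] != "I":
                states[other], reply = MSI[(states[other], action)]
                if reply != "-":
                    actions.append(reply)
    return actions

states = ["I", "I"]
for proc, event in [(0, "PrRd"), (1, "PrRd"), (0, "PrWr"), (1, "PrRd")]:
    print(f"P{proc} {event}: {access(states, proc, event)} -> {states}")
```

The last access shows case (2): the M copy is flushed and both copies end in S.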
Simple MSI Protocol: SGI 4D
Write-invalidate for write-back caches
PrRd: Processor read (load)
PrWr: Processor write (store)
BusRd: ReadOnly copy due to a PrRd
BusRdX: Writable copy due to a PrWr
BusWB: Writing back a block
BusInv: Invalidate other copies
BusCache: Cache-to-cache block transfer
BusUpdate: One/Two word update
Simple MSI Protocol: SGI 4D
I - Invalid
S - Shared
M - Modified
[Figure: state transition diagram over I, S, and M. Legible edge labels: PrRd/-, BusRd/-, PrWr/-, PrWr/BusRdX, BusRd/BusWB]
MSI: State Transition Diagram
M: PrRd/—, PrWr/— (stay in M); BusRd/Flush (to S); BusRdX/Flush (to I)
S: PrRd/—, BusRd/— (stay in S); PrWr/BusRdX (to M); BusRdX/— (to I)
I: PrRd/BusRd (to S); PrWr/BusRdX (to M)
Lower-level Protocol Choices
BusRd observed in M state: what transition to make, S or I?
Depends on expectations of access patterns
• S: assumes that I’ll read again soon, rather than that another processor will write
– good for mostly read data
– what about “migratory” data?
• I read and write, then you read and write, then X reads and writes...
• better to go to I state, so I don’t have to be invalidated on your write
• Synapse transitioned to I state
• Sequent Symmetry and MIT Alewife use adaptive protocols
Choices can affect performance of memory system
MESI (4-state) Invalidation Protocol
MESI State Transition Diagram
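M: PrRd/—, PrWr/— (stay in M); BusRd/Flush (to S); BusRdX/Flush (to I)
E: PrRd/— (stay in E); PrWr/— (to M); BusRd/Flush (to S); BusRdX/Flush (to I)
S: PrRd/—, BusRd/Flush (stay in S); PrWr/BusRdX (to M); BusRdX/Flush (to I)
I: PrRd/BusRd(!S) (to E); PrRd/BusRd(S) (to S); PrWr/BusRdX (to M)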
Read HIT:
• read to any state other than INVALID
• never generates BUS transaction
Write MISS:
• write to INVALID, Non-existent, or READ-ONLY (SHARED, …)
• generates BUS transaction
Write HIT
• write to READ-WRITE state (Modified,...)
MESI: behavior
(1) Read hit: use local copy; no state change (can be S, M or
E)
(2) Read miss:
- if no other copy exists get from memory; set local copy to E
- if E or S copies exist; get from memory (or the cache w/ E);
set all copies to S
- if M copy exists; get that (could be via memory); set both
copies to S
MESI: behavior
(3) Write hit:
- if local copy in E or M state; write locally; set state to M
- if local copy in S; invalidate all other copies; set state to M
(4) Write miss:
- if no other copy; get from memory; set state to M
- if M copy exists; flush that copy; set state to M
- if E or S copies exist; invalidate them; set state to M
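The MESI cases differ from MSI mainly in how a read miss picks E or S and in the silent E-to-M upgrade. A minimal sketch, assuming the states of the other caches are passed in as a list and that the shared signal is asserted whenever any of them holds a valid copy; the function names are illustrative.

```python
def mesi_read_miss(others):
    """Return (new local state, new states of the other caches)."""
    shared = any(s != "I" for s in others)           # shared signal asserted?
    new_others = ["S" if s != "I" else "I" for s in others]
    return ("S" if shared else "E"), new_others

def mesi_write(local, others):
    """Return (new local state, new other states, bus transaction or None)."""
    if local in ("M", "E"):
        return "M", list(others), None               # write locally, no bus traffic
    # Write hit in S or write miss in I: BusRdX invalidates every other copy.
    return "M", ["I"] * len(others), "BusRdX"

print(mesi_read_miss(["I", "I"]))     # no other copy: load as E
print(mesi_read_miss(["M", "I"]))     # M copy exists: both end in S
print(mesi_write("E", ["I", "I"]))    # silent upgrade from E to M
print(mesi_write("S", ["S", "I"]))    # shared copy must invalidate the others
```

The E state pays off in the third call: a block read and then written by a single processor needs one bus transaction instead of two.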
Lower-level Protocol Choices
Who supplies data on a miss when the block is not in M state: memory or a
cache?
Original Illinois MESI: cache, since it was assumed to be faster than
memory
• Cache-to-cache sharing
Not necessarily true in modern systems
• Intervening in another cache more expensive than getting from
memory
Lower-level Protocol Choices
Cache-to-cache sharing also adds complexity
• How does memory know whether it should supply data? (It must wait for the
caches)
• Selection algorithm if multiple caches have valid data
But valuable for cache-coherent machines with distributed
memory
• May be cheaper to obtain from nearby cache than distant
memory
• Especially when constructed out of SMP nodes (Stanford DASH)
MOESI: behavior
As MESI but new state O=Owned
Have copy which could be shared but memory does not have
the most up-to-date value
Coherence protocols
And now for a little bit of history
Write-Once (invalidation)
First to be described in the literature
4 states: I, V, R(eserved), D(irty); global invalidate line
(1) Read hit: access from cache, no state change
(2) Read Miss: if another cache has DIRTY copy, it inhibits memory, writes
the line back, and the requesting cache gets copy
Else, the line is loaded from memory
All caches with a copy set it VALID
(3) Write hit: if the line is DIRTY, write proceeds locally
If RESERVED, proceed locally and mark DIRTY
If VALID, write through and mark RESERVED; other caches set state to
INVALID
(4) Write Miss: like a read miss, the line is copied from memory or from a
DIRTY copy; line marked DIRTY. All other caches invalidate copies
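The write-hit rule, where only the first write to a line goes through to memory, can be sketched as a small function. The state letters follow the slide; the function name and the use of `None` for "no bus action" are illustrative.

```python
def write_once_write_hit(state):
    """Return (next state, bus action) for a write hit under Write-Once."""
    if state == "D":
        return "D", None             # already dirty: write locally
    if state == "R":
        return "D", None             # second write: stay local, mark dirty
    if state == "V":
        return "R", "BusWrOnce"      # first write: write through, others invalidate
    raise ValueError("a write hit requires a valid line")

state, history = "V", []
for _ in range(3):                   # three consecutive writes to the same line
    state, action = write_once_write_hit(state)
    history.append((state, action))
print(history)
```

Only the first of the three writes reaches the bus; the later ones are absorbed by the cache as in a write-back design.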
Write-Once Protocol
I - Invalid V - Valid R - Reserved D - Dirty
[Figure: state transition diagram over I, V, R, and D. Legible edge labels: PrRd/-, BusWB/-, BusRd/-, BusRdX/-, PrRd/BusRd, PrWr/BusWrOnce, BusRdX/BusWB, PrWr/BusRdX, PrWr/-]
Reserved: had copy, written once and no-one asked for it yet
BusWrOnce: BusRdX followed by BusWB
Synapse (invalidation)
First bus-based protocol to be implemented (protocol like SGI 4D)
3 states: I, S, and D
Avoids a global inhibit line: uses a tag bit in memory; if set, memory is not
up to date
(1) Read hit: access from cache; no state change
(2) Read miss:
• If another cache has a DIRTY copy, it supplies a nack, then writes back to
memory, resets tag bit in memory and sets its local state to INVALID; then
the requesting cache makes a second miss; the loaded line is set to VALID
• If block Shared, read from memory
(3) Write hit: if DIRTY, proceed locally; no state change
• if VALID, proceed like a write miss; including data transfer. There is no
invalidation signal
(4) Write miss: like a read miss but all VALID copies are set invalid
• line’s tag in main memory is set
Synapse Protocol
I - Invalid
S - Shared
D - Dirty
[Figure: state transition diagram over I, S, and D. Legible edge labels: PrRd/-, BusRd/-, PrWr/-, PrWr/BusRdX, BusRd/BusWB, BusRdX/BusWB]
[Figure: a further state transition diagram that includes states SD and D. Legible edge labels: PrWr/BusRdX, PrRd/-, PrWr/-, BusRd/BusCache, PrWr/BusInv]
Illinois Protocol
Implemented in SGI multiprocessors
4 states: I, S, VE (valid exclusive), D (similar to MESI)
Missed data comes from another cache whenever one holds the block; the bus has a SharedLine
(1) Read hit: access from cache; no state change
(2) Read miss:
• If block Dirty, transfer cache-to-cache, and write back; state to shared
• If block Shared, transfer cache-to-cache; set state to Shared
• If no cached copy get from mem; set state to Valid-Exclusive
(3) Write hit: if local copy is Dirty, write locally; no state change
• local VE? State to Dirty
• local Shared? Invalidate all other copies; state to Dirty
(4) Write miss:
• Same as read miss; local copy set to Dirty; all others invalidated
Illinois Protocol
I - Invalid S - Shared VE - Valid-Exclusive D - Dirty
[Figure: state transition diagram over I, S, VE, and D. Legible edge labels: BusRdX/-, BusInv/-, PrRd/-, PrRd/BusRd, BusRd/BusCache, BusRdX/BusWB, PrWr/BusRdX, PrWr/-]
Firefly write-back update protocol
Good performance when multiple processors are repeatedly
reading and updating the same location
3 states: VALID-EXCLUSIVE, SHARED, and DIRTY (similar to
MESI without I)
Global shared line
Firefly: behavior
(1) Read hit: access from cache; no state change
(2) Read miss: if other caches have a copy, they place it on the bus (multiple
suppliers possible) and all copies are set to SHARED; if a DIRTY copy exists, it is
also written back to memory. Otherwise, get the block from memory and set state
to VALID-EXCLUSIVE
(3) Write hit: if local copy is DIRTY, proceed locally
if local copy is VE, proceed locally; state set to DIRTY
if local copy is SHARED, a write to memory is initiated; other caches pick
up the new value and keep their state SHARED; local copy is set to SHARED if
other copies exist, otherwise to VE
(4) Write miss: like a read miss; if other copies exist, local copy is set to
SHARED and memory is also updated; if no other copies exist, local copy is set
to DIRTY
Dragon Write-back Update Protocol
4 states
• Exclusive-clean or exclusive (E): I and memory have it
• Shared clean (Sc): I, others, and maybe memory, but I’m not
owner
• Shared modified (Sm): I and others but not memory, and I’m the
owner
– Sm and Sc can coexist in different caches, with only one Sm
• Modified or dirty (D): I and no one else
No invalid state
• If in cache, cannot be invalid
• If not present in cache, can view as being in not-present or
invalid state
Dragon Write-back Update Protocol
New processor events: PrRdMiss, PrWrMiss
• Introduced to specify actions when block not present in cache
New bus transaction: BusUpd
• Broadcasts single word written on bus; updates other relevant
caches
Dragon: behavior
(1) Read hit: proceed locally; no state change
(2) Read miss: if another cache has a D or Sm copy, it supplies the data and raises
the SharedLine signal; the supplying cache sets its copy to Sm and the local copy is
set to Sc
if no D or Sm copy exists, the value comes from memory; any cache with a copy
(E or Sc) raises the SharedLine signal; the local copy is set to Sc if SharedLine is
raised, otherwise to E
(3) Write hit: if local copy is D, proceed locally; no state change
if local copy is E, proceed locally; state set to D
if local copy is Sm or Sc, delay the write and initiate a bus write; other caches get
the new data, update their local copies, and set them to Sc; the local copy
is set to Sm if other copies exist, otherwise to D
(4) Write miss: like a read miss, the copy comes from a cache with an Sm or D copy,
otherwise from memory; other copies, if any, are set to Sc; the local copy is set to
Sm if other copies exist, or to D if this is the only copy
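The write-hit case (3) is the one that distinguishes an update protocol, so it is sketched below. The state names follow the behavior list (D for modified); the `NOT_PRESENT` marker, the list-of-other-caches representation, and the function name are illustrative assumptions.

```python
NOT_PRESENT = "-"    # Dragon has no explicit invalid state

def dragon_write_hit(local, others):
    """Return (new local state, new other states, bus transaction or None)."""
    if local in ("D", "E"):
        return "D", list(others), None                   # exclusive: write locally
    if local in ("Sc", "Sm"):
        sharers = any(s != NOT_PRESENT for s in others)  # SharedLine response
        updated = ["Sc" if s != NOT_PRESENT else s for s in others]
        return ("Sm" if sharers else "D"), updated, "BusUpd"
    raise ValueError("a write hit requires the block to be present")

print(dragon_write_hit("E", ["-", "-"]))      # silent transition to D
print(dragon_write_hit("Sc", ["Sm", "-"]))    # writer becomes owner (Sm)
print(dragon_write_hit("Sm", ["-", "-"]))     # no sharers left: go to D
```

No copy is ever invalidated here: sharers receive the new word and stay valid, which is what makes repeated producer-consumer sharing cheap under this protocol.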
Dragon State Transition Diagram
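E: PrRd/— (stay in E); BusRd/— (to Sc); PrWr/— (to M)
Sc: PrRd/—, BusUpd/Update (stay in Sc); PrWr/BusUpd(S) (to Sm); PrWr/BusUpd(!S) (to M)
Sm: PrRd/—, PrWr/BusUpd(S), BusRd/Flush (stay in Sm); BusUpd/Update (to Sc); PrWr/BusUpd(!S) (to M)
M: PrRd/—, PrWr/— (stay in M); BusRd/Flush (to Sm)
Block not present: PrRdMiss/BusRd(!S) (to E); PrRdMiss/BusRd(S) (to Sc); PrWrMiss/(BusRd(S); BusUpd) (to Sm); PrWrMiss/BusRd(!S) (to M)
(S) means the SharedLine is asserted, (!S) that it is not; M is the state listed as D in the behavior description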
Cache Coherence & Mem Ordering
• Cache coherence
• For a single memory location (address)
– Program order per process
– Value returned by read is the “latest” value written
Memory Order:
What Programmers Expect
[Figure: processors (P) issuing loads and stores directly to a single shared memory]
Memory Accessing Order
Initially A = 0 and flag = 0
P                    P
A = 1;               while (flag == 0);
flag = 1;            print A;
Memory Accessing Order: The Reality
What causes A to print 0?
• Out-of-order execution in the processor
• Compiler reordering of memory accesses
• Shared-memory hardware: network, write buffers, etc.
How do you make sure the programmer gets what they want?
• Change the programming interface: memory models
• The programmer enforces order through annotations
• What are the advantages/disadvantages?
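To see why reordering matters for the A/flag example, the sketch below enumerates every interleaving of the writer's two stores with the reader's two steps (leaving the spin loop, then reading A). It is an abstract model rather than real concurrent code: `possible_prints` and the step labels are illustrative, and the only thing varied is the order in which the writer's stores become visible.

```python
from itertools import permutations

def possible_prints(writer_order):
    """Values the reader can print, given the order the writer's stores take effect."""
    results = set()
    # The reader performs two steps: leave the loop ("exit"), then read A ("read").
    steps = list(writer_order) + ["exit", "read"]
    for order in permutations(steps):
        # Keep only interleavings that respect each side's own order.
        if order.index(writer_order[0]) > order.index(writer_order[1]):
            continue
        if order.index("exit") > order.index("read"):
            continue
        if order.index("flag=1") > order.index("exit"):
            continue                     # the loop cannot exit before flag is set
        a = 0
        for step in order:
            if step == "A=1":
                a = 1
            elif step == "read":
                results.add(a)
    return results

in_order = possible_prints(["A=1", "flag=1"])     # stores visible in program order
reordered = possible_prints(["flag=1", "A=1"])    # stores visible out of order
print(in_order, reordered)
```

With the stores kept in program order the reader can only print 1; once they are allowed to swap, printing 0 becomes possible, which is the behavior a memory model has to rule out or expose to the programmer.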
Write Atomicity
Write Atomicity: Position in total order at which a write
appears to perform should be the same for all processes
• Nothing a process does after it has seen the new value produced
by a write W should be visible to other processes until they too
have seen W
P1           P2                  P3
A=1;         while (A==0);       while (B==0);
             B=1;                print A;
P0 : R R R W R R
P1 : R R R R R W
P2 : R R R R R R
SC in Write-through Example
Provides SC, not just coherence
MSI: State Transition Diagram
M: PrRd/—, PrWr/— (stay in M); BusRd/Flush (to S); BusRdX/Flush (to I)
S: PrRd/—, BusRd/— (stay in S); PrWr/BusRdX (to M); BusRdX/— (to I)
I: PrRd/BusRd (to S); PrWr/BusRdX (to M)
Satisfying Coherence
Everything like VI simple write-back protocol except for:
• Writes that don’t appear on the bus:
– sequence of such writes between two bus xactions for the block
must come from same processor, say P
– in serialization, the sequence appears between these two bus
xactions
– reads by P will see them in this order w.r.t. other bus transactions
– reads by other processors separated from sequence by a bus
xaction, which places them in the serialized order w.r.t the writes
– so reads by all processors see writes in same order
P0 : R R R W W R
P1 : R R R R W
P2 : R R R R R
Write Serialization?
write on bus from Px - Bus serializes all bus writes
…. and reads
Write on bus from Py - local reads serialized with respect
to those on bus
Write hit (has to be from same proc)- local writes?
local reads (Py)
read or write from Pz
other reads
Satisfying Sequential Consistency