KTMTSS Shared Memory Multiprocessor
Supporting Programming Models
Figure: layers of abstraction: programming models (multiprogramming, shared address space) above a compilation or library layer; the communication abstraction at the user/system boundary; operating systems support; the hardware/software boundary; communication hardware; and the physical communication medium.
Figure: shared-memory organizations: (a) shared (interleaved) first-level cache over a bus to interleaved main memory and I/O devices; (b) bus-based shared memory with per-processor caches ($); (c) dancehall, with processors P1…Pn and memories (Mem) on opposite sides of an interconnection network; (d) distributed memory, with a memory attached to each processor node on the network.
Caches and Cache Coherence
Shared cache
• Low-latency sharing and prefetching across processors
• Sharing of working sets
• No coherence problem (and hence no false sharing either)
• But high bandwidth needs and negative interference (e.g. conflicts)
• Hit and miss latency increased due to intervening switch and cache size
• Mid 80s: used to connect a couple of processors on a board (Encore, Sequent)
• Today: for multiprocessor on a chip (for small-scale systems or nodes)
Dancehall
• No longer popular: everything is uniformly far away
Distributed memory
• Most popular way to build scalable systems, discussed later
Outline
• Synchronization

Intuitive Memory Model
Reading a location should return latest value written (by any process)
Easy in uniprocessors
• Except for I/O: coherence between I/O devices and processors
• But infrequent so software solutions work
– uncacheable memory, uncacheable operations, flush pages, pass I/O data through caches
Would like same to hold when processes run on different processors
• E.g. as if the processes were interleaved on a uniprocessor
Example Cache Coherence Problem
Figure: P1, P2, and P3 with private caches ($) on a bus to memory and I/O devices; location u is initially 5 in memory. Events: (1) P1 reads u; (2) P3 reads u; (3) P3 writes u to 7; (4) P1 reads u again and may see the stale value 5 in its cache; (5) P2 reads u: which value should it get?
Some Basic Definitions
Extend from definitions in uniprocessors to those in multiprocessors
Memory operation: a single read (load), write (store) or read-modify-
write access to a memory location
• Assumed to execute atomically w.r.t. each other
Issue: a memory operation issues when it leaves processor’s internal
environment and is presented to memory system (cache, buffer …)
Perform: operation appears to have taken place, as far as processor
can tell from other memory operations it issues
• A write performs w.r.t. the processor when a subsequent read by the
processor returns the value of that write or a later write
• A read performs w.r.t. the processor when subsequent writes issued by
the processor cannot affect the value returned by the read
In multiprocessors, the definitions stay the same, but replace “the processor” by “a processor”
• Also, complete: perform with respect to all processors
• Still need to make sense of order in operations from different processes
Formal Definition of Coherence
Results of a program: values returned by its read operations
A memory system is coherent if the results of any execution of a program
are such that, for each location, it is possible to construct a hypothetical
serial order of all operations to the location that is consistent with the
results of the execution and in which:
1. operations issued by any particular process occur in the order issued by
that process, and
2. the value returned by a read is the value written by the last write to that
location in the serial order
Two necessary features:
• Write propagation: value written must become visible to others
• Write serialization: writes to location seen in same order by all
– if I see w1 after w2, you should not see w1 before w2
– no need for analogous read serialization since reads not visible to others
(a brute-force check of the serial-order definition is sketched below)
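To make the definition concrete, here is a minimal brute-force checker (our code, not the slides'): given each process's program-ordered operations on one location and the values its reads observed, it searches all interleavings for a serial order satisfying conditions 1 and 2. The traces in main are hypothetical.

#include <stdio.h>

typedef struct { int is_write; int val; } Op;  /* value written, or value a read observed */

/* Search all interleavings of p1 and p2 (each kept in program order)
 * for a serial order in which every read returns the last write. */
static int coherent(const Op *p1, int n1, const Op *p2, int n2, int init) {
    for (int mask = 0; mask < (1 << (n1 + n2)); mask++) {
        int i = 0, j = 0, mem = init, ok = 1;
        for (int k = 0; k < n1 + n2 && ok; k++) {
            Op op;
            if ((mask >> k) & 1) {               /* slot k taken from p1 */
                if (i >= n1) { ok = 0; break; }
                op = p1[i++];
            } else {                             /* slot k taken from p2 */
                if (j >= n2) { ok = 0; break; }
                op = p2[j++];
            }
            if (op.is_write) mem = op.val;       /* track the last write (rule 2) */
            else if (mem != op.val) ok = 0;      /* a read must return it */
        }
        if (ok && i == n1 && j == n2) return 1;  /* found a legal serial order */
    }
    return 0;
}

int main(void) {
    Op p1[]   = { {1, 7} };             /* P1: W(u)=7           */
    Op good[] = { {0, 5}, {0, 7} };     /* P2: R(u)->5, R(u)->7 */
    Op bad[]  = { {0, 7}, {0, 5} };     /* P2: R(u)->7, R(u)->5 */
    printf("R5,R7: %s\n", coherent(p1, 1, good, 2, 5) ? "coherent" : "incoherent");
    printf("R7,R5: %s\n", coherent(p1, 1, bad,  2, 5) ? "coherent" : "incoherent");
    return 0;
}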
Snooping-based Coherence
Basic Idea
Transactions on bus are visible to all processors
Processors or their representatives can snoop (monitor) bus and take
action on relevant events (e.g. change state)
Implementing a Protocol
Cache controller now receives inputs from both sides:
• Requests from processor, bus requests/responses from snooper
In either case, takes zero or more actions
• Updates state, responds with data, generates new bus transactions
Protocol is distributed algorithm: cooperating state machines
• Set of states, state transition diagram, actions
Granularity of coherence is typically cache block
• Like that of allocation in cache and transfer to/from cache
Figure: P1…Pn with caches ($) on a shared bus to memory (Mem) and I/O devices; a bus snooper beside each cache observes cache-memory transactions.
Write-through State Transition Diagram
Figure: two-state write-through, write-no-allocate protocol. Valid (V): PrRd/—, PrWr/BusWr; an observed BusWr/— takes V to I. Invalid (I): PrRd/BusRd takes I to V; PrWr/BusWr stays in I.
Is it Coherent?
Ordering Reads
Read misses: appear on bus, and will see last write in bus order
Read hits: do not appear on bus
• But value read was placed in cache by either
– most recent write by this processor, or
– most recent read miss by this processor
• Both these transactions appear on the bus
• So read hits also see values as being produced in consistent bus order
Figure: streams of reads (R) and a write (W) from P1 and P2, ordered relative to the bus transactions between them.
Problem with Write-Through
Memory Consistency
Writes to a location become visible to all in the same order
But when does a write become visible?
• How to establish orders between a write and a read by different procs?
– Typically use event synchronization, employing more than one location
P1 P2
/* Assume initial value of A and flag is 0 */
A = 1;              while (flag == 0); /* spin idly */
flag = 1;           print A;
• Intuition not guaranteed by coherence
• Sometimes expect memory to respect order between accesses to different locations issued by a given process
– to preserve orders among accesses to same location by different processes
• Coherence doesn’t help: pertains only to a single location (a sketch of the flag idiom with explicit ordering follows)
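The flag idiom above can be written so the intended order actually holds. A minimal sketch in C11 (our code, not the slides'): the release store to flag and the acquire load in the spin loop establish exactly the write-to-read order that coherence alone does not promise, so P2 always prints A = 1.

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static int A;                       /* ordinary data */
static atomic_int flag;             /* synchronization flag, initially 0 */

static void *p1(void *arg) {
    (void)arg;
    A = 1;                                              /* data write */
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

static void *p2(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                               /* spin idly */
    printf("A = %d\n", A);                              /* guaranteed to print 1 */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}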
Another Example of Orders
P1 P2
/*Assume initial values of A and B are 0*/
(1a) A = 1; (2a) print B;
(1b) B = 2; (2b) print A;
Sequential Consistency
Figure: the programmer's abstraction of SC: processors P1…Pn issue memory references as per program order, and a single switch connects one processor at a time to memory.
SC Example
What matters is the order in which operations appear to execute, not the order in which they actually execute
P1 P2
/*Assume initial values of A and B are 0*/
(1a) A = 1; (2a) print B;
(1b) B = 2; (2b) print A;
– possible outcomes for (A,B): (0,0), (1,0), (1,2); impossible under SC: (0,2)
– we know 1a->1b and 2a->2b by program order
– A = 0 implies 2b->1a, which implies 2a->1b
– B = 2 implies 1b->2a, which leads to a contradiction
– (the enumeration below confirms this by brute force)
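The same argument can be checked mechanically. A small sketch (our code): enumerate every interleaving of the two program orders and record the (A,B) outcomes; running it reports (0,0), (1,0), and (1,2) as possible and (0,2) as impossible, matching the reasoning above.

#include <stdio.h>

int main(void) {
    /* op ids: 0 = (1a) A=1, 1 = (1b) B=2, 2 = (2a) print B, 3 = (2b) print A */
    int seen[2][2] = {{0}};                   /* seen[A printed][B printed != 0] */
    for (int mask = 0; mask < 16; mask++) {   /* bit k set: slot k taken from P1 */
        int i = 0, j = 0, A = 0, B = 0, rA = -1, rB = -1, ok = 1;
        for (int k = 0; k < 4 && ok; k++) {
            if ((mask >> k) & 1) {            /* next op of P1, in program order */
                if (i == 0) A = 1; else if (i == 1) B = 2; else ok = 0;
                i++;
            } else {                          /* next op of P2, in program order */
                if (j == 0) rB = B; else if (j == 1) rA = A; else ok = 0;
                j++;
            }
        }
        if (ok && i == 2 && j == 2)           /* a legal SC interleaving */
            seen[rA][rB != 0] = 1;
    }
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            printf("(A,B) = (%d,%d) possible? %s\n",
                   a, b ? 2 : 0, seen[a][b] ? "yes" : "no");
    return 0;
}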
Implementing SC
Implementing SC
Write Atomicity
Write Atomicity: Position in total order at which a write appears to
perform should be the same for all processes
• Nothing a process does after it has seen the new value produced by a
write W should be visible to other processes until they too have seen W
• In effect, extends write serialization to writes from multiple processes
P1              P2              P3
A=1;            while (A==0);
                B=1;            while (B==0);
                                print A;
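A sketch of this example as three POSIX threads over C11 seq_cst atomics (whose default ordering provides SC, and hence write atomicity, for the atomic accesses); our code, not the slides'. P3 must print 1: it cannot observe P2's B=1 without the A=1 that P2 saw also being visible to it.

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static atomic_int A, B;             /* both initially 0 */

static void *p1(void *x) { (void)x; atomic_store(&A, 1); return NULL; }

static void *p2(void *x) {
    (void)x;
    while (atomic_load(&A) == 0) ;  /* wait until A's new value is seen */
    atomic_store(&B, 1);
    return NULL;
}

static void *p3(void *x) {
    (void)x;
    while (atomic_load(&B) == 0) ;  /* wait until B's new value is seen */
    printf("A = %d\n", atomic_load(&A));   /* always 1 under SC */
    return NULL;
}

int main(void) {
    pthread_t t[3];
    pthread_create(&t[0], NULL, p1, NULL);
    pthread_create(&t[1], NULL, p2, NULL);
    pthread_create(&t[2], NULL, p3, NULL);
    for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
    return 0;
}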
More Formally
Each process’s program order imposes partial order on set of all operations
Interleaving of these partial orders defines a total order on all operations
Many total orders may be SC (SC does not define particular interleaving)
Sufficient Conditions for SC
• Every process issues memory operations in program order
• After a write is issued, the issuing process waits for the write to complete before issuing its next memory operation
• After a read is issued, the issuing process waits for the read to complete, and for the write whose value it returns to complete, before issuing its next memory operation
– the last condition is what provides write atomicity
SC in Write-through Example
Invalidation-based Protocols
Update-based Protocols
Invalidate versus Update

Basic MSI Writeback Invalidation Protocol
States
• Invalid (I)
• Shared (S): one or more
• Dirty or Modified (M): one only
Processor Events:
• PrRd (read)
• PrWr (write)
Bus Transactions
• BusRd: asks for copy with no intent to modify
• BusRdX: asks for copy with intent to modify
• BusWB: updates memory
Actions
• Update state, perform bus transaction, flush value onto bus
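The lists above pin down a single cache's controller. A minimal sketch (our enum names; the bus and flush machinery abstracted into returned actions) of the MSI machine as a transition function:

#include <stdio.h>

typedef enum { I, S, M } State;
typedef enum { PrRd, PrWr, BusRd, BusRdX } Event;
typedef enum { None, DoBusRd, DoBusRdX, DoFlush } Action;

static State transition(State st, Event ev, Action *act) {
    *act = None;
    switch (st) {
    case I:
        if (ev == PrRd)  { *act = DoBusRd;  return S; }
        if (ev == PrWr)  { *act = DoBusRdX; return M; }
        return I;                        /* snooped events: nothing to do */
    case S:
        if (ev == PrWr)  { *act = DoBusRdX; return M; }
        if (ev == BusRdX) return I;      /* another writer: invalidate */
        return S;                        /* PrRd and BusRd: no change */
    case M:
        if (ev == BusRd)  { *act = DoFlush; return S; }  /* supply dirty data */
        if (ev == BusRdX) { *act = DoFlush; return I; }
        return M;                        /* PrRd/PrWr hit locally */
    }
    return st;
}

int main(void) {
    Action a;
    State st = transition(I, PrWr, &a);  /* I -> M, issuing BusRdX */
    printf("state %d, action %d\n", st, a);
    return 0;
}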
State Transition Diagram
Figure: MSI state transition diagram. M: PrRd/—, PrWr/—; BusRd/Flush takes M to S; BusRdX/Flush takes M to I. S: PrRd/—, BusRd/—; PrWr/BusRdX takes S to M; BusRdX/— takes S to I. I: PrRd/BusRd takes I to S; PrWr/BusRdX takes I to M.
Satisfying Coherence
Satisfying Sequential Consistency
1. Appeal to definition:
• Bus imposes total order on bus xactions for all locations
• Between xactions, procs perform reads/writes locally in program order
• So any execution defines a natural partial order
– Mj subsequent to Mi if (i) Mj follows Mi in program order on the same processor, or (ii) Mj generates a bus xaction that follows the memory operation for Mi
• In segment between two bus transactions, any interleaving of ops from
different processors leads to consistent total order
• In such a segment, writes observed by processor P are serialized as follows:
– writes from other processors, by the previous bus xaction P issued
– writes from P, by program order
2. Show sufficient conditions are satisfied
• Write completion: can detect when write appears on bus
• Write atomicity: if a read returns the value of a write, that write has
already become visible to all others (can reason through the different cases)
MESI (4-state) Invalidation Protocol
Problem with MSI protocol
• Reading and modifying data is 2 bus xactions, even if no one is sharing
– e.g. even in sequential program
– BusRd (I->S) followed by BusRdX or BusUpgr (S->M)
Add exclusive state: write locally without xaction, but not modified
• Main memory is up to date, so cache not necessarily owner
• States
– invalid
– exclusive or exclusive-clean (only this cache has copy, but not modified)
– shared (two or more caches may have copies)
– modified (dirty)
• I -> E on PrRd if no one else has a copy
– needs “shared” signal on bus: wired-OR line asserted in response to BusRd (see the read-miss sketch below)
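A small sketch of just the new decision (the bus interface here is hypothetical): on a read miss, issue BusRd, sample the wired-OR shared line, and enter E or S accordingly.

#include <stdbool.h>
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;

/* Model the snoop: the shared line is asserted iff some other
 * cache currently holds a copy of the block. */
static bool bus_read_shared_line(int holders_elsewhere) {
    return holders_elsewhere > 0;
}

static MesiState read_miss(int holders_elsewhere) {
    bool s = bus_read_shared_line(holders_elsewhere);  /* BusRd(S) vs BusRd(S') */
    return s ? SHARED : EXCLUSIVE;
}

int main(void) {
    printf("alone  -> %d (EXCLUSIVE)\n", read_miss(0));
    printf("shared -> %d (SHARED)\n",    read_miss(2));
    return 0;
}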
Figure: MESI state transition diagram. M: PrRd/—, PrWr/—; BusRd/Flush takes M to S; BusRdX/Flush takes M to I. E: PrRd/—; PrWr/— takes E to M; BusRd/Flush′ takes E to S; BusRdX/Flush′ takes E to I. S: PrRd/—, BusRd/Flush′; PrWr/BusRdX takes S to M; BusRdX/Flush′ takes S to I. I: PrRd/BusRd(S) takes I to S, PrRd/BusRd(S′) takes I to E; PrWr/BusRdX takes I to M. (Flush′: supply data only if this cache is responsible; S/S′: shared line asserted/not asserted.)
Lower-level Protocol Choices
Dragon Write-back Update Protocol
4 states
• Exclusive-clean or exclusive (E): I and memory have it
• Shared clean (Sc): I, others, and maybe memory, but I’m not owner
• Shared modified (Sm): I and others but not memory, and I’m the owner
– Sm and Sc can coexist in different caches, with only one Sm
• Modified or dirty (D): I have it, and no one else does
No invalid state
• If in cache, cannot be invalid
• If not present in cache, can view as being in not-present or invalid state
New processor events: PrRdMiss, PrWrMiss
• Introduced to specify actions when block not present in cache
New bus transaction: BusUpd
• Broadcasts the single word written on the bus; other relevant caches update their copies (a write-decision sketch follows)
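A minimal sketch of the corresponding write decision (our names): a write to a block in a shared state broadcasts BusUpd, and the shared line tells the writer whether anyone else kept a copy, choosing between Sm and M.

#include <stdbool.h>
#include <stdio.h>

typedef enum { E, Sc, Sm, M } DragonState;

static DragonState write_in_shared(bool others_have_copy) {
    /* BusUpd broadcast happens here; other caches update their word */
    return others_have_copy ? Sm : M;   /* BusUpd(S) -> Sm, BusUpd(S') -> M */
}

int main(void) {
    printf("write, others sharing -> %d (Sm)\n", write_in_shared(true));
    printf("write, alone          -> %d (M)\n",  write_in_shared(false));
    return 0;
}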
Dragon State Transition Diagram
Figure: Dragon state transition diagram. E: PrRd/—; PrWr/— takes E to M; BusRd/— takes E to Sc. Sc: PrRd/—; BusUpd/Update; PrWr/BusUpd(S) takes Sc to Sm, PrWr/BusUpd(S′) takes Sc to M. Sm: PrRd/—; BusRd/Flush; BusUpd/Update takes Sm to Sc; PrWr/BusUpd(S) stays in Sm, PrWr/BusUpd(S′) takes Sm to M. M: PrRd/—, PrWr/—; BusRd/Flush takes M to Sm. On misses: PrRdMiss/BusRd(S) enters Sc, PrRdMiss/BusRd(S′) enters E; PrWrMiss/(BusRd(S); BusUpd) enters Sm, PrWrMiss/BusRd(S′) enters M. (S/S′: shared line asserted/not asserted.)
Assessing Protocol Tradeoffs
Figure: address-bus and data-bus traffic (MB/s) for application/OS code and data and for Barnes, LU, Ocean, Radiosity, Radix, and Raytrace, each under Ill (Illinois/MESI), 3St (3-state MSI), and 3St-RdEx (MSI with read-exclusive) protocols.
• MSI versus MESI doesn’t seem to matter for bandwidth for these workloads
• Upgrades instead of read-exclusive helps
• Same story when working sets don’t fit for Ocean, Radix, Raytrace
Impact of Cache Block Size
Multiprocessors add new kind of miss to cold, capacity, conflict
• Coherence misses: true sharing and false sharing
– latter due to granularity of coherence being larger than a word
• Both miss rate and traffic matter
Reducing misses architecturally in invalidation protocol
• Capacity: enlarge cache; increase block size (if spatial locality)
• Conflict: increase associativity
• Cold and Coherence: only block size
Increasing block size has advantages and disadvantages
• Can reduce misses if spatial locality is good
• Can hurt too
– increase misses due to false sharing if spatial locality not good
– increase misses due to conflicts in fixed-size cache
– increase traffic due to fetching unnecessary data and due to false sharing (see the false-sharing sketch below)
– can increase miss penalty and perhaps hit cost
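The false-sharing cost is easy to provoke. A hedged demo (our code; assumes a 64-byte coherence block; compile with -O2 -pthread): two threads bump independent counters that either share a block or are padded apart; on a typical coherent multiprocessor the padded version runs several times faster.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

static struct { volatile long x, y; } shared_block;   /* counters in one block */
static struct { volatile long x; char pad[64]; volatile long y; } padded;

static void *bump(void *p) {
    volatile long *c = p;
    for (long i = 0; i < ITERS; i++) (*c)++;   /* each ++ dirties the block */
    return NULL;
}

static double timed_run(volatile long *a, volatile long *b) {
    pthread_t t1, t2;
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    pthread_create(&t1, NULL, bump, (void *)a);
    pthread_create(&t2, NULL, bump, (void *)b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &e);
    return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
}

int main(void) {
    printf("same block:      %.2f s\n", timed_run(&shared_block.x, &shared_block.y));
    printf("separate blocks: %.2f s\n", timed_run(&padded.x, &padded.y));
    return 0;
}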
Figure: decision tree classifying a miss on a processor's first reference to a memory block. The questions: first access systemwide? old copy with state = invalid still there? modified word(s) accessed during the block's lifetime? has the block been modified since replacement? These lead to the categories: cold, false-sharing-cold, true-sharing-cold, false-sharing-inval-cap, true-sharing-inval-cap, pure-false-sharing, pure-true-sharing, pure-capacity, true-sharing-capacity, false-sharing-cap-inval, and true-sharing-cap-inval.
Impact of Block Size on Miss Rate
Results shown only for default problem size: varied behavior
• Need to examine impact of problem size and p as well (see text)
Figure: miss-rate (%) breakdown into cold, capacity, true-sharing, false-sharing, and upgrade components versus block size (8 to 256 bytes) for Barnes, LU, Ocean, Radiosity, Radix, and Raytrace.
• Working set doesn’t fit: impact on capacity misses much more critical
Figure: data-bus traffic versus block size (8 to 256 bytes): bytes/instruction for Barnes, Radiosity, and Raytrace; bytes/FLOP for LU and Ocean; bytes/instruction for Radix.
• Results different than for miss rate: traffic almost always increases
• When working sets fit, overall traffic is still small, except for Radix
• Fixed overhead is significant component
– So total traffic often minimized at 16–32 byte block, not smaller
• Working set doesn’t fit: even 128-byte blocks are good for Ocean due to capacity misses
Making Large Blocks More Effective
Software
• Improve spatial locality by better data structuring (more later)
• Compiler techniques
Hardware
• Retain granularity of transfer but reduce granularity of coherence
– use subblocks: same tag but different state bits (a sketch follows this list)
– one subblock may be valid while another is invalid or dirty
• Reduce both granularities, but prefetch more blocks on a miss
• Proposals for adjustable cache size
• More subtle: delay propagation of invalidations and perform all at once
– But can change consistency model: discuss later in course
• Use update instead of invalidate protocols to reduce false sharing effect
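A minimal sketch of the subblock idea from the list above (sizes and names are assumptions): one tag covers the transfer block, while coherence state is tracked per subblock, so an invalidation need not destroy the whole block.

#include <stdint.h>

enum SubState { INV, SHD, MOD };        /* per-subblock MSI-style state */

#define SUBBLOCKS 4                     /* e.g. four 16B subblocks per 64B block */

struct CacheLine {
    uint64_t tag;                       /* one tag for the whole transfer block */
    enum SubState state[SUBBLOCKS];     /* coherence tracked per subblock */
};

/* A snooped BusRdX for one word invalidates only its subblock. */
static void invalidate_subblock(struct CacheLine *l, unsigned idx) {
    if (idx < SUBBLOCKS)
        l->state[idx] = INV;            /* neighbors stay valid/dirty */
}

int main(void) {
    struct CacheLine l = { 0x42, { SHD, SHD, MOD, SHD } };
    invalidate_subblock(&l, 2);         /* subblock 2 -> INV; others untouched */
    return l.state[2] == INV ? 0 : 1;
}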
Update vs Invalidate: Miss Rates
Figure: miss-rate (%) breakdown into cold, capacity, true-sharing, and false-sharing components for LU, Ocean, Radix, and Raytrace under invalidate (inv), update (upd), and mixed (mix) protocols.
• Lots of coherence misses: updates help
• Lots of capacity misses: updates hurt (keep data in cache uselessly)
• Updates seem to help, but this ignores upgrade and update traffic
Figure: upgrade/update rates for the same workloads under invalidate, update, and mixed protocols.
• Overall, trend is away from update-based protocols