04 Coherence
Cache Coherence
Nima Honarmand
Fall 2015 :: CSE 610 – Parallel Computer Architectures
[Figure: logical view (P1–P4 sharing one memory system) vs. reality, more or less: each processor has a private cache ($) in front of memory]
No-Cache, No-Problem
• Scenario I: processors have no caches; every ld/st goes directly to memory

Processor 0                  Processor 1                  Mem
0: addi r1,accts,r3                                       500
1: ld   0(r3),r4                                          500
2: blt  r4,r2,6
3: sub  r4,r2,r4
4: st   r4,0(r3)                                          400
5: call spew_cash
                             0: addi r1,accts,r3          400
                             1: ld   0(r3),r4
                             2: blt  r4,r2,6
                             3: sub  r4,r2,r4
                             4: st   r4,0(r3)             300
                             5: call spew_cash
Cache Incoherence

Processor 0                  P0 $     P1 $     Mem
0: addi r1,accts,r3                            500
1: ld   0(r3),r4             V:500             500
2: blt  r4,r2,6
3: sub  r4,r2,r4
4: st   r4,0(r3)             D:400             500
5: call spew_cash
                             Processor 1
                             0: addi r1,accts,r3
1: ld   0(r3),r4             D:400    V:500    500
2: blt  r4,r2,6
3: sub  r4,r2,r4             D:400    V:500    500
4: st   r4,0(r3)             D:400    D:400    500
5: call spew_cash
• Scenario II: processors have write-back caches
– Potentially 3 copies of accts[241].bal: memory, P0 $, P1 $
– Can get incoherent (inconsistent)
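The lost update in the trace above can be reproduced with a toy model of two write-back caches. This is a sketch only; the `Cache` class and the 'V'/'D' state encoding are my own, chosen to mirror the slide's annotations.

```python
# Toy model of the write-back incoherence scenario (illustrative only).
# Each cache line holds a state ('V' = valid/clean, 'D' = dirty) and a value.

class Cache:
    def __init__(self, mem):
        self.mem = mem      # shared backing memory: {addr: value}
        self.lines = {}     # addr -> [state, value]

    def load(self, addr):
        if addr not in self.lines:               # miss: fill from memory
            self.lines[addr] = ['V', self.mem[addr]]
        return self.lines[addr][1]

    def store(self, addr, value):
        self.lines[addr] = ['D', value]          # write-back: memory untouched

mem = {'accts[241].bal': 500}
p0, p1 = Cache(mem), Cache(mem)

# P0 withdraws 100: ld, sub, st -> line becomes D:400 in P0's cache
p0.store('accts[241].bal', p0.load('accts[241].bal') - 100)

# P1 withdraws 100: its miss is filled from *memory* (still 500), not P0's cache
p1.store('accts[241].bal', p1.load('accts[241].bal') - 100)

# Result: both caches hold D:400, memory still 500 -- one withdrawal is lost
print(p0.lines, p1.lines, mem)
```

Without a coherence protocol, nothing forces P1's miss to observe P0's dirty line, which is exactly the D:400 / V:500 situation in the table.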
Coherence Protocols
• Hardware-based solutions
– Far more common
• Hybrid solutions
– Combination of hardware/software techniques
– E.g., a block might be under SW coherence first and then
switch to HW coherence
– Or, hardware can track sharers and SW decides when to
invalidate them
– And many other schemes…
Snoopy Protocols
• Assume write-back caches
• Transactions
– GetS(hared), GetM(odified), PutM(odified)
• Messages
– GetS, GetM, PutM, Data (data reply)
[FSM fragment; transitions labeled "event / bus action": LD / -- , ST / BusRdX , Evict / BusWB]
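The transition labels above can be collected into a small MSI-style table. This is a sketch under my own encoding (a dict keyed by state and event); the state and bus-message names follow the diagram fragment, and the snooped-event entries are the usual MSI ones, not taken from the slides.

```python
# Minimal MSI cache-controller transition table (sketch, not a full protocol).
# Key: (state, event) -> (next_state, bus_action); '--' means no bus traffic.
MSI = {
    ('I', 'LD'):     ('S', 'BusRd'),
    ('I', 'ST'):     ('M', 'BusRdX'),
    ('S', 'LD'):     ('S', '--'),
    ('S', 'ST'):     ('M', 'BusRdX'),     # upgrade: invalidate other sharers
    ('M', 'LD'):     ('M', '--'),
    ('M', 'ST'):     ('M', '--'),         # write hit in M: no bus traffic
    ('M', 'Evict'):  ('I', 'BusWB'),      # dirty eviction writes back
    ('S', 'Evict'):  ('I', '--'),
    # Snooped (remote) events:
    ('M', 'OtherBusRd'):  ('S', 'Flush'), # supply data, downgrade
    ('M', 'OtherBusRdX'): ('I', 'Flush'),
    ('S', 'OtherBusRdX'): ('I', '--'),
}

def step(state, event):
    return MSI[(state, event)]

# A line that is read, written twice, then evicted: I -> S -> M -> M -> I
s, msgs = 'I', []
for ev in ['LD', 'ST', 'ST', 'Evict']:
    s, act = step(s, ev)
    msgs.append(act)
print(s, msgs)
```

Note how the second ST produces no bus traffic: once the line is in M, write hits are silent, which is the property MESI then extends to first writes of lines fetched in E.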
• Advantages:
– In E state, no invalidation traffic on write hits
• Cuts down on upgrade traffic for lines that are first read and then written
– Closely approximates traffic on a uniprocessor for sequential programs
– Cache-to-cache transfer can cut down latency in some machines
• Disadvantages:
– Complexity of the mechanism that determines exclusiveness
– Memory needs to wait until sharing status is determined
[Figures: FSM at cache controller and FSM at memory controller; see the “Primer” (Sec 7.3) for the detailed spec]
MOESI Framework
[Sweazey & Smith, ISCA’86]
M - Modified (dirty, only copy)
O - Owned (dirty but shared)
E - Exclusive (clean, only copy)
S - Shared (one of possibly several copies)
I - Invalid
• Variants:
– MSI
– MESI
– MOSI
– MOESI
[Venn diagram: validity = {M, O, E, S}; exclusiveness = {M, E}; ownership = {M, O}]
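The validity/exclusiveness/ownership properties of the five states can be written down as sets and checked mechanically. The set encoding below is my own; the semantics follow the state definitions above.

```python
# MOESI state properties as sets (encoding is mine; semantics from the slide).
VALID     = {'M', 'O', 'E', 'S'}   # line holds usable data
DIRTY     = {'M', 'O'}             # this cache must eventually write back
EXCLUSIVE = {'M', 'E'}             # no other cache holds a copy
OWNED     = {'M', 'O'}             # this cache is responsible for the data

def can_write_silently(state):
    # A write hit needs exclusivity: M is already writable, and E can
    # upgrade to M without any bus/network traffic.
    return state in EXCLUSIVE

for s in ['M', 'O', 'E', 'S', 'I']:
    print(s, 'valid:', s in VALID, 'dirty:', s in DIRTY,
          'silent-write:', can_write_silently(s))
```

The protocol variants (MSI, MESI, MOSI, MOESI) simply pick which subset of these five states they implement.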
Advanced Issues
in Snoopy Coherence
Non-Atomic Interconnect
• (A) entries in the previous tables: situations that cannot
happen because bus is atomic
– i.e., bus is not released until the transaction is complete
– Cannot have multiple on-going requests for the same line
Split-Transaction Buses
• Can overlap req/resp of multiple transactions on the bus
– Need to identify request/response using a tag
[Timeline: Req 1, Req 2, Req 3 issued back-to-back on the request bus; Rep 3, Rep 1, ... returned on the response bus, possibly out of order]
Issues:
• Protocol races become possible
– Protocol races result in more transient states
• Need to buffer requests and responses
– Buffer own reqs: req bus might be busy (taken by someone else)
– Buffer other actors’ reqs: busy processing another req
– Buffer resps: resp bus might be busy (taken by someone else)
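The tag-matching idea behind a split-transaction bus can be sketched in a few lines. This is a hypothetical controller fragment (names `issue`/`reply` are mine), showing only how a tag lets out-of-order replies find their buffered requests.

```python
# Sketch: matching out-of-order replies to requests on a split-transaction bus.
# Each request carries a tag; replies may arrive in any order
# (e.g., Rep 3 before Rep 1, as in the timeline above).
outstanding = {}      # tag -> request description (our buffered requests)
next_tag = 0

def issue(req):
    global next_tag
    tag = next_tag; next_tag += 1
    outstanding[tag] = req          # buffer our own request until reply arrives
    return tag

def reply(tag, data):
    req = outstanding.pop(tag)      # tag identifies which transaction completed
    return (req, data)

t1 = issue('GetS A'); t2 = issue('GetS B'); t3 = issue('GetM C')
# Replies arrive out of order:
done = [reply(t3, 'dataC'), reply(t1, 'dataA')]
print(done, list(outstanding.values()))   # Req 2 still outstanding
```

The `outstanding` table is exactly the request buffer the bullet list calls for; a real controller also bounds its size, which limits how many transactions may be in flight.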
Common solutions (for snooping with multi-level cache hierarchies):
1. Independent snooping at each level
– Basically forwarding each snooped request to higher levels
2. Duplicate L1 tags at L2
3. Maintain cache inclusion
Handling Writebacks
• Allow CPU to proceed on a miss ASAP
– Fetch the requested block
– Do the writeback of the victim later
Directory Coherence
Protocols
• Directory example for block A:
– Initially: A: Shared; sharers = {1}
– Node 2: Write A (miss) → A: Modified; owner = 2
– Node 1: Load A (miss) → A: Shared; sharers = {1, 2}
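That directory-state trace can be stepped through with a minimal directory entry. The `DirEntry` class and its method names are my own sketch; the real protocol actions that are elided (fetching dirty data from the owner, invalidating sharers) are noted in comments.

```python
# Sketch of a directory entry tracking the example trace (encoding is mine).
class DirEntry:
    def __init__(self):
        self.state, self.sharers = 'I', set()

    def get_s(self, node):          # read miss from `node`
        # (real protocol: if state is 'M', first fetch dirty data from the
        # owner and downgrade it; the owner then remains in the sharer set)
        self.state = 'S'
        self.sharers.add(node)

    def get_m(self, node):          # write miss from `node`
        # (real protocol: invalidate current sharers and collect Inv-Acks)
        self.state, self.sharers = 'M', {node}

d = DirEntry()
d.get_s(1)          # A: Shared; sharers = {1}
d.get_m(2)          # node 2 writes A -> A: Modified; owner = 2
d.get_s(1)          # node 1 re-reads -> A: Shared; sharers = {1, 2}
print(d.state, sorted(d.sharers))
```

The key point the trace makes is that the directory, unlike a snoopy bus, always knows exactly which caches to contact.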
More Nomenclature
• Local Node (L)
– Node initiating the transaction we care about
• Home Node (H)
– Node where directory/main memory for the block lives
• Remote Node (R)
– Any other node that participates in the transaction
Transitions: I → S; M or S → I
• I or S → M: requestor collects the Inv-Acks
• Advantages (of notifying the directory on evictions):
– Allows directory to remove cache from sharers list
• Avoids unnecessary invalidate messages in the future
• Helps with providing line in E state in MESI protocol
• Helps with limited-pointer directories (in a few slides)
– Simplifies the protocol
• Disadvantages:
– Notification traffic is unnecessary if block is re-read before
anyone writes to it
• Read-only data never invalidated (extra evict messages)
Directory
Implementation
Sharer-List Representation
• How to keep track of the sharers of a cache line?
• DiriB (i pointers; B = Broadcast):
– Keep up to i exact sharer pointers; beyond i sharers, set the inval-broadcast bit ON
– Expected to do well, since widely shared data is not written often
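A DiriB entry is easy to sketch. The class below is my own illustrative encoding (here with i = 2 pointers), showing the overflow-to-broadcast behavior described above.

```python
# Sketch of a DiriB (limited-pointer + broadcast) sharer list, i = 2 pointers.
class DiriB:
    def __init__(self, i=2):
        self.i, self.ptrs, self.broadcast = i, [], False

    def add_sharer(self, node):
        if self.broadcast or node in self.ptrs:
            return
        if len(self.ptrs) < self.i:
            self.ptrs.append(node)         # room for an exact pointer
        else:
            self.broadcast = True          # overflow: fall back to broadcast
            self.ptrs = []

    def invalidation_targets(self, all_nodes):
        # Broadcast mode must invalidate everyone; otherwise only the pointers
        return list(all_nodes) if self.broadcast else list(self.ptrs)

e = DiriB(i=2)
for n in [3, 7]:
    e.add_sharer(n)
print(e.invalidation_targets(range(8)))   # exact: only nodes 3 and 7
e.add_sharer(5)                            # third sharer overflows the pointers
print(e.invalidation_targets(range(8)))   # broadcast: all 8 nodes
```

The trade-off is visible directly: storage is O(i) per entry instead of one bit per node, at the cost of broadcast invalidations for the (rare) widely shared, written blocks.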
Directory Organization
• Logically, directory has an entry for every block of memory
– How to implement this?
• Dense directory: one dir entry per physical block
– Merge dir controller with mem controller and store dir in RAM
– Older implementations were like this (e.g., SGI Origin)
– Can use ECC bits to avoid adding extra DRAM chips for directory
Drawbacks:
– Shared accesses need to check the directory in RAM
• Slow and power-hungry
• Even when access itself is served by cache-to-cache transactions
– Most memory blocks not cached anywhere → waste of space
– Example: 16 MB cache, 4 GB mem. → 99.6% idle
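The 99.6% figure follows directly from the cache/memory ratio; the short check below works it out (the block size cancels, so it never appears).

```python
# Checking the idle-directory figure above: a dense directory has one entry
# per memory block, but only blocks currently resident in a cache use theirs.
cache_bytes = 16 * 2**20          # 16 MB of cache
mem_bytes   = 4  * 2**30          # 4 GB of memory
# Fraction of directory entries in use = cached blocks / total blocks
#                                      = cache size / memory size (block size cancels)
idle_fraction = 1 - cache_bytes / mem_bytes
print(f"{idle_fraction:.1%}")     # -> 99.6%
```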
Sparse Directories
• Solution: Cache the directory info in fast (on-chip) mem
– Avoids most off-chip directory accesses
• DRAM-backed directory cache
– On a miss should go to the DRAM directory
✗ Still wastes space
✗ Incurs DRAM writes on a dir cache replacement
• Non-DRAM-backed directory cache
– A miss means the line is not cached anywhere
– On a dir cache replacement, should invalidate the cache line corresponding to
the replaced dir entry in the whole system
✓ No DRAM directory access
✗ Extra invalidations due to limited dir space
• Null directory cache
– No directory entries → Similar to Dir0B
– All requests broadcasted to everyone
– Used in AMD and Intel’s inter-socket coherence (HyperTransport and QPI)
✗ Not scalable, but ✓ simple
Sharing and
Cache-Invalidation
Patterns
• Miss/invalidation categories:
– Cold: first access to the block by this processor
– True sharing: another processor wrote a word this processor actually uses
– False sharing: another processor wrote a different word in the same block
How to Improve (false sharing)
• By changing the layout of shared variables in memory, e.g., padding or splitting structures so independently written data lands in different blocks
Examples:
• Migratory pattern
– On a remote read, self invalidate + pass in E state to requester
• Producer-Consumer pattern
– Keep track of prior readers
– Forward data to prior readers upon downgrade
• Last-touch prediction
– Once an access is predicted as last touch, self-invalidate the line
– Makes next processor’s access faster: 3-hop → 2-hop
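The migratory optimization can be summarized as a one-step policy: on a remote read of a line classified as migratory, the owner self-invalidates and the requester receives the line in a writable state, so its upcoming write needs no second transaction. The function below is a hypothetical sketch of that decision, not a full protocol.

```python
# Sketch of the migratory-pattern optimization (names are mine).
# Returns (owner's new state, requester's new state) on a remote read.
def remote_read(owner_state, migratory):
    if migratory:
        # Owner self-invalidates; requester gets the line in E (writable),
        # so the read-modify-write completes with a single transaction.
        return ('I', 'E')
    # Default protocol behavior: both caches end up as sharers.
    return ('S', 'S')

print(remote_read('M', migratory=True))    # ('I', 'E')
print(remote_read('M', migratory=False))   # ('S', 'S')
```

Misclassification is the risk: if the line was not actually migratory, the self-invalidation costs the old owner an extra miss on its next access.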