04 Coherence
Cache Coherence
Nima Honarmand
Fall 2015 :: CSE 610 – Parallel Computer Architectures
[Figure: logical view (P1–P4 sharing one memory system) vs. reality, more or less: each processor has a private cache ($) in front of memory]
No-Cache, No-Problem
• Scenario I: processors have no caches; every ld/st goes directly to memory

Processor 0                  Processor 1                  Mem
0: addi r1,accts,r3                                       500
1: ld   0(r3),r4                                          500
2: blt  r4,r2,6
3: sub  r4,r2,r4
4: st   r4,0(r3)                                          400
5: call spew_cash
                             0: addi r1,accts,r3          400
                             1: ld   0(r3),r4
                             2: blt  r4,r2,6
                             3: sub  r4,r2,r4
                             4: st   r4,0(r3)             300
                             5: call spew_cash
Cache Incoherence

Processor 0                  P0 $     P1 $     Mem
0: addi r1,accts,r3                            500
1: ld   0(r3),r4             V:500             500
2: blt  r4,r2,6
3: sub  r4,r2,r4
4: st   r4,0(r3)             D:400             500
5: call spew_cash
                             Processor 1
                             0: addi r1,accts,r3
1: ld   0(r3),r4             D:400    V:500    500
2: blt  r4,r2,6
3: sub  r4,r2,r4             D:400    V:500    500
4: st   r4,0(r3)             D:400    D:400    500
5: call spew_cash
• Scenario II: processors have write-back caches
– Potentially 3 copies of accts[241].bal: memory, P0 $, P1 $
– Can get incoherent (inconsistent)
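The lost update in the trace above can be reproduced with a toy model of two write-back caches. This is a sketch only; the `Cache` class and the 'V'/'D' state encoding are my own, chosen to mirror the slide's annotations.

```python
# Toy model of the write-back incoherence scenario (illustrative only).
# Each cache line holds a state ('V' = valid/clean, 'D' = dirty) and a value.

class Cache:
    def __init__(self, mem):
        self.mem = mem      # shared backing memory: {addr: value}
        self.lines = {}     # addr -> [state, value]

    def load(self, addr):
        if addr not in self.lines:               # miss: fill from memory
            self.lines[addr] = ['V', self.mem[addr]]
        return self.lines[addr][1]

    def store(self, addr, value):
        self.lines[addr] = ['D', value]          # write-back: memory untouched

mem = {'accts[241].bal': 500}
p0, p1 = Cache(mem), Cache(mem)

# P0 withdraws 100: ld, sub, st -> line becomes D:400 in P0's cache
p0.store('accts[241].bal', p0.load('accts[241].bal') - 100)

# P1 withdraws 100: its miss is filled from *memory* (still 500), not P0's cache
p1.store('accts[241].bal', p1.load('accts[241].bal') - 100)

# Result: both caches hold D:400, memory still 500 -- one withdrawal is lost
print(p0.lines, p1.lines, mem)
```

Without a coherence protocol, nothing forces P1's miss to observe P0's dirty line, which is exactly the D:400 / V:500 situation in the table.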
Coherence Protocols
• Hardware-based solutions
– Far more common
• Hybrid solutions
– Combination of hardware/software techniques
– E.g., a block might be under SW coherence first and then
switch to HW coherence
– Or, hardware can track sharers and SW decides when to
invalidate them
– And many other schemes…
Snoopy Protocols
• Assume write-back caches
• Transactions
– GetS(hared), GetM(odified), PutM(odified)
• Messages
– GetS, GetM, PutM, Data (data reply)
[FSM fragment; transitions labeled "event / bus action": LD / -- , ST / BusRdX , Evict / BusWB]
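The transition labels above can be collected into a small MSI-style table. This is a sketch under my own encoding (a dict keyed by state and event); the state and bus-message names follow the diagram fragment, and the snooped-event entries are the usual MSI ones, not taken from the slides.

```python
# Minimal MSI cache-controller transition table (sketch, not a full protocol).
# Key: (state, event) -> (next_state, bus_action); '--' means no bus traffic.
MSI = {
    ('I', 'LD'):     ('S', 'BusRd'),
    ('I', 'ST'):     ('M', 'BusRdX'),
    ('S', 'LD'):     ('S', '--'),
    ('S', 'ST'):     ('M', 'BusRdX'),     # upgrade: invalidate other sharers
    ('M', 'LD'):     ('M', '--'),
    ('M', 'ST'):     ('M', '--'),         # write hit in M: no bus traffic
    ('M', 'Evict'):  ('I', 'BusWB'),      # dirty eviction writes back
    ('S', 'Evict'):  ('I', '--'),
    # Snooped (remote) events:
    ('M', 'OtherBusRd'):  ('S', 'Flush'), # supply data, downgrade
    ('M', 'OtherBusRdX'): ('I', 'Flush'),
    ('S', 'OtherBusRdX'): ('I', '--'),
}

def step(state, event):
    return MSI[(state, event)]

# A line that is read, written twice, then evicted: I -> S -> M -> M -> I
s, msgs = 'I', []
for ev in ['LD', 'ST', 'ST', 'Evict']:
    s, act = step(s, ev)
    msgs.append(act)
print(s, msgs)
```

Note how the second ST produces no bus traffic: once the line is in M, write hits are silent, which is the property MESI then extends to first writes of lines fetched in E.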
• Advantages:
– In E state, no invalidation traffic on write hits
• Cuts down on upgrade traffic for lines that are first read and then written
– Closely approximates traffic on a uniprocessor for sequential programs
– Cache-to-cache transfer can cut down latency in some machines
• Disadvantages:
– Complexity of the mechanism that determines exclusiveness
– Memory needs to wait until sharing status is determined
[Figures: FSM at cache controller and FSM at memory controller; see the “Primer” (Sec 7.3) for the detailed spec]
MOESI Framework
[Sweazey & Smith, ISCA’86]
M - Modified (dirty, only copy)
O - Owned (dirty but shared)
E - Exclusive (clean, only copy)
S - Shared (one of possibly several copies)
I - Invalid
• Variants:
– MSI
– MESI
– MOSI
– MOESI
[Venn diagram: validity = {M, O, E, S}; exclusiveness = {M, E}; ownership = {M, O}]
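The validity/exclusiveness/ownership properties of the five states can be written down as sets and checked mechanically. The set encoding below is my own; the semantics follow the state definitions above.

```python
# MOESI state properties as sets (encoding is mine; semantics from the slide).
VALID     = {'M', 'O', 'E', 'S'}   # line holds usable data
DIRTY     = {'M', 'O'}             # this cache must eventually write back
EXCLUSIVE = {'M', 'E'}             # no other cache holds a copy
OWNED     = {'M', 'O'}             # this cache is responsible for the data

def can_write_silently(state):
    # A write hit needs exclusivity: M is already writable, and E can
    # upgrade to M without any bus/network traffic.
    return state in EXCLUSIVE

for s in ['M', 'O', 'E', 'S', 'I']:
    print(s, 'valid:', s in VALID, 'dirty:', s in DIRTY,
          'silent-write:', can_write_silently(s))
```

The protocol variants (MSI, MESI, MOSI, MOESI) simply pick which subset of these five states they implement.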
Advanced Issues
in Snoopy Coherence
Non-Atomic Interconnect
• (A) entries in the previous tables: situations that cannot
happen because bus is atomic
– i.e., bus is not released until the transaction is complete
– Cannot have multiple on-going requests for the same line
Split-Transaction Buses
• Can overlap req/resp of multiple transactions on the bus
– Need to identify request/response using a tag
[Timeline: Req 1, Req 2, Req 3 issued back-to-back on the request bus; Rep 3, Rep 1, ... returned on the response bus, possibly out of order]
Issues:
• Protocol races become possible
– Protocol races result in more transient states
• Need to buffer requests and responses
– Buffer own reqs: req bus might be busy (taken by someone else)
– Buffer other actors’ reqs: busy processing another req
– Buffer resps: resp bus might be busy (taken by someone else)
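The tag-matching idea behind a split-transaction bus can be sketched in a few lines. This is a hypothetical controller fragment (names `issue`/`reply` are mine), showing only how a tag lets out-of-order replies find their buffered requests.

```python
# Sketch: matching out-of-order replies to requests on a split-transaction bus.
# Each request carries a tag; replies may arrive in any order
# (e.g., Rep 3 before Rep 1, as in the timeline above).
outstanding = {}      # tag -> request description (our buffered requests)
next_tag = 0

def issue(req):
    global next_tag
    tag = next_tag; next_tag += 1
    outstanding[tag] = req          # buffer our own request until reply arrives
    return tag

def reply(tag, data):
    req = outstanding.pop(tag)      # tag identifies which transaction completed
    return (req, data)

t1 = issue('GetS A'); t2 = issue('GetS B'); t3 = issue('GetM C')
# Replies arrive out of order:
done = [reply(t3, 'dataC'), reply(t1, 'dataA')]
print(done, list(outstanding.values()))   # Req 2 still outstanding
```

The `outstanding` table is exactly the request buffer the bullet list calls for; a real controller also bounds its size, which limits how many transactions may be in flight.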
Common solutions (for snooping with multi-level cache hierarchies):
1. Independent snooping at each level
– Basically forwarding each snooped request to higher levels
2. Duplicate L1 tags at L2
3. Maintain cache inclusion
Handling Writebacks
• Allow CPU to proceed on a miss ASAP
– Fetch the requested block
– Do the writeback of the victim later
Directory Coherence
Protocols
• Directory example for block A:
– Initially: A: Shared; sharers = {1}
– Node 2: Write A (miss) → A: Modified; owner = 2
– Node 1: Load A (miss) → A: Shared; sharers = {1, 2}
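That directory-state trace can be stepped through with a minimal directory entry. The `DirEntry` class and its method names are my own sketch; the real protocol actions that are elided (fetching dirty data from the owner, invalidating sharers) are noted in comments.

```python
# Sketch of a directory entry tracking the example trace (encoding is mine).
class DirEntry:
    def __init__(self):
        self.state, self.sharers = 'I', set()

    def get_s(self, node):          # read miss from `node`
        # (real protocol: if state is 'M', first fetch dirty data from the
        # owner and downgrade it; the owner then remains in the sharer set)
        self.state = 'S'
        self.sharers.add(node)

    def get_m(self, node):          # write miss from `node`
        # (real protocol: invalidate current sharers and collect Inv-Acks)
        self.state, self.sharers = 'M', {node}

d = DirEntry()
d.get_s(1)          # A: Shared; sharers = {1}
d.get_m(2)          # node 2 writes A -> A: Modified; owner = 2
d.get_s(1)          # node 1 re-reads -> A: Shared; sharers = {1, 2}
print(d.state, sorted(d.sharers))
```

The key point the trace makes is that the directory, unlike a snoopy bus, always knows exactly which caches to contact.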
More Nomenclature
• Local Node (L)
– Node initiating the transaction we care about
• Home Node (H)
– Node where directory/main memory for the block lives
• Remote Node (R)
– Any other node that participates in the transaction
Transitions: I → S; M or S → I
• I or S → M: requestor collects the Inv-Acks
• Advantages (of notifying the directory on evictions):
– Allows directory to remove cache from sharers list
• Avoids unnecessary invalidate messages in the future
• Helps with providing line in E state in MESI protocol
• Helps with limited-pointer directories (in a few slides)
– Simplifies the protocol
• Disadvantages:
– Notification traffic is unnecessary if block is re-read before
anyone writes to it
• Read-only data never invalidated (extra evict messages)
Directory
Implementation
Sharer-List Representation
• How to keep track of the sharers of a cache line?
• DiriB (i pointers; B = Broadcast):
– Keep up to i exact sharer pointers; beyond i sharers, set the inval-broadcast bit ON
– Expected to do well, since widely shared data is not written often
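A DiriB entry is easy to sketch. The class below is my own illustrative encoding (here with i = 2 pointers), showing the overflow-to-broadcast behavior described above.

```python
# Sketch of a DiriB (limited-pointer + broadcast) sharer list, i = 2 pointers.
class DiriB:
    def __init__(self, i=2):
        self.i, self.ptrs, self.broadcast = i, [], False

    def add_sharer(self, node):
        if self.broadcast or node in self.ptrs:
            return
        if len(self.ptrs) < self.i:
            self.ptrs.append(node)         # room for an exact pointer
        else:
            self.broadcast = True          # overflow: fall back to broadcast
            self.ptrs = []

    def invalidation_targets(self, all_nodes):
        # Broadcast mode must invalidate everyone; otherwise only the pointers
        return list(all_nodes) if self.broadcast else list(self.ptrs)

e = DiriB(i=2)
for n in [3, 7]:
    e.add_sharer(n)
print(e.invalidation_targets(range(8)))   # exact: only nodes 3 and 7
e.add_sharer(5)                            # third sharer overflows the pointers
print(e.invalidation_targets(range(8)))   # broadcast: all 8 nodes
```

The trade-off is visible directly: storage is O(i) per entry instead of one bit per node, at the cost of broadcast invalidations for the (rare) widely shared, written blocks.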
Directory Organization
• Logically, directory has an entry for every block of memory
– How to implement this?
• Dense directory: one dir entry per physical block
– Merge dir controller with mem controller and store dir in RAM
– Older implementations were like this (e.g., SGI Origin)
– Can use ECC bits to avoid adding extra DRAM chips for directory
Drawbacks:
– Shared accesses need to check the directory in RAM
• Slow and power-hungry
• Even when access itself is served by cache-to-cache transactions
– Most memory blocks not cached anywhere → waste of space
– Example: 16 MB cache, 4 GB mem. → 99.6% idle
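The 99.6% figure follows directly from the cache/memory ratio; the short check below works it out (the block size cancels, so it never appears).

```python
# Checking the idle-directory figure above: a dense directory has one entry
# per memory block, but only blocks currently resident in a cache use theirs.
cache_bytes = 16 * 2**20          # 16 MB of cache
mem_bytes   = 4  * 2**30          # 4 GB of memory
# Fraction of directory entries in use = cached blocks / total blocks
#                                      = cache size / memory size (block size cancels)
idle_fraction = 1 - cache_bytes / mem_bytes
print(f"{idle_fraction:.1%}")     # -> 99.6%
```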
Sparse Directories
• Solution: Cache the directory info in fast (on-chip) mem
– Avoids most off-chip directory accesses
• DRAM-backed directory cache
– On a miss should go to the DRAM directory
✗ Still wastes space
✗ Incurs DRAM writes on a dir cache replacement
• Non-DRAM-backed directory cache
– A miss means the line is not cached anywhere
– On a dir cache replacement, should invalidate the cache line corresponding to
the replaced dir entry in the whole system
✓ No DRAM directory access
✗ Extra invalidations due to limited dir space
• Null directory cache
– No directory entries → Similar to Dir0B
– All requests broadcasted to everyone
– Used in AMD and Intel’s inter-socket coherence (HyperTransport and QPI)
✗ Not scalable, but ✓ simple
Sharing and
Cache-Invalidation
Patterns
• Miss/invalidation categories:
– Cold: first access to the block by this processor
– True sharing: another processor wrote a word this processor actually uses
– False sharing: another processor wrote a different word in the same block
How to Improve (false sharing)
• By changing the layout of shared variables in memory, e.g., padding or splitting structures so independently written data lands in different blocks
Examples:
• Migratory pattern
– On a remote read, self invalidate + pass in E state to requester
• Producer-Consumer pattern
– Keep track of prior readers
– Forward data to prior readers upon downgrade
• Last-touch prediction
– Once an access is predicted as last touch, self-invalidate the line
– Makes next processor’s access faster: 3-hop → 2-hop
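The migratory optimization can be summarized as a one-step policy: on a remote read of a line classified as migratory, the owner self-invalidates and the requester receives the line in a writable state, so its upcoming write needs no second transaction. The function below is a hypothetical sketch of that decision, not a full protocol.

```python
# Sketch of the migratory-pattern optimization (names are mine).
# Returns (owner's new state, requester's new state) on a remote read.
def remote_read(owner_state, migratory):
    if migratory:
        # Owner self-invalidates; requester gets the line in E (writable),
        # so the read-modify-write completes with a single transaction.
        return ('I', 'E')
    # Default protocol behavior: both caches end up as sharers.
    return ('S', 'S')

print(remote_read('M', migratory=True))    # ('I', 'E')
print(remote_read('M', migratory=False))   # ('S', 'S')
```

Misclassification is the risk: if the line was not actually migratory, the self-invalidation costs the old owner an extra miss on its next access.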