Coherence
x = 2; /* initially */
y0 eventually ends up = 2
y1 eventually ends up = 6
z1 = ???
[Figure: three processors connected to memory and I/O.]
Approaches to cache coherence
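[Figure: processors P1, P2, ..., Pn, each with a cache ($), connected to memory through a network/bus; in the single-bus case, memory and I/O sit on the same bus.]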
An Example Snoopy Protocol
• Invalidation protocol for write-back caches
• Each block of memory can be:
– Clean in one or more caches and up-to-date in memory, or
– Dirty in exactly one cache, or
– Uncached: not in any caches
• Correspondingly, we record the state of each block in a cache as:
– Invalid : block contains no valid data,
– Shared : a clean block (can be shared by other caches), or
– Exclusive/Modified: a dirty block (cannot be in any other cache)
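The three cache states above can be read as a small state machine. The following is a minimal Python sketch of one way to write it down, assuming a simplified bus that carries only read requests and write requests. The function names `on_processor` and `on_bus` and the event strings are illustrative and do not come from the slides.

```python
# States of a block in one cache, as named above.
INVALID, SHARED, EXCLUSIVE = "invalid", "shared", "exclusive"

def on_processor(state, op):
    """Local access: return (new_state, bus_message or None)."""
    if op == "read":
        if state == INVALID:
            return SHARED, "read request"      # read miss: fetch a clean copy
        return state, None                     # read hit: no bus traffic
    if op == "write":
        if state == EXCLUSIVE:
            return EXCLUSIVE, None             # write hit on a dirty block
        # Assumption: a write to an invalid or shared block is posted as a
        # write request so that other caches invalidate their copies.
        return EXCLUSIVE, "write request"
    raise ValueError(op)

def on_bus(state, message):
    """Snooped message from another cache: return (new_state, needs_write_back)."""
    if state == INVALID:
        return INVALID, False
    if message == "read request":
        # A dirty copy must be written back; afterwards it is clean.
        return SHARED, state == EXCLUSIVE
    if message == "write request":
        return INVALID, state == EXCLUSIVE
    raise ValueError(message)
```

For instance, `on_bus(EXCLUSIVE, "read request")` moves the block to the shared state and signals that a write back is needed.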
Example
• Assumes that blocks B1 and B2 map to same cache location L.
• Initially neither B1 nor B2 is cached
• Block size = one word
Example (cont.)
Describe the messages that show up on the bus after each event
Message types: read request, write request, write back
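The exercise can be explored with a small simulation. The sketch below assumes two caches, each with a single location L that holds either B1 or B2, and uses the three message types listed above. The event sequence is hypothetical (it is not the one from the original example), and the `Cache` class and `access` function are illustrative names.

```python
class Cache:
    """One-entry cache: location L holds at most one block (B1 or B2)."""
    def __init__(self):
        self.block, self.state = None, "invalid"

def access(caches, p, op, block, bus):
    """Processor p performs op ('read' or 'write') on block; append bus messages."""
    me = caches[p]
    hit = me.block == block and me.state != "invalid"
    if hit and (op == "read" or me.state == "exclusive"):
        return                                   # hit that needs no bus traffic
    # Replacing a dirty block that occupies L requires a write back first.
    if not hit and me.state == "exclusive":
        bus.append(("write back", me.block))
    bus.append((f"{op} request", block))
    # The other caches snoop the request.
    for q, other in enumerate(caches):
        if q == p or other.block != block or other.state == "invalid":
            continue
        if other.state == "exclusive":
            bus.append(("write back", block))    # dirty copy updates memory
        other.state = "shared" if op == "read" else "invalid"
    me.block = block
    me.state = "shared" if op == "read" else "exclusive"

caches, bus = [Cache(), Cache()], []
# Hypothetical events: (processor, operation, block)
events = [(0, "write", "B1"), (1, "read", "B1"), (0, "write", "B1"), (0, "read", "B2")]
for p, op, block in events:
    access(caches, p, op, block, bus)
for message in bus:
    print(message)
```

The last event shows the effect of B1 and B2 mapping to the same location: reading B2 forces the dirty copy of B1 to be written back before the read request is issued.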
Hierarchical Snooping
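[Figure: four processors (P), each with a private L1 cache; each pair of L1 caches shares a bus B1 to an L2 cache, and the two L2 caches share a bus B2.]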
Directory-based Coherence
• Assumes a shared memory space which is physically distributed
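[Figure: nodes P0, P1, ..., Pn, each with a cache ($) and a local memory; the memories together form the shared memory, and the nodes are connected by a network.]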
• Idea: Implement a “directory” that keeps track of where each copy of a block is
cached and its state in each cache (note that with snooping, the state of a block was
kept only in the cache).
• Processors must consult directory before loading blocks from memory to cache
• When a block in memory is updated (written), the directory is consulted to either
update or invalidate other cached copies.
• Eliminates the overhead of broadcasting/snooping (bus bandwidth), and hence scales
up to numbers of processors that would saturate a single bus.
• But has higher latency than snooping
Directory-based Coherence (cont.)
• The memory and the directory can be centralized
[Figure: nodes P0, P1, ..., Pn, each with a cache ($), connected by a network to a single memory (Mem) and directory (Dir).]
• Or distributed
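[Figure: nodes P0, P1, ..., Pn, each with a cache ($), a local memory (Mem) and a local directory (Dir); the memories together form the shared memory, and the nodes are connected by a network.]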
• Alternatively, the memory may be distributed but the directory can be centralized.
• Or the memory may be centralized but the directory can be distributed (as we will
discuss in the case of CMP with private L2 caches)
[Figure: nodes P0, P1, ..., Pn, each with a cache ($), connected by a network.]
• As in snooping caches, the state of every block in every cache is tracked in that
cache (exclusive/dirty, shared/clean, invalid). This avoids the need for write
through and unnecessary write backs.
• In addition, with each block in memory, a directory entry keeps track of where
the block is cached. Accordingly, a block can be in one of the following states:
• Uncached: no processor has it (not valid in any cache)
• Shared/clean: cached in one or more processors and memory is up-to-date
• Exclusive/dirty: one processor (owner) has data; memory out-of-date
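A directory entry as described above can be pictured as a state plus a set of caching nodes. The following is a minimal Python sketch under that reading. The `DirectoryEntry` class, its field names, and the `check` method are illustrative assumptions, not a definition taken from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    """Per-memory-block directory entry: a state plus the set of caching nodes."""
    state: str = "uncached"                 # "uncached", "shared" or "exclusive"
    sharers: set = field(default_factory=set)

    def check(self):
        """Consistency condition implied by the three states listed above."""
        if self.state == "uncached":
            return len(self.sharers) == 0   # not valid in any cache
        if self.state == "exclusive":
            return len(self.sharers) == 1   # exactly one owner
        return len(self.sharers) >= 1       # shared: one or more clean copies

# Hypothetical block name "X", cached clean by nodes 0 and 2.
directory = {"X": DirectoryEntry()}
entry = directory["X"]
entry.state, entry.sharers = "shared", {0, 2}
print(entry, entry.check())
```

Real directories often store the sharer set as a bit vector with one bit per node; a Python set is used here only for readability.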
Enforcing coherence
• Coherence is enforced by exchanging messages between nodes
• Three types of nodes may be involved
• Local requestor node (L): the node that reads or writes the cache block
• Home node (H): the node that stores the block in its memory -- may
be the same as L
• Remote nodes (R): other nodes that have a cached copy of the
requested block.
(b) Read miss to a block that is exclusive (in another cache)
-- L sends request to H
-- H informs L about the block owner, R
-- L requests the block from R
-- R sends the block to L
-- L and R set the state of the block to “shared”
-- R informs H that it should change the state of the block to “shared”
[Figure: messages among L, home node H, and R: (1) request to home node, (2) return owner, (3) request to owner, (4) return data, (4) revise entry.]
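The message sequence for a read miss to an exclusive block can be traced in code. This is a minimal Python sketch, assuming the directory entry is a dictionary with `state` and `sharers` keys and that each cache is a dictionary from block name to state. The function name `read_miss_exclusive` and these data layouts are hypothetical; how memory at H is brought up to date is not modeled.

```python
def read_miss_exclusive(L, H, directory, caches, block):
    """Read miss by node L to `block`, which is exclusive in a remote cache R."""
    entry = directory[block]                  # directory entry kept at home node H
    assert entry["state"] == "exclusive"
    (R,) = entry["sharers"]                   # the single owner
    msgs = [
        (L, H, "request"),                    # L sends request to H
        (H, L, "return owner"),               # H informs L about the owner R
        (L, R, "request to owner"),           # L requests the block from R
        (R, L, "return data"),                # R sends the block to L
        (R, H, "revise entry"),               # R tells H to mark the block shared
    ]
    caches[L][block] = "shared"
    caches[R][block] = "shared"
    entry["state"] = "shared"
    # Assumption: the revised entry records both L and R as sharers.
    entry["sharers"] = {L, R}
    return msgs

# Hypothetical setup: node 1 is the home of X, node 2 owns it, node 0 reads it.
directory = {"X": {"state": "exclusive", "sharers": {2}}}
caches = {0: {}, 1: {}, 2: {"X": "exclusive"}}
msgs = read_miss_exclusive(0, 1, directory, caches, "X")
for src, dst, what in msgs:
    print(f"node {src} -> node {dst}: {what}")
```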
What happens on a write miss?
(when block is invalid in local cache)
(a) Write miss to an uncached block
-- similar to a read miss to an uncached block except that the state of the block
is set to “exclusive”
(b) Write miss to a block that is exclusive in another cache
-- similar to a read miss to an exclusive block except that the state of the block
is set to “exclusive”
(c) Write miss to a shared block
-- L sends request to H
-- H sets the state to “exclusive”
-- H sends the block to L
-- H sends to L the list of other sharers
-- L sets the block’s state to “exclusive”
-- L sends invalidating messages to each sharer (R)
-- R sets the block’s state to “invalid”
[Figure: messages among L, home node H, and sharers R: (1) request to home node, (2) return sharers and data, (3) invalidate, (4) ack.]
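The write miss to a shared block can be traced the same way. The sketch below is a minimal Python illustration, assuming the directory entry is a dictionary with `state` and `sharers` keys and each cache is a dictionary from block name to state. The function name `write_miss_shared` and the node numbering are hypothetical.

```python
def write_miss_shared(L, H, directory, caches, block):
    """Write miss by node L to `block`, which is shared in other caches."""
    entry = directory[block]                  # directory entry kept at home node H
    assert entry["state"] == "shared"
    others = sorted(entry["sharers"] - {L})   # sharers that must be invalidated
    msgs = [
        (L, H, "request"),                    # L sends request to H
        (H, L, "return sharers and data"),    # H sends the block and the sharer list
    ]
    entry["state"], entry["sharers"] = "exclusive", {L}
    caches[L][block] = "exclusive"
    for R in others:
        msgs.append((L, R, "invalidate"))     # L invalidates each sharer
        caches[R][block] = "invalid"
        msgs.append((R, L, "ack"))            # each sharer acknowledges
    return msgs

# Hypothetical setup: node 3 is the home of X, nodes 1 and 2 share it, node 0 writes it.
directory = {"X": {"state": "shared", "sharers": {1, 2}}}
caches = {0: {}, 1: {"X": "shared"}, 2: {"X": "shared"}, 3: {}}
msgs = write_miss_shared(0, 3, directory, caches, "X")
for src, dst, what in msgs:
    print(f"node {src} -> node {dst}: {what}")
```

The number of invalidate/ack pairs grows with the number of sharers, which is one source of the extra latency of directory protocols.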
Directory-based coherence - example
Case 1:
X is in the uncached (U) state in home directory
[Figure: nodes Pi, Pj and Pk, each with a cache ($$) and a directory (dir); the home directory of X holds state U.]
Possible scenario:
• Pj reads X
• Then Pj writes to X
[Figure: X is dirty (d) in the cache of Pj. At each node, the cache ($$) keeps track of the state of cached blocks, and the directory (dir) keeps track of where X is cached.]
Directory-based coherence - example
Case 3:
X is exclusive (E) in home directory
and owned by Pj (dirty, d, in Pj )
[Figure: the home directory of X holds state E; X is dirty (d) in the cache of Pj. The cache ($$) keeps track of the state of cached blocks, and the directory (dir) keeps track of where X is cached.]
[Figure: X is clean (c) in the cache of Pj.]
The MESI protocol
• As described earlier, a block can have three states (specified in the directory)
• Invalid or Uncached : no processor has it (not valid in any cache)
• Shared/clean: cached in one or more processors and memory is up-to-date
• Modified/dirty: one processor (owner) has data; memory out-of-date
• If MESI is implemented using a directory, then the information kept for each block in
the directory is the same as in the three-state protocol:
• Shared in MESI = shared/clean but more than one sharer
• Exclusive in MESI = shared/clean but only one sharer
• Modified in MESI = Exclusive/Modified/dirty
• However, at each cached copy, a distinction is made between shared and exclusive.
The MESI protocol (cont.)
• If MESI is implemented as a snooping protocol, then the main advantage over
the three state protocol is when a read to an uncached block is followed by a
write to that block.
• After the uncached block is read, it is marked “exclusive”
• Note that, when writing to a shared block, the transaction has to be posted
on the bus so that other sharers invalidate their copies.
• But when writing to an exclusive block, there is no need to post the
transaction on the bus
• Hence, by distinguishing between shared and exclusive states, we can
avoid bus transactions when writing on an exclusive block.
• However, now a cache that has an “exclusive” block has to monitor the bus
for any read to that block. Such a read will change the state to “shared”.
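The saving described above can be checked by counting bus transactions. The following is a minimal Python sketch, assuming a single cache accessing one block that no other cache holds (the read-then-write case). The function name `bus_transactions` and the one-letter state codes are illustrative.

```python
def bus_transactions(protocol, ops):
    """Count bus transactions for one cache accessing one block.

    Assumes no other cache holds the block, so a read miss under MESI
    leaves the block in the exclusive state.
    """
    state, count = "I", 0
    for op in ops:
        if op == "read":
            if state == "I":
                count += 1                       # read miss goes on the bus
                state = "E" if protocol == "MESI" else "S"
        elif op == "write":
            if state in ("I", "S"):
                count += 1                       # posted so sharers can invalidate
            state = "M"                          # E -> M needs no bus transaction
        else:
            raise ValueError(op)
    return count

for protocol in ("three-state", "MESI"):
    print(protocol, bus_transactions(protocol, ["read", "write"]))
```

Under these assumptions, a read followed by a write costs two bus transactions in the three-state protocol and one in MESI.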
[Figure: two configurations of processors P0 and P1, each with L1 and L2 caches; in the second configuration each L2 has a directory (dir).]
Cache organization in multicore systems
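Shared L2 systems
[Figure: four processors (P), each with a private L1 cache, sharing a single L2 cache in front of the memory controller.]
• Examples: Intel Core Duo Pentium
Private L2 systems
[Figure: four processors (P), each with a private L1 and a private L2 cache, connected through the system interconnect to the memory controller.]
• Examples: AMD Dual Core Opteron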
The BlueGene (a supercomputer)
http://en.wikipedia.org/wiki/Blue_Gene