ACA Lecture 29 Cache-Coherence 2
Symmetric Multiprocessors
(SMPs)
● SMPs are a popular shared memory
multiprocessor architecture:
– Processors share Memory and I/O
– Bus based: access time for all memory locations is
equal --- “Symmetric MP”
[Figure: four processors (P) connected by a shared bus to memory and I/O]
Different SMP
Organizations
● Processor and cache on separate
extension boards (1980s):
– Plugged on to the backplane.
● Integrated on the main board (1990s):
– 4 or 6 processors placed per board.
● Integrated on the same chip (multi-core)
(2000s):
– Dual core (IBM, Intel, AMD)
– Quad core
Pros of SMPs
● Ease of programming:
– Especially
when communication
patterns are complex or vary
dynamically during execution.
8
Cons of SMPs
● As the number of processors increases,
contention for the bus increases.
– Scalability of the SMP model restricted.
– One way out may be to use switches
(crossbar, multistage networks, etc.)
instead of a bus.
– Switches set up parallel point-to-point
connections.
– Switches, however, have their own disadvantages: they make implementing cache coherence more difficult.
Why Multicores?
● Recall the constraints on further increases in circuit complexity:
– Clock skew and temperature.
● Use of more complex techniques to
improve single-thread performance is
limited.
● Any additional transistors have to be
used in a different core.
Cache Organizations for
Multicores
● L1 caches are always private to a core
● L2 caches can be private or shared
– which is better?
[Figure: two organizations for four cores P1–P4, each with a private L1 — left: a private L2 per core; right: a single L2 shared by all cores]
L2 Organizations
● Advantages of a shared L2 cache:
– Efficient dynamic use of space by each core
– Data shared by multiple cores is not
replicated.
– Every block has a fixed “home” – hence, easy
to find the latest copy.
● Advantages of a private L2 cache:
– Quick access to private L2
– Private bus to private L2, less contention.
An Important Problem with
Shared-Memory: Coherence
● When shared data are cached:
– These are replicated in multiple caches.
– The data in the caches of different processors may become inconsistent.
● How to enforce cache coherency?
– How does a processor know changes in
the caches of other processors?
The Cache Coherency
Problem
[Figure: memory initially holds U = 5; (1) P1 and P3 read U into their caches; (2) P3 writes U = 7 in its own cache; (3) P1's cached copy and memory still hold the old value, so subsequent reads of U by P1 and P2 return stale data]
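The inconsistency shown in the figure can be reproduced with a toy model (illustrative Python, not any real simulator API): two write-back caches copy U, one processor writes locally, and the others keep seeing the stale value.

```python
# Toy model of the scenario in the figure: memory holds U = 5,
# P1 and P3 cache it, then P3 writes U = 7 into a write-back cache.
# Names (Cache, memory) are illustrative, not from any real API.
memory = {"U": 5}

class Cache:
    def __init__(self):
        self.blocks = {}
    def read(self, addr):
        if addr not in self.blocks:       # miss: fetch from memory
            self.blocks[addr] = memory[addr]
        return self.blocks[addr]
    def write(self, addr, value):
        self.blocks[addr] = value         # write-back: memory not updated yet

p1, p2, p3 = Cache(), Cache(), Cache()
p1.read("U")        # P1 caches U = 5
p3.read("U")        # P3 caches U = 5
p3.write("U", 7)    # P3 updates only its own cache

# Without a coherence protocol the copies now disagree:
print(p3.read("U"))  # 7  (P3's cache)
print(p1.read("U"))  # 5  (stale copy in P1's cache)
print(p2.read("U"))  # 5  (fetched from stale memory)
```

This is exactly the hazard the rest of the lecture addresses: nothing tells P1 or memory that P3 changed U.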
Cache Coherence Solutions
(Protocols)
● The key to maintain cache coherence:
– Track the state of sharing of every
data block.
● Based on this idea, following can be
an overall solution:
– Dynamically recognize any potential
inconsistency at run-time and carry out
preventive action.
Basic Idea Behind Cache
Coherency Protocols
[Figure: processors with private caches on a shared interconnect; the protocol tracks the sharing state of each cached block]
Pros and Cons of the
Solution
● Pro:
– Consistency maintenance becomes transparent to programmers, compilers, as well as to the operating system.
● Con:
– Increased hardware complexity.
Two Important Cache
Coherency Protocols
● Snooping protocol:
– Each cache “snoops” the bus to find out
which data is being used by whom.
● Directory-based protocol:
– Keep track of the sharing state of each
data block using a directory.
– A directory is a centralized record of the sharing state of every memory block.
– Allows coherency protocol to avoid
broadcasts.
Snoopy and Directory-
Based Protocols
[Figure: processors and caches attached to a shared bus]
Snooping vs. Directory-
based Protocols
● Snooping protocol reduces memory
traffic.
– More efficient.
● Snooping protocol requires broadcasts:
– Can meaningfully be implemented only when
there is a shared bus.
– Even when there is a shared bus, scalability
is a problem.
– Some workarounds have been tried: the Sun Enterprise server has up to 4 buses.
Snooping Protocol
● As soon as a request for any data block
by a processor is put out on the bus:
– Other processors “snoop” to check if they
have a copy and respond accordingly.
● Works well with bus interconnection:
– All transmissions on a bus are essentially broadcasts:
● Snooping is therefore effortless.
– Dominates almost all small-scale machines.
Categories of Snoopy
Protocols
● Essentially two types:
– Write Invalidate Protocol
– Write Broadcast Protocol
● Write invalidate protocol:
– When one processor writes to its cache, all
other processors having a copy of that
data block invalidate that block.
● Write broadcast:
– When one processor writes to its cache, all other processors having a copy of that data block update that block with the recently written value.
Write Invalidate Vs.
Write Update Protocols
[Figure: bus transactions under write-invalidate vs. write-update]
Write Invalidate Protocol
● Handling a write to shared data:
– An invalidate command is sent on bus ---
all caches snoop and invalidate any copies
they have.
● Handling a read miss:
– Write-through: memory is always up-to-date.
– Write-back: snooping finds the most recent copy.
Write Invalidate in Write
Through Caches
● Simple implementation.
● Writes:
– Write to shared data: broadcast on bus; processors snoop and invalidate any copies.
– Read miss: memory is always up-to-date.
● Concurrent writes:
– Write serialization automatically achieved
since bus serializes requests.
– Bus provides the basic arbitration support.
Implementation of the
Snooping Protocol
● A cache controller at every processor
would implement the protocol:
– Has to perform specific actions:
● When the local processor issues certain requests.
● Also when certain addresses appear on the bus.
– The exact actions of the cache controller depend on the state of the cache block.
– Two FSMs can show the different types of actions to be performed by a controller.
Snoopy-Cache State Machine-I
● State machine considering only CPU requests, for each cache block. States: Invalid, Shared (read only), Exclusive (read/write).
– Invalid → Shared: CPU read; place read miss on bus.
– Invalid → Exclusive: CPU write; place write miss on bus.
– Shared → Shared: CPU read hit.
– Shared → Exclusive: CPU write; place write miss on bus.
– Exclusive → Exclusive: CPU read hit or CPU write hit.
– Exclusive → Exclusive: CPU write miss; write back the cache block, place write miss on bus.
Snoopy-Cache State Machine-II
● State machine considering only bus requests, for each cache block.
– Shared → Invalid: write miss for this block.
– Exclusive → Invalid: write miss for this block; write back the block (abort memory access).
– Exclusive → Shared: read miss for this block; write back the block (abort memory access).
Combined Snoopy-Cache State Machine
● State machine considering both CPU requests and bus requests, for each cache block.
– Invalid → Shared: CPU read; place read miss on bus.
– Invalid → Exclusive: CPU write; place write miss on bus.
– Shared → Shared: CPU read hit; or CPU read miss (place read miss on bus).
– Shared → Exclusive: CPU write; place write miss on bus.
– Shared → Invalid: write miss for this block (on bus).
– Exclusive → Exclusive: CPU read hit or CPU write hit; or CPU write miss (write back cache block, place write miss on bus).
– Exclusive → Shared: CPU read miss (write back block, place read miss on bus); or read miss for this block on the bus (write back block, abort memory access).
– Exclusive → Invalid: write miss for this block on the bus (write back block, abort memory access).
Directory-based Solution
● In NUMA computers:
– Messages have long latency.
– Also, broadcast is inefficient --- all
messages have explicit responses.
● The main memory controller keeps track of:
– Which processors have cached copies of which memory locations.
● On a write,
– Only need to inform users, not everyone
● On a dirty read,
– Forward to owner
Directory Protocol
● Three states as in Snoopy Protocol
– Shared: 1 or more processors have data,
memory is up-to-date.
– Uncached: No processor has the block.
– Exclusive: 1 processor (owner) has the block.
● In addition to cache state,
– Must track which processors have data when
in the shared state.
– Usually implemented using a bit vector: one bit per processor, set if that processor has a copy.
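The bit-vector representation of the sharer set can be sketched as follows (NUM_PROCS and the helper names are illustrative, not from any real implementation):

```python
# Hypothetical 8-processor system: the sharer set for one memory block
# is a bit vector, one bit per processor.
NUM_PROCS = 8

def add_sharer(vec, p):
    """Set processor p's bit when it caches the block."""
    return vec | (1 << p)

def remove_sharer(vec, p):
    """Clear processor p's bit when its copy is invalidated."""
    return vec & ~(1 << p)

def sharers(vec):
    """List the processors currently holding a copy."""
    return [p for p in range(NUM_PROCS) if vec & (1 << p)]

vec = 0
vec = add_sharer(vec, 1)
vec = add_sharer(vec, 5)
assert sharers(vec) == [1, 5]
```

For the exclusive state the vector holds exactly one set bit, identifying the owner.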
Directory Behavior
● On a read:
– Uncached:
● give (exclusive) copy to requester
● record owner
– Exclusive or shared:
● send share message to current exclusive
owner
● record owner
● return value
– Exclusive dirty:
● forward read request to exclusive owner.
Directory Behavior
● On Write
– Send invalidate messages to all
hosts caching values.
● On Write-Thru/Write-back
– Update value.
CPU-Cache State Machine
● State machine for CPU requests, for each memory block. States: Invalid (uncached), Shared (read only), Exclusive (read/write).
● A block is in the Invalid state if it resides only in memory.
– Invalid → Shared: CPU read; send Read Miss message to home directory.
– Invalid → Exclusive: CPU write; send Write Miss message to home directory.
– Shared → Shared: CPU read hit.
– Shared → Invalid: Invalidate, or miss due to address conflict.
– Shared → Exclusive: CPU write; send Write Miss message to home directory.
– Exclusive → Exclusive: CPU read hit or CPU write hit.
– Exclusive → Shared: Fetch; send Data Write Back message to home directory.
– Exclusive → Invalid: Fetch/Invalidate, or miss due to address conflict; send Data Write Back message to home directory.
State Transition Diagram
for the Directory
● Tracks all copies of memory block.
● Same states as the transition diagram
for an individual cache.
● Memory controller actions:
– Update of directory state.
– Send messages to satisfy requests.
– Some transitions also update the sharing set, Sharers, as well as sending a message.
Directory State Machine
● State machine for directory requests, for each memory block. States: Uncached, Shared (read only), Exclusive (read/write).
● A block is in the Uncached state if it resides only in memory.
– Uncached → Shared: read miss; Sharers = {P}; send Data Value Reply.
– Uncached → Exclusive: write miss; Sharers = {P}; send Data Value Reply msg.
– Shared → Shared: read miss; Sharers += {P}; send Data Value Reply.
– Shared → Exclusive: write miss; send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply msg.
– Exclusive → Uncached: Data Write Back; Sharers = {} (write back block).
– Exclusive → Shared: read miss; Sharers += {P}; send Fetch; send Data Value Reply msg to remote cache.
– Exclusive → Exclusive: write miss; Sharers = {P}; send Fetch/Invalidate; send Data Value Reply msg to remote cache (write back block).
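As a sketch, the directory transitions above can be folded into a small controller class. DirectoryEntry and the message names are illustrative, following the slides rather than any real machine.

```python
# Minimal per-block directory controller for the three-state protocol.
class DirectoryEntry:
    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()

    def read_miss(self, p):
        msgs = []
        if self.state == "Exclusive":          # fetch block from the owner
            owner = next(iter(self.sharers))
            msgs.append(("fetch", owner))      # owner writes block back
        self.sharers.add(p)
        self.state = "Shared"
        msgs.append(("data value reply", p))
        return msgs

    def write_miss(self, p):
        msgs = []
        if self.state == "Shared":             # invalidate every sharer
            msgs += [("invalidate", s) for s in self.sharers]
        elif self.state == "Exclusive":        # old owner writes back + invalidates
            owner = next(iter(self.sharers))
            msgs.append(("fetch/invalidate", owner))
        self.sharers = {p}
        self.state = "Exclusive"
        msgs.append(("data value reply", p))
        return msgs

d = DirectoryEntry()
d.read_miss(0); d.read_miss(1)
assert d.state == "Shared" and d.sharers == {0, 1}
msgs = d.write_miss(2)
assert d.state == "Exclusive" and d.sharers == {2}
assert ("invalidate", 0) in msgs and ("invalidate", 1) in msgs
```

Because every message names a specific processor, the directory avoids the bus broadcasts that limit snooping protocols.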