Memory Coherent
Lecture 17:
“Multicore Cache Coherence”
John P. Shen
October 25, 2017
Prevalence of multicore processors:
▪ 2006: 75% for desktops, 85% for servers
▪ 2007: 90% for desktops and mobiles, 100% for servers
▪ Today: 100% multicore processors with core counts ranging from 2 to 8 cores for
desktops and mobiles, 8+ cores for servers
➢ Recommended Reference:
• “Parallel Computer Organization and Design,” by Michel Dubois,
Murali Annavaram, Per Stenstrom, Chapters 5 and 7, 2012.
Lecture 17:
“Multicore Cache Coherence”
A. Multicore Processors
▪ The Case for Multicores
▪ Programming for Multicores
▪ The Cache Coherence Problem
B. Cache Coherence Protocol Categories
▪ Write Update
▪ Write Invalidate
C. Specific Bus-Based Snoopy Protocols
▪ VI & MI Protocols
▪ MSI, MESI, MOESI Protocols
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 2
The Case for Multicore Processors (MCP)
Stalled Scaling of Single-Core Performance
[Figure: single-core performance trend compared with the expected continuation]
[Figure: multicore processor package; Core 0 … Core 3, each with Regs, L1 d-cache, and L1 i-cache; main memory]
Relative Power
▪ Keep contributions due to microarchitecture design
▪ Normalize to i486™ processor
▪ power = perf^1.74
[Figure: relative power (axis ticks 10, 15, 20) versus Scalar/Latency Performance and Throughput Performance; labeled points include Pentium Pro and Pentium M; a "many cores" region]
❖ Replication of cores results in nearly proportional increases to both throughput performance and …
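The power relation on this slide can be made concrete with a little arithmetic. The sketch below compares two ways of doubling performance: scaling one core, or replicating a core. The exponent 1.74 comes from the slide; the function names are hypothetical, and the replication model assumes power simply adds across cores and that the workload is perfectly parallel, which is an idealization.

```python
# Illustrative arithmetic only: power = perf ** 1.74 is taken from the slide.
EXPONENT = 1.74


def single_core_power(perf: float) -> float:
    """Relative power of one core scaled to relative performance `perf`."""
    return perf ** EXPONENT


def replicated_power(n_cores: int, perf_per_core: float = 1.0) -> float:
    """Relative power of n identical cores (assumption: power adds linearly)."""
    return n_cores * single_core_power(perf_per_core)


big_core = single_core_power(2.0)   # one core with 2x scalar performance
dual_core = replicated_power(2)     # two baseline cores, 2x ideal throughput

print(f"one core at 2x performance: {big_core:.2f}x power")
print(f"two cores at 1x performance: {dual_core:.2f}x power")
```

Under these assumptions the single fast core costs roughly 3.3x the baseline power, while two baseline cores cost 2x, which is the motivation for replication given on the slide.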
The Cache Coherence Problem
[Figure: multi-core chip and main memory; main memory holds x=15213]
The Cache Coherence Problem
Core 1 reads x
[Figure: multi-core chip and main memory; main memory holds x=15213]
The Cache Coherence Problem
Core 2 reads x
[Figure: multi-core chip and main memory; main memory holds x=15213]
The Cache Coherence Problem
Core 1 writes to x, setting it to 21660 (assuming write-through caches)
[Figure: multi-core chip and main memory; main memory holds x=21660]
The Cache Coherence Problem
Core 2 attempts to read x… gets a stale copy
[Figure: multi-core chip and main memory; main memory holds x=21660]
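The four steps above can be replayed in a few lines. This is a minimal sketch, assuming private write-through caches with no coherence mechanism at all; the class name `PrivateCache` and the dictionary used as main memory are hypothetical stand-ins for the hardware in the figures.

```python
class PrivateCache:
    """Private write-through cache with no coherence support (illustrative)."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]             # hit: returns the cached copy

    def write(self, addr, value):
        self.lines[addr] = value            # update own copy
        self.memory[addr] = value           # write-through to main memory


main_memory = {"x": 15213}
cache1 = PrivateCache(main_memory)
cache2 = PrivateCache(main_memory)

cache1.read("x")                 # Core 1 reads x
cache2.read("x")                 # Core 2 reads x
cache1.write("x", 21660)         # Core 1 writes x
stale_value = cache2.read("x")   # Core 2 hits on its old copy

print(main_memory["x"], stale_value)
```

Main memory ends up with the new value, but Core 2 still reads its old cached copy because nothing told its cache that x changed.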
Solutions for Cache Coherence Problem
• This is a general problem with shared memory
multiprocessors and multicores with private caches
• Coherence Solution:
– Use HW to ensure that loads from all cores will return the
value of the latest store to that memory location
– Use metadata to track the state for cached data
• There exist two major categories with many specific
coherence protocols.
[Figure: multi-core chip connected to main memory by an inter-core bus]
Invalidation Protocol with Snooping
• Invalidation:
If a core writes to a data item, all other copies of this
data item in other caches are invalidated
• Snooping:
All cores continuously “snoop” (monitor) the bus
connecting the cores.
[Figure: multi-core chip and main memory; main memory holds x=15213]
Invalidation Based Cache Coherence Protocol
Core 1 writes to x, setting it to 21660, and sends an invalidation request on the inter-core bus (assuming write-through caches). The other cached copy of x is INVALIDATED.
[Figure: multi-core chip, inter-core bus, and main memory; main memory holds x=21660]
Invalidation Based Cache Coherence Protocol
After invalidation:
[Figure: multi-core chip and main memory; main memory holds x=21660]
Invalidation Based Cache Coherence Protocol
Core 2 reads x. Cache misses, and loads the new copy.
[Figure: multi-core chip and main memory; main memory holds x=21660]
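The invalidation walkthrough above can be sketched as a small simulation. It assumes write-through caches, as the slides do, and a bus that delivers every invalidation request to all other caches. The names `InterCoreBus` and `SnoopingCache` are hypothetical, and the model ignores timing and arbitration.

```python
class InterCoreBus:
    """Broadcast medium that every attached cache snoops (illustrative)."""

    def __init__(self):
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def send_invalidate(self, sender, addr):
        for cache in self.caches:
            if cache is not sender:
                cache.snoop_invalidate(addr)


class SnoopingCache:
    """Write-through private cache using write-invalidate (illustrative)."""

    def __init__(self, memory: dict, bus: InterCoreBus):
        self.memory = memory
        self.bus = bus
        self.lines = {}
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:              # miss: load from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.send_invalidate(self, addr)    # invalidate other copies
        self.lines[addr] = value
        self.memory[addr] = value               # write-through

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)              # drop our copy if present


shared_memory = {"x": 15213}
bus = InterCoreBus()
core1 = SnoopingCache(shared_memory, bus)
core2 = SnoopingCache(shared_memory, bus)

core1.read("x")
core2.read("x")
core1.write("x", 21660)                 # Core 2's copy is invalidated
invalidated = "x" not in core2.lines
fresh_value = core2.read("x")           # miss, reloads the new value

print(invalidated, fresh_value)
```

Compared with the incoherent case, the only change is the invalidation request sent before the write, which forces Core 2's next read to miss.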
Update Based Cache Coherence Protocol
Core 1 writes x=21660 and broadcasts the updated value on the inter-core bus (assuming write-through caches).
[Figure: multi-core chip, inter-core bus, and main memory; main memory holds x=21660]
Invalidation vs. Update Protocols
• Multiple writes to the same location
– invalidation: an invalidation request is needed only on the first write
– update: must broadcast each write
(which includes the new variable value)
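To make the traffic difference concrete, the sketch below counts coherence broadcasts for a run of consecutive writes by one core to a location that another core initially has cached. The function name `count_broadcasts` and the sharer-set bookkeeping are illustrative assumptions; the model counts only coherence messages and assumes no other core reads the location between the writes.

```python
def count_broadcasts(protocol: str, n_writes: int) -> int:
    """Count coherence broadcasts for n_writes by core1 to one location."""
    sharers = {"core1", "core2"}    # both cores start with a copy
    broadcasts = 0
    for _ in range(n_writes):
        if protocol == "invalidate":
            if sharers != {"core1"}:
                broadcasts += 1     # invalidation request (no data needed)
                sharers = {"core1"}
            # otherwise core1 already holds the only copy: no bus message
        elif protocol == "update":
            broadcasts += 1         # every write carries the new value
        else:
            raise ValueError(f"unknown protocol: {protocol}")
    return broadcasts


invalidate_msgs = count_broadcasts("invalidate", 5)
update_msgs = count_broadcasts("update", 5)
print(invalidate_msgs, update_msgs)
```

If another core re-reads the location between writes, the invalidation count rises again, so the advantage depends on the sharing pattern.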
Protocol consists of states and actions (state transitions)
Actions can be invoked from …
[Figure: cache controller connecting the bus (bus actions) to the cache; each cache entry holds State, Tags, and Cache Data]
[Figure: MSI state transition diagram showing the cache state in processor P1, with states M, S, and I. Recoverable edge labels: read miss; write miss; read by any processor; P1 reads or writes; P1 intent to write; other processor intent to write; P2 reads; writes back.]
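The state diagram can be written down as a transition table. The sketch below follows the textbook MSI protocol for a single cache line as seen by P1; the event names (`p1_read`, `other_write`, and so on) are hypothetical labels chosen for this example, and the mapping of edge labels to transitions follows the standard protocol rather than anything confirmed by the damaged figure.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    S = "Shared"
    I = "Invalid"


# (current state, event) -> (next state, action taken by P1's cache)
MSI = {
    (State.I, "p1_read"):     (State.S, "read miss"),
    (State.I, "p1_write"):    (State.M, "write miss"),
    (State.I, "other_read"):  (State.I, None),
    (State.I, "other_write"): (State.I, None),
    (State.S, "p1_read"):     (State.S, None),
    (State.S, "p1_write"):    (State.M, "intent to write"),
    (State.S, "other_read"):  (State.S, None),
    (State.S, "other_write"): (State.I, None),
    (State.M, "p1_read"):     (State.M, None),
    (State.M, "p1_write"):    (State.M, None),
    (State.M, "other_read"):  (State.S, "write back"),
    (State.M, "other_write"): (State.I, "write back"),
}


def run(events, state=State.I):
    """Apply events in order; return (event, new state name, action) tuples."""
    trace = []
    for event in events:
        state, action = MSI[(state, event)]
        trace.append((event, state.name, action))
    return trace


msi_trace = run(["p1_read", "p1_write", "other_read", "other_write"])
for step in msi_trace:
    print(step)
```

Keeping the protocol as a table makes it easy to check that every state has an entry for every event, which is where hand-drawn diagrams tend to have gaps.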
Exclusive: <1,0,0,…,1>
Shared: <1,X,X,…,1>
Invalid: <0,X,X,…,X>
Bus/Processor Actions
Same as MSI
0. Initially: <0,0,0,1> I I I
3. T1 read→ CR C0 <1,1,0,1> S S I
In a shared-bus system, T1 can snarf data from the bus during the writeback.
This is called cache-to-cache transfer, dirty miss, or intervention.
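[Table: MESI transitions for current states I and E; column headers missing]
I: Cache Read (if no sharers: →E; if sharers: →S) | Cache Read&M, →M | No Action, →I | No Action, →I | No Action, →I
E: No Action, →E | No Action, →M | No Action, →I | Respond shared; Supply data; →S | Respond shared; Supply data; →I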
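A small simulation can reproduce the MESI trace rows shown earlier (step 0 and step 3). The source shows only those two rows, so the intermediate steps used here (T0 reads, then T0 writes) are an assumption, as is the reading of the vector as one presence bit per cache followed by a memory-valid bit. The function names are hypothetical, and evictions are not modeled.

```python
def mesi_step(states, core, op):
    """Apply a read ('r') or write ('w') by `core`; return the new states."""
    states = list(states)
    others = [i for i in range(len(states)) if i != core]
    if op == "r":
        if states[core] == "I":                    # read miss: Cache Read
            sharers = [i for i in others if states[i] != "I"]
            for i in sharers:                      # M/E/S holders drop to S
                states[i] = "S"
            states[core] = "S" if sharers else "E"
        # read hit in M, E, or S: no action
    elif op == "w":
        if states[core] in ("I", "S"):             # must gain ownership
            for i in others:
                states[i] = "I"
        states[core] = "M"                         # E -> M needs no bus action
    else:
        raise ValueError(f"unknown op: {op}")
    return states


def mesi_vector(states):
    """Assumed encoding: presence bit per cache, then memory-valid bit."""
    presence = [0 if s == "I" else 1 for s in states]
    memory_valid = 0 if "M" in states else 1
    return presence + [memory_valid]


mesi_states = ["I", "I", "I"]
mesi_initial_vector = mesi_vector(mesi_states)
for core, op in [(0, "r"), (0, "w"), (1, "r")]:    # T0 read, T0 write, T1 read
    mesi_states = mesi_step(mesi_states, core, op)
    print(core, op, mesi_vector(mesi_states), mesi_states)
```

With these assumed steps the final row matches the source: vector <1,1,0,1> with local states S S I, because C0 writes the dirty line back when T1 reads it.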
0. Initially: <0,0,0,1> I I I
3. T2 read→ CR C0 <1,0,1,0> O I S
Shared
Invalid
There are also some variations (MOSI and MESI)
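The MOESI trace rows above differ from MESI in one visible way: after the read, C0 is in O and the memory-valid bit is 0. The sketch below reproduces that outcome under the same assumptions as before: the intermediate steps (T0 reads, then T0 writes) are not in the source, the vector is read as presence bits followed by a memory-valid bit, and the function names are hypothetical.

```python
def moesi_step(states, core, op):
    """Apply a read ('r') or write ('w') by `core`; return the new states."""
    states = list(states)
    others = [i for i in range(len(states)) if i != core]
    if op == "r":
        if states[core] == "I":                    # read miss: Cache Read
            holders = [i for i in others if states[i] != "I"]
            for i in holders:
                if states[i] == "M":
                    states[i] = "O"    # keeps the dirty line, supplies data
                elif states[i] == "E":
                    states[i] = "S"
            states[core] = "S" if holders else "E"
    elif op == "w":
        if states[core] not in ("M", "E"):         # other copies may exist
            for i in others:
                states[i] = "I"
        states[core] = "M"
    else:
        raise ValueError(f"unknown op: {op}")
    return states


def moesi_vector(states):
    """Assumed encoding: presence bit per cache, then memory-valid bit."""
    presence = [0 if s == "I" else 1 for s in states]
    memory_valid = 0 if ("M" in states or "O" in states) else 1
    return presence + [memory_valid]


moesi_states = ["I", "I", "I"]
for core, op in [(0, "r"), (0, "w"), (2, "r")]:    # T0 read, T0 write, T2 read
    moesi_states = moesi_step(moesi_states, core, op)
    print(core, op, moesi_vector(moesi_states), moesi_states)
```

The Owned state lets C0 supply the data without writing it back, so memory stays stale and the final row is <1,0,1,0> with local states O I S.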
What happens when 2 cores read/write different words in a cache
line?
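One way to explore this question is to track coherence at cache-line granularity, as the protocols above do. The sketch below counts how many times a copy is invalidated when two cores write different addresses. The 64-byte line size and the names `line_of` and `count_invalidations` are assumptions for illustration, and the model is a plain write-invalidate scheme with no timing.

```python
LINE_SIZE = 64  # bytes; an assumed, commonly used line size


def line_of(addr: int) -> int:
    """Cache line number that a byte address falls into."""
    return addr // LINE_SIZE


def count_invalidations(accesses) -> int:
    """accesses: iterable of (core, op, addr) with op 'r' or 'w'."""
    holders = {}          # line number -> set of cores holding a valid copy
    invalidations = 0
    for core, op, addr in accesses:
        copies = holders.setdefault(line_of(addr), set())
        if op == "w":
            invalidations += len(copies - {core})   # other copies are dropped
            copies.clear()
        copies.add(core)
    return invalidations


# Core 0 writes the word at address 0, core 1 the word at address 8.
same_line = count_invalidations([(0, "w", 0), (1, "w", 8)] * 4)
# Same pattern, but core 1's word is moved onto a different line.
padded = count_invalidations([(0, "w", 0), (1, "w", 64)] * 4)

print(same_line, padded)
```

Even though the two cores never touch the same word, sharing a line makes each write invalidate the other core's copy; placing the words on separate lines removes that traffic in this model.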
Snooping with Multi-level Caches
• Private L2 caches
– If inclusive, snooping traffic is checked at the L2 level first
– Only accesses that refer to data cached in L1 need to be forwarded
– Saves bandwidth at the L1 cache
• Shared L2 or L3 caches
– Can act as serialization points even if there is no bus
– Track the state of each cache line and its list of sharers (bit mask)
– Essentially, the shared cache acts like a coherence directory
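The sharer bit mask can be illustrated with a short sketch. It assumes one bit per core in an integer mask kept per cache line, with reads setting the reader's bit and writes returning the list of cores whose copies must be invalidated. The class name `SharerDirectory` and its methods are hypothetical, and per-line state beyond the mask is left out.

```python
class SharerDirectory:
    """Per-line sharer bit masks kept at a shared cache (illustrative)."""

    def __init__(self, n_cores: int):
        self.n_cores = n_cores
        self.sharers = {}                 # line address -> bit mask

    def read(self, core: int, line: int) -> None:
        """Record that `core` now holds a copy of `line`."""
        self.sharers[line] = self.sharers.get(line, 0) | (1 << core)

    def write(self, core: int, line: int) -> list:
        """Make `core` the only sharer; return the cores to invalidate."""
        mask = self.sharers.get(line, 0)
        victims = [c for c in range(self.n_cores)
                   if c != core and (mask >> c) & 1]
        self.sharers[line] = 1 << core
        return victims


directory = SharerDirectory(n_cores=4)
directory.read(0, 0x40)
directory.read(2, 0x40)
mask_after_reads = directory.sharers[0x40]      # bits 0 and 2 set
to_invalidate = directory.write(2, 0x40)        # core 2 writes the line
mask_after_write = directory.sharers[0x40]

print(bin(mask_after_reads), to_invalidate, bin(mask_after_write))
```

Because the mask names the sharers exactly, invalidations go only to the cores that hold a copy instead of being broadcast to every cache.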