
18-600 Foundations of Computer Systems

Lecture 17:
“Multicore Cache Coherence”
John P. Shen
October 25, 2017

Prevalence of multicore processors:
▪ 2006: 75% for desktops, 85% for servers
▪ 2007: 90% for desktops and mobiles, 100% for servers
▪ Today: 100% multicore, with core counts ranging from 2 to 8 cores for desktops and mobiles, and 8+ cores for servers

➢ Recommended Reference:
• “Parallel Computer Organization and Design,” by Michel Dubois,
Murali Annavaram, Per Stenstrom, Chapters 5 and 7, 2012.

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 1


18-600 Foundations of Computer Systems

Lecture 17:
“Multicore Cache Coherence”
A. Multicore Processors
▪ The Case for Multicores
▪ Programming for Multicores
▪ The Cache Coherence Problem
B. Cache Coherence Protocol Categories
▪ Write Update
▪ Write Invalidate
C. Specific Bus-Based Snoopy Protocols
▪ VI & MI Protocols
▪ MSI, MESI, MOESI Protocols
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 2
The Case for Multicore Processors (MCP)
 Stalled Scaling of Single-Core Performance
 Expected Continuation of Moore’s Law
 Throughput Performance for Server Workloads

[Figure: Multicore Processor package — Core 0 … Core 3, each with its own registers, private L1 d-cache and i-cache, and a private unified L2 cache; an L3 unified cache shared by all cores; main memory below.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 3


Processor Scaling Until ~2004

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 4


Processor Development Until ~2004
 Moore’s Law: transistor count doubles every 18 months
 Used to improve processor performance by 2x every 18 months
 Single core, binary compatible to previous generations

 Contributors to performance improvements


 More ILP through OOO superscalar techniques
 Wider issue, better branch prediction, better instruction scheduling, …

 Better memory hierarchies, faster and larger

 Clock frequency improvements with deeper pipelines

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 5


Problems with Single Core Performance
 Moore’s Law is still doing well (for the foreseeable future…)
 The Power Wall
 Power ≈ CL * Vdd² * Freq (dynamic power; see the numeric sketch below)
 Cannot scale transistor count and frequency without reducing Vdd
 Unfortunately, voltage scaling has essentially stalled
 The Complexity Wall
 Designing and verifying increasingly large OOO cores is very expensive
 100s of engineers for 3-5 years

 Caches are easier to design but can only help so much…
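As a quick numeric illustration (not from the slides; the scaling factors are made up), a minimal C sketch of the dynamic-power relation above:

/* Illustrative only: relative dynamic power P ~ C_L * Vdd^2 * f.
 * The scaling factors below are hypothetical, not measured data. */
#include <stdio.h>

static double rel_power(double cap, double vdd, double freq) {
    return cap * vdd * vdd * freq;              /* P ≈ C_L * Vdd^2 * f */
}

int main(void) {
    double base = rel_power(1.0, 1.0, 1.0);
    /* Doubling frequency at constant Vdd doubles dynamic power. */
    printf("2x freq, same Vdd: %.2fx power\n", rel_power(1.0, 1.0, 2.0) / base);
    /* Doubling frequency while also scaling Vdd down by 30% roughly breaks even... */
    printf("2x freq, 0.7x Vdd: %.2fx power\n", rel_power(1.0, 0.7, 2.0) / base);
    /* ...which is why stalled voltage scaling makes further frequency scaling costly. */
    return 0;
}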

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 6


[Ed Grochowski, 2004]

Power & Latency (Single-Thread) Performance


❖ For comparison
  ▪ Factor out contributions due to process technology
  ▪ Keep contributions due to microarchitecture design
  ▪ Normalize to i486™ processor
❖ Relative to the i486™, the Pentium® 4 (Wmt) processor is
  ▪ 6x faster (2x IPC at 3x frequency)
  ▪ 23x higher power
  ▪ Spending 4 units of power for every 1 unit of scalar performance

[Figure: relative power vs. relative performance (normalized to i486) for i486, Pentium, Pentium Pro, Pentium 4 (Wmt), and Pentium 4 (Psc); the trend follows power = perf^1.74.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 7


From ILP to TLP
 So far, we run single process, single thread
 Extracting ILP from sequential instruction stream

 Single-thread performance can't scale indefinitely!


 Limited ILP within each thread
 Power consumption & complexity of superscalar cores

 We will now pursue Thread-Level Parallelism (TLP)


 To increase utilization and tolerate latency in single core
 To exploit multiple cores
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 8
Thread-Level Parallelism
 Instruction-level parallelism (ILP)
 Reaps performance by finding independent work in a single thread
 Thread-level parallelism (TLP)
 Reaps performance by finding independent work across multiple threads
 Historically, requires explicitly parallel workloads
 Originated with mainframe time-sharing workloads
 Even then, CPU speed >> I/O speed
 Had to overlap I/O latency with “something else” for the CPU to do
 Hence, operating system would schedule other tasks/processes/threads
that were “time-sharing” the CPU

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 9


Thread-Level Parallelism

 Multithreading reduces the effectiveness of temporal and spatial locality (concurrent threads compete for the same cache capacity)


10/25/2017 (© J.P. Shen) 18-600 Lecture #17 10
Thread-Level Parallelism
 Initially motivated by time-sharing of single CPU
 OS, applications written to be multithreaded
 Quickly led to adoption of multiple CPUs in a single system
 Enabled scalable product line from entry-level single-CPU systems to high-end
multiple-CPU systems
 Same applications, OS, run seamlessly
 Adding CPUs increases throughput (performance)
 More recently:
 Multiple threads per processor core
 Coarse-grained multithreading (aka “switch-on-event”)

 Simultaneous multithreading (aka “hyper-threading”)

 Multiple processor cores per die


 Chip multiprocessors (CMP) or “Multicore processors” (MCP)

 Chip multithreading (CMT)

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 11


Recall: Processes and Software Threads
 Process: an instance of a program executing in a system
 OS supports concurrent execution of multiple processes
 Each process has its own address space, set of registers, and PC
 Two different processes can partially share their address spaces to
communicate
 Thread: an independent control stream within a process
 A process can have one or more threads
 Private state: PC, registers (int, FP), stack, thread-local storage
 Shared state: heap, address space (VM structures)
 A “parallel program” is one process but multiple threads
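A concrete illustration (not from the slides): a minimal POSIX-threads program in which one process runs two threads that share its globals/heap while each keeps a private stack, PC, and registers.

/* Minimal sketch: one process, two software threads. The global counter and the
 * heap are shared; each thread's PC, registers, and stack (e.g., loop index i)
 * are private. Compile with: cc prog.c -pthread */
#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;                         /* shared state */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {                   /* i lives on this thread's private stack */
        pthread_mutex_lock(&lock);
        shared_counter++;                              /* implicit communication via shared memory */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);   /* prints 2000 */
    return 0;
}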
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 12
Reminder: Classic OS Context Switch
 OS context-switch
 Timer interrupt stops a program mid-execution (precise)
 OS saves the context of the stopped thread
 PC, GPRs, and more

 Shared state such as physical pages are not saved

 OS restores the context of a previously stopped thread (all except PC)


 OS uses a “return from exception” to jump to the restarting PC
 The restored thread has no idea it was interrupted, removed, later restarted

 Takes a few hundred cycles per switch (why?)


 Amortized over the execution “quantum”

 What latencies can you hide using OS context switching?


 How much faster would a user-level thread switch be?
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 13
Multithreaded Cores (old “Multiprogramming”)
 Basic idea:
 CPU resources are expensive and should not be idle
 1960’s: Virtual memory and multiprogramming
 Virtual memory/multiprogramming invented to tolerate latency to
secondary storage (disk/tape/etc.)
 Processor-disk speed mismatch:
 microseconds to tens of milliseconds (1:10,000 or more)
 OS context switch used to bring in other useful work while waiting for page
fault or explicit file read/write accesses
 Cost of context switch must be much less than I/O latency (easy)

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 14


Multithreaded Cores (new “Multithreading”)
 1990’s: Memory wall and multithreading
 Processor-DRAM speed mismatch:
 nanosecond to fractions of a microsecond (1:500)
 H/W task switch used to bring in other useful work while
waiting for cache miss
 Cost of context switch must be much less than cache miss
latency
 Very attractive for applications with abundant thread-level
parallelism
 Commercial multi-user (transaction processing) workloads

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 15


Processor Scaling Since ~2005

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 16


The Multicore Alternative
 Use Moore’s law to place more cores per chip
 Potentially 2x cores/chip with each CMOS generation
 Without significantly compromising clock frequency
 Known as Multi-Core Processors (MCP) or Chip Multiprocessors (CMP)
 The good news
 Continued scaling of chip-level peak (throughput) performance
 Mitigate the undesirable superscalar power scaling (“wrong side of the square law”)
 Facilitate design and verification, and product differentiation
 The bad news
 Require multithreaded workloads: multiple programs or parallel programs
 Require parallelizing single applications into parallel programs
 Power is still an issue as transistors shrink due to leakage current

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 17


Big OOO Superscalar vs. Multicore Processor

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 18


[Ed Grochowski, 2004]

Power & Throughput (Multi-Thread) Performance


❖ Assume a large-scale multicore processor (MCP) with potentially many cores
❖ Replication of cores results in nearly proportional increases to both throughput performance and power (hopefully).

[Figure: relative power vs. relative performance (normalized to i486). Scalar/latency performance (i486, Pentium, Pentium Pro, Pentium M, Pentium 4 Wmt/Psc) follows power = perf^1.74; throughput performance from replicating cores would ideally follow power = perf^1.0 (shown with a question mark on the original chart).]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 19


Programming for Multicore Processors (MCP)
➢ Programmers must write parallel programs using threads/processes.
➢ Spread the workload across multiple cores at run time.
➢ OS will map threads/processes to cores at run time.

Assigning Threads to Cores:


➢ Each thread/process has an affinity mask
➢ Affinity mask specifies what cores the thread is allowed to run on.
➢ Different threads can have different masks
➢ Affinities are inherited across fork()
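A minimal, Linux-specific sketch (not from the slides) of setting a thread's affinity mask so that it may run only on core 2:

/* Linux-only sketch: restrict the calling thread to core 2 via its affinity mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);                                       /* allow core 2 only */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {    /* pid 0 = calling thread */
        perror("sched_setaffinity");
        return 1;
    }
    printf("now restricted to core 2\n");
    return 0;
}

(Child processes created with fork() inherit this mask, as noted above.)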

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 20


Shared-Memory Multiprocessors or Multicores
 All processor cores have access to unified physical memory
 They can communicate via the shared memory using loads and stores
 Advantages
 Supports multi-threading (TLP) using multiple cores
 Requires relatively simple changes to the OS for scheduling
 Threads within an app can communicate implicitly without using OS
 Simpler to code for and lower overhead

 App development: first focus on correctness, then on performance


 Disadvantages
 Implicit communication is hard to optimize
 Synchronization can get tricky
 Higher hardware complexity for cache management
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 21
Caches for Multicores (or Multicore Processors)
 Caches are (equally) helpful with multicores
 Reduce access latency, reduce bandwidth requirements
 For both private and shared data across cores
 Advantages of private caches:
 They are closer to core, so faster access
 Reduces contention to cache by cores
 Advantages of shared cache:
 Threads on different cores can share the same cache data
 More cache space available if a single (or a few) high-performance
thread runs on the system
 But multiple private caches introduce the two problems of
 Cache Coherence (covered in this lecture)
 Memory Consistency (beyond this course)
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 22
The Cache Coherence Problem
• Since we have private caches:
How to keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared
by all the cores

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 23


The Cache Coherence Problem
Suppose variable x initially contains 15213

[Figure: multi-core chip with Core 1–Core 4, each with one or more levels of private cache; main memory holds x=15213.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 24
The Cache Coherence Problem
Core 1 reads x

[Figure: Core 1's cache now holds x=15213; main memory holds x=15213.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 25
The Cache Coherence Problem
Core 2 reads x

[Figure: Core 1's and Core 2's caches each hold x=15213; main memory holds x=15213.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 26
The Cache Coherence Problem
Core 1 writes to x, setting it to 21660

[Figure: Core 1's cache holds x=21660 while Core 2's cache still holds x=15213; main memory holds x=21660 (assuming write-through caches).]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 27
The Cache Coherence Problem
Core 2 attempts to read x… gets a stale copy

[Figure: Core 2 reads its cached copy x=15213, which is stale; Core 1's cache and main memory hold x=21660.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 28
Solutions for Cache Coherence Problem
• This is a general problem with shared memory
multiprocessors and multicores with private caches
• Coherence Solution:
• Use HW to ensure that loads from all cores will return the
value of the latest store to that memory location
• Use metadata to track the state for cached data
• There exist two major categories with many specific
coherence protocols.

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 29


Bus Based (“Snooping”) Multicore Processor

[Figure: multi-core chip with Core 1–Core 4, each with one or more levels of private cache, all connected to main memory by an inter-core bus.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 30
Invalidation Protocol with Snooping
• Invalidation:
If a core writes to a data item, all other copies of this
data item in other caches are invalidated
• Snooping:
All cores continuously “snoop” (monitor) the bus
connecting the cores.

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 31


Invalidation Based Cache Coherence Protocol
Revisited: Cores 1 and 2 have both read x

[Figure: Core 1's and Core 2's caches each hold x=15213; main memory holds x=15213.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 32
Invalidation Based Cache Coherence Protocol
Core 1 writes to x, setting it to 21660

[Figure: Core 1 writes x=21660 and sends an invalidation request on the inter-core bus; Core 2's copy of x is INVALIDATED; main memory holds x=21660 (assuming write-through caches).]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 33
Invalidation Based Cache Coherence Protocol
After invalidation:

[Figure: only Core 1's cache holds x=21660; Core 2's copy has been invalidated; main memory holds x=21660.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 34
Invalidation Based Cache Coherence Protocol
Core 2 reads x. Cache misses, and loads the new copy.

[Figure: Core 2 misses and loads the new copy; Core 1's and Core 2's caches both hold x=21660; main memory holds x=21660.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 35
Update Based Cache Coherence Protocol
Core 1 writes x=21660:

[Figure: Core 1 writes x=21660 and broadcasts the updated value on the inter-core bus; Core 2's copy is UPDATED to x=21660; main memory holds x=21660 (assuming write-through caches).]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 36
Invalidation vs. Update Protocols
• Multiple writes to the same location
– invalidation: only the first time
– update: must broadcast each write
(which includes new variable value)

• Invalidation generally performs better:


it generates less bus traffic

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 37


Cache Coherence
 Informally, with coherent caches: accesses to a memory
location appear to occur simultaneously in all copies of that
memory location
“copies” = caches + memory
 Cache coherence suggests an absolute time scale -- this is
not necessary
 What is required is the "appearance" of coherence... not
absolute coherence
 E.g. temporary incoherence between memory and a write-back
cache may be OK.
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 38
Write Update vs.
Write Invalidate
 Coherent caches with
Shared Memory
 All cores see the effects
of others’ writes
 How/when writes are
propagated
 Determined by
coherence protocol

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 39


Bus-Based Snoopy Cache Coherence
 All requests broadcast on the bus
 All processors (or private caches) and memory snoop and respond
 Cache blocks writeable at one processor or read-only at several
 Single-writer protocol
 Snoops that hit dirty (i.e. modified) lines?
 Flush modified data out of cache
 Either write back to memory, then satisfy remote miss from memory, or
 Provide dirty (modified) data directly to requestor
 Big problem in shared-memory multicore processor systems
 Dirty/coherence/sharing misses

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 40


Bus-Based Protocols
 Protocol consists of states and actions (state transitions)
 Actions can be invoked from the processor or from the bus to the cache controller
 Coherence is maintained per cache line (block)

[Figure: the processor issues processor actions to its cache controller; the cache controller keeps state, tags, and data per cache line and issues/observes bus actions on the bus to main memory.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 41


Minimal Coherence Protocol (Write-Back Caches)
 Blocks are always private or exclusive: Valid/Modified (M) or Invalid (I)
 State transitions:
   Local read: I->M; fetch, invalidate other copies
   Local write: I->M; fetch, invalidate other copies
   Evict: M->I; write back data
   Remote read: M->I; write back data
   Remote write: M->I; write back data

[Figure: two-state diagram — Invalid (I) moves to Valid (M) on a local read or write; Valid (M) moves to Invalid (I) on a local evict, remote read, or remote write. Example cache: tag A in state M, tag B in state I.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 42


Invalidate Protocol Optimization
 Observation: data often read shared by multiple CPUs
 Add S (shared) state to protocol: MSI
 State transitions:
 Local read: I->S, fetch shared
 Local write: I->M, fetch modified; S->M, invalidate other copies
 Remote read: M->I, write back data
 Remote write: M->I, write back data

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 43


MSI Protocol (with Write Back Cache)
Action and Next State (per cache line):

State I:
  Processor Read:   Cache Read; acquire copy; →S
  Processor Write:  Cache Read&M; acquire copy; →M
  Eviction / snooped Cache Read / Cache Read&M / Cache Upgrade: no action; →I

State S:
  Processor Read:   no action; →S
  Processor Write:  Cache Upgrade; →M
  Eviction:         no action; →I
  Cache Read:       no action; →S
  Cache Read&M:     invalidate frame; →I
  Cache Upgrade:    invalidate frame; →I

State M:
  Processor Read:   no action; →M
  Processor Write:  no action; →M
  Eviction:         cache write-back; →I
  Cache Read:       memory inhibit; supply data; →S
  Cache Read&M:     invalidate frame; memory inhibit; supply data; →I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 44


MSI Example
Thread Event     Bus Action   Data From   Global State   Local States (C0 C1 C2)
0. Initially:                             <0,0,0,1>      I  I  I
1. T0 read→      CR           Memory      <1,0,0,1>      S  I  I
2. T0 write→     CU                       <1,0,0,0>      M  I  I
3. T2 read→      CR           C0          <1,0,1,1>      S  I  S
4. T1 write→     CRM          Memory      <0,1,0,0>      I  M  I

 If the line is in no other cache:
   Read, modify, write still requires 2 bus transactions
   Optimization: add an Exclusive state

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 45


MSI: A Coherence Protocol (Write Back Caches)
Each cache line has an address tag and state bits.   M: Modified, S: Shared, I: Invalid

[Figure: MSI state diagram for the cache state in processor P1 — M stays M on P1 reads or writes; M→S when another processor reads (P1 writes back); M→I when another processor signals intent to write (write miss); S→M on P1's intent to write; S stays S on reads by any processor; S→I on another processor's intent to write; I→S on a P1 read miss; I→M on a P1 write miss.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 46


MSI Coherence Protocol Example with 2 Cores
[Figure: two coupled MSI state diagrams, one for P1's cache line and one for P2's. Each diagram is the MSI machine from the previous slide; the cores' actions cross-couple them: P2's read moves P1's line M→S (P1 writes back), P2's intent to write moves P1's line M→I or S→I, and symmetrically P1's reads and intent-to-write signals drive P2's line.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 47


Invalidate Protocol Optimizations
 Observation: data can be write-private (e.g. stack frame)
 Avoid invalidate messages in that case
 Add E (exclusive) state to protocol: MESI
 State transitions:
 Local read: I->E if only copy, I->S if other copies exist
 Local write: E->M silently, S->M, invalidate other copies

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 48


MESI Protocol (most common in industry)
 Variation used in many Intel processors
 4-State Protocol
 Modified: <1,0,0…0>

 Exclusive: <1,0,0,…,1>

 Shared: <1,X,X,…,1>

 Invalid: <0,X,X,…X>

 Bus/Processor Actions
 Same as MSI

 Adds shared signal to indicate if other caches have a copy

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 49


MESI Protocol
Action and Next State (per cache line):

State I:
  Processor Read:   Cache Read; if no sharers →E, if sharers →S
  Processor Write:  Cache Read&M; →M
  Eviction / snooped Cache Read / Cache Read&M / Cache Upgrade: no action; →I

State S:
  Processor Read:   no action; →S
  Processor Write:  Cache Upgrade; →M
  Eviction:         no action; →I
  Cache Read:       respond shared; →S
  Cache Read&M:     no action; →I
  Cache Upgrade:    no action; →I

State E:
  Processor Read:   no action; →E
  Processor Write:  no action; →M (silent upgrade)
  Eviction:         no action; →I
  Cache Read:       respond shared; →S
  Cache Read&M:     no action; →I

State M:
  Processor Read:   no action; →M
  Processor Write:  no action; →M
  Eviction:         cache write-back; →I
  Cache Read:       respond dirty; write back data; →S
  Cache Read&M:     respond dirty; write back data; →I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 50


MESI Example

Thread Event     Bus Action   Data From   Global State   Local States (C0 C1 C2)
0. Initially:                             <0,0,0,1>      I  I  I
1. T0 read→      CR           Memory      <1,0,0,1>      E  I  I
2. T0 write→     none                     <1,0,0,0>      M  I  I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 51


Cache-to-Cache Transfers
 Common in many workloads:
 T0 writes to a block: <1,0,…,0> (block in M state in T0)
 T1 reads from block: T0 must write back, then T1 reads from memory

 In shared-bus system
 T1 can snarf data from the bus during the writeback
 Called cache-to-cache transfer or dirty miss or intervention

 Without shared bus


 Must explicitly send data to requestor and to memory (for writeback)

 Known as the 4th C (cold, capacity, conflict, communication)


10/25/2017 (© J.P. Shen) 18-600 Lecture #17 52
MESI Example 2
Thread Event     Bus Action   Data From   Global State   Local States (C0 C1 C2)
0. Initially:                             <0,0,0,1>      I  I  I
1. T0 read→      CR           Memory      <1,0,0,1>      E  I  I
2. T0 write→     none                     <1,0,0,0>      M  I  I
3. T1 read→      CR           C0          <1,1,0,1>      S  S  I
4. T2 read→      CR           Memory      <1,1,1,1>      S  S  S

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 53


MOESI Optimization (IEEE Standard)
 Observation: shared ownership prevents cache-to-cache
transfer, causes unnecessary memory read
 Add O (owner) state to protocol: MOSI/MOESI
 Last requestor becomes the owner
 Avoid writeback (to memory) of dirty data
 Also called shared-dirty state, since memory is stale

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 54


MOESI Protocol
 Used in AMD Opteron
 5-State Protocol
 Modified: <1,0,0…0>
 Exclusive: <1,0,0,…,1>

 Shared: <1,X,X,…,1>

 Invalid: <0,X,X,…X>

 Owned: <1,X,X,X,0> ; only one owner, memory not up to date

 Owner can supply data, so memory does not have to


 Avoids lengthy memory access
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 55
MOESI Protocol
Action and Next State (per cache line):

State I:
  Processor Read:   Cache Read; if no sharers →E, if sharers →S
  Processor Write:  Cache Read&M; →M
  Eviction / snooped Cache Read / Cache Read&M / Cache Upgrade: no action; →I

State S:
  Processor Read:   no action; →S
  Processor Write:  Cache Upgrade; →M
  Eviction:         no action; →I
  Cache Read:       respond shared; →S
  Cache Read&M:     no action; →I
  Cache Upgrade:    no action; →I

State E:
  Processor Read:   no action; →E
  Processor Write:  no action; →M
  Eviction:         no action; →I
  Cache Read:       respond shared; supply data; →S
  Cache Read&M:     respond shared; supply data; →I

State O:
  Processor Read:   no action; →O
  Processor Write:  Cache Upgrade; →M
  Eviction:         cache write-back; →I
  Cache Read:       respond shared; supply data; →O
  Cache Read&M:     respond shared; supply data; →I

State M:
  Processor Read:   no action; →M
  Processor Write:  no action; →M
  Eviction:         cache write-back; →I
  Cache Read:       respond shared; supply data; →O
  Cache Read&M:     respond shared; supply data; →I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 56


MOESI Example
Thread Event     Bus Action   Data From   Global State   Local States (C0 C1 C2)
0. Initially:                             <0,0,0,1>      I  I  I
1. T0 read→      CR           Memory      <1,0,0,1>      E  I  I
2. T0 write→     none                     <1,0,0,0>      M  I  I
3. T2 read→      CR           C0          <1,0,1,0>      O  I  S
4. T1 write→     CRM          C0          <0,1,0,0>      I  M  I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 57


MOESI Coherence Protocol
 A protocol that tracks validity, ownership, and exclusiveness
 Modified: dirty and private
 Owned: dirty but shared
 Avoid writeback to memory on M->S transitions

 Exclusive: clean but private


 Avoid upgrade misses on private data

 Shared
 Invalid
 There are also some variations (MOSI and MESI)
 What happens when 2 cores read/write different words in a cache
line?
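One well-known answer is “false sharing”: the line ping-pongs between the cores' caches even though no word is actually shared. An illustrative C sketch (not from the slides; a 64-byte line size is assumed):

/* Two threads update different words that likely share one cache line, so the
 * line ping-pongs between the cores (false sharing). Inserting ~64 bytes of
 * padding between 'a' and 'b' (one line, assumed) typically removes the effect. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

static struct { long a; long b; } counters;   /* a and b likely in the same cache line */

static void *bump_a(void *arg) { for (long i = 0; i < ITERS; i++) counters.a++; return NULL; }
static void *bump_b(void *arg) { for (long i = 0; i < ITERS; i++) counters.b++; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", counters.a, counters.b);
    return 0;
}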
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 58
Snooping with Multi-level Caches
 Private L2 caches
 If inclusive, snooping traffic checked at the L2 level first
 Only accesses that refer to data cached in L1 need to be
forwarded
 Saves bandwidth at the L1 cache

 Shared L2 or L3 caches
 Can act as serialization points even if there is no bus
 Track state of cache line and list of sharers (bit mask)
 Essentially the shared cache acts like a coherence directory
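A minimal illustrative sketch (not from the slides) of what such a directory/shared-cache entry might track; the structure and helper names are hypothetical:

/* Hypothetical directory entry: coherence state plus a bit mask of sharer cores. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

enum line_state { LINE_INVALID, LINE_SHARED, LINE_MODIFIED };

struct dir_entry {
    enum line_state state;
    uint64_t sharers;                 /* bit i set => core i may hold a copy */
};

static void dir_add_sharer(struct dir_entry *e, int core) {
    e->sharers |= (1ULL << core);
    e->state = LINE_SHARED;
}

static bool dir_needs_invalidate(const struct dir_entry *e, int writer) {
    /* A write by 'writer' must invalidate every other sharer. */
    return (e->sharers & ~(1ULL << writer)) != 0;
}

int main(void) {
    struct dir_entry e = { LINE_INVALID, 0 };
    dir_add_sharer(&e, 1);            /* core 1 reads the line */
    dir_add_sharer(&e, 3);            /* core 3 reads the line */
    printf("core 1 write needs invalidations: %s\n",
           dir_needs_invalidate(&e, 1) ? "yes" : "no");   /* yes: core 3 holds a copy */
    return 0;
}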

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 59


Scaling Coherence Protocols
 The problem
 Too much broadcast traffic for snooping (probing)
 Solution: probe filters
 Maintain info about which address ranges are definitely not shared (or definitely shared)
 Allows filtering of snoop traffic
 Solution: directory based coherence
 A directory stores all coherence info (e.g., sharers)
 Consult directory before sending coherence messages
 Caching/filtering schemes to avoid latency of 3-hops

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 60


The Memory Consistency Problem: Example
/* Assume initial value of A and flag is 0 */

P1:                 P2:
A = 1;              while (flag == 0);   /* spin idly */
flag = 1;           print A;

 Intuitively, you expect to print A=1


 But can you think of a case where you will print A=0?
 Even if cache coherence is available
 Coherence talks about accesses to a single location
 Consistency is about ordering of accesses to different locations
 Alternatively
 Coherence determines what value is returned by a read
 Consistency determines when a write value becomes visible
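An illustrative C11 sketch (not from the slides) of how the intended ordering can be enforced with release/acquire atomics, so that P2 is guaranteed to print A = 1 once it observes flag == 1:

/* Flag-based publication with C11 atomics. Compile with: cc prog.c -pthread */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int A = 0;
static atomic_int flag = 0;

static void *p1(void *arg) {
    (void)arg;
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);   /* publish A */
    return NULL;
}

static void *p2(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                     /* spin idly */
    printf("A = %d\n", A);                                    /* guaranteed to print 1 */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}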

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 61


18-600 Foundations of Computer Systems
Lecture 18:
“Program Performance Optimizations”
John P. Shen & Gregory Kesden
November 1, 2017

[Figure: 18-600 course layers — SE, PL, OS, CA]

➢ Required Reading Assignment:
• Chapter 5 of CS:APP (3rd edition) by Randy Bryant & Dave O’Hallaron.

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 62
