
18-600 Foundations of Computer Systems

Lecture 17:
“Multicore Cache Coherence”
John P. Shen
October 25, 2017

Prevalence of multicore processors:
▪ 2006: 75% for desktops, 85% for servers
▪ 2007: 90% for desktops and mobiles, 100% for servers
▪ Today: 100% multicore, with core counts ranging from 2 to 8 cores for desktops and mobiles, and 8+ cores for servers

➢ Recommended Reference:
• “Parallel Computer Organization and Design,” by Michel Dubois,
Murali Annavaram, Per Stenstrom, Chapters 5 and 7, 2012.

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 1


18-600 Foundations of Computer Systems

Lecture 17:
“Multicore Cache Coherence”
A. Multicore Processors
▪ The Case for Multicores
▪ Programming for Multicores
▪ The Cache Coherence Problem
B. Cache Coherence Protocol Categories
▪ Write Update
▪ Write Invalidate
C. Specific Bus-Based Snoopy Protocols
▪ VI & MI Protocols
▪ MSI, MESI, MOESI Protocols
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 2
The Case for Multicore Processors (MCP)
 Stalled Scaling of Single-Core Performance
 Expected Continuation of Moore’s Law
 Throughput Performance for Server Workloads

[Figure: Multicore Processor package — Core 0 … Core 3, each with its own registers, private L1 d-cache and i-cache, and a private unified L2 cache; an L3 unified cache shared by all cores; main memory below.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 3


Processor Scaling Until ~2004

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 4


Processor Development Until ~2004
 Moore’s Law: transistor count doubles every 18 months
 Used to improve processor performance by 2x every 18 months
 Single core, binary compatible to previous generations

 Contributors to performance improvements


 More ILP through OOO superscalar techniques
 Wider issue, better branch prediction, better instruction scheduling, …

 Better memory hierarchies, faster and larger

 Clock frequency improvements with deeper pipelines

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 5


Problems with Single Core Performance
 Moore’s Law is still doing well (for the foreseeable future…)
 The Power Wall
 Power ≈ CL * Vdd² * Freq (dynamic power; see the numeric sketch below)
 Cannot scale transistor count and frequency without reducing Vdd
 Unfortunately, voltage scaling has essentially stalled
 The Complexity Wall
 Designing and verifying increasingly large OOO cores is very expensive
 100s of engineers for 3-5 years

 Caches are easier to design but can only help so much…
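As a quick numeric illustration (not from the slides; the scaling factors are made up), a minimal C sketch of the dynamic-power relation above:

/* Illustrative only: relative dynamic power P ~ C_L * Vdd^2 * f.
 * The scaling factors below are hypothetical, not measured data. */
#include <stdio.h>

static double rel_power(double cap, double vdd, double freq) {
    return cap * vdd * vdd * freq;              /* P ≈ C_L * Vdd^2 * f */
}

int main(void) {
    double base = rel_power(1.0, 1.0, 1.0);
    /* Doubling frequency at constant Vdd doubles dynamic power. */
    printf("2x freq, same Vdd: %.2fx power\n", rel_power(1.0, 1.0, 2.0) / base);
    /* Doubling frequency while also scaling Vdd down by 30% roughly breaks even... */
    printf("2x freq, 0.7x Vdd: %.2fx power\n", rel_power(1.0, 0.7, 2.0) / base);
    /* ...which is why stalled voltage scaling makes further frequency scaling costly. */
    return 0;
}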

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 6


[Ed Grochowski, 2004]

Power & Latency (Single-Thread) Performance


❖ For comparison
  ▪ Factor out contributions due to process technology
  ▪ Keep contributions due to microarchitecture design
  ▪ Normalize to i486™ processor
❖ Relative to the i486™, the Pentium® 4 (Wmt) processor is
  ▪ 6x faster (2x IPC at 3x frequency)
  ▪ 23x higher power
  ▪ Spending 4 units of power for every 1 unit of scalar performance

[Figure: relative power vs. relative performance (normalized to i486) for i486, Pentium, Pentium Pro, Pentium 4 (Wmt), and Pentium 4 (Psc); the trend follows power = perf^1.74.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 7


From ILP to TLP
 So far, we run single process, single thread
 Extracting ILP from sequential instruction stream

 Single-thread performance can't scale indefinitely!


 Limited ILP within each thread
 Power consumption & complexity of superscalar cores

 We will now pursue Thread-Level Parallelism (TLP)


 To increase utilization and tolerate latency in single core
 To exploit multiple cores
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 8
Thread-Level Parallelism
 Instruction-level parallelism (ILP)
 Reaps performance by finding independent work in a single thread
 Thread-level parallelism (TLP)
 Reaps performance by finding independent work across multiple threads
 Historically, requires explicitly parallel workloads
 Originated with mainframe time-sharing workloads
 Even then, CPU speed >> I/O speed
 Had to overlap I/O latency with “something else” for the CPU to do
 Hence, operating system would schedule other tasks/processes/threads
that were “time-sharing” the CPU

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 9


Thread-Level Parallelism

 Multithreading reduces the effectiveness of temporal and spatial locality (concurrent threads compete for the same cache capacity)


10/25/2017 (© J.P. Shen) 18-600 Lecture #17 10
Thread-Level Parallelism
 Initially motivated by time-sharing of single CPU
 OS, applications written to be multithreaded
 Quickly led to adoption of multiple CPUs in a single system
 Enabled scalable product line from entry-level single-CPU systems to high-end
multiple-CPU systems
 Same applications, OS, run seamlessly
 Adding CPUs increases throughput (performance)
 More recently:
 Multiple threads per processor core
 Coarse-grained multithreading (aka “switch-on-event”)

 Simultaneous multithreading (aka “hyper-threading”)

 Multiple processor cores per die


 Chip multiprocessors (CMP) or “Multicore processors” (MCP)

 Chip multithreading (CMT)

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 11


Recall: Processes and Software Threads
 Process: an instance of a program executing in a system
 OS supports concurrent execution of multiple processes
 Each process has its own address space, set of registers, and PC
 Two different processes can partially share their address spaces to
communicate
 Thread: an independent control stream within a process
 A process can have one or more threads
 Private state: PC, registers (int, FP), stack, thread-local storage
 Shared state: heap, address space (VM structures)
 A “parallel program” is one process but multiple threads
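A concrete illustration (not from the slides): a minimal POSIX-threads program in which one process runs two threads that share its globals/heap while each keeps a private stack, PC, and registers.

/* Minimal sketch: one process, two software threads. The global counter and the
 * heap are shared; each thread's PC, registers, and stack (e.g., loop index i)
 * are private. Compile with: cc prog.c -pthread */
#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;                         /* shared state */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {                   /* i lives on this thread's private stack */
        pthread_mutex_lock(&lock);
        shared_counter++;                              /* implicit communication via shared memory */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);   /* prints 2000 */
    return 0;
}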
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 12
Reminder: Classic OS Context Switch
 OS context-switch
 Timer interrupt stops a program mid-execution (precise)
 OS saves the context of the stopped thread
 PC, GPRs, and more

 Shared state such as physical pages are not saved

 OS restores the context of a previously stopped thread (all except PC)


 OS uses a “return from exception” to jump to the restarting PC
 The restored thread has no idea it was interrupted, removed, later restarted

 Takes a few hundred cycles per switch (why?)


 Amortized over the execution “quantum”

 What latencies can you hide using OS context switching?


 How much faster would a user-level thread switch be?
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 13
Multithreaded Cores (old “Multiprogramming”)
 Basic idea:
 CPU resources are expensive and should not be idle
 1960’s: Virtual memory and multiprogramming
 Virtual memory/multiprogramming invented to tolerate latency to
secondary storage (disk/tape/etc.)
 Processor-disk speed mismatch:
 microseconds to tens of milliseconds (1:10,000 or more)
 OS context switch used to bring in other useful work while waiting for page
fault or explicit file read/write accesses
 Cost of context switch must be much less than I/O latency (easy)

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 14


Multithreaded Cores (new “Multithreading”)
 1990’s: Memory wall and multithreading
 Processor-DRAM speed mismatch:
 nanosecond to fractions of a microsecond (1:500)
 H/W task switch used to bring in other useful work while
waiting for cache miss
 Cost of context switch must be much less than cache miss
latency
 Very attractive for applications with abundant thread-level
parallelism
 Commercial multi-user (transaction processing) workloads

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 15


Processor Scaling Since ~2005

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 16


The Multicore Alternative
 Use Moore’s law to place more cores per chip
 Potentially 2x cores/chip with each CMOS generation
 Without significantly compromising clock frequency
 Known as Multi-Core Processors (MCP) or Chip Multiprocessors (CMP)
 The good news
 Continued scaling of chip-level peak (throughput) performance
 Mitigate the undesirable superscalar power scaling (“wrong side of the square law”)
 Facilitate design and verification, and product differentiation
 The bad news
 Require multithreaded workloads: multiple programs or parallel programs
 Require parallelizing single applications into parallel programs
 Power is still an issue as transistors shrink due to leakage current

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 17


Big OOO Superscalar vs. Multicore Processor

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 18


[Ed Grochowski, 2004]

Power & Throughput (Multi-Thread) Performance


❖ Assume a large-scale multicore processor (MCP) with potentially many cores
❖ Replication of cores results in nearly proportional increases to both throughput performance and power (hopefully).

[Figure: relative power vs. relative performance (normalized to i486). Scalar/latency performance (i486, Pentium, Pentium Pro, Pentium M, Pentium 4 Wmt/Psc) follows power = perf^1.74; throughput performance from replicating cores would ideally follow power = perf^1.0 (shown with a question mark on the original chart).]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 19


Programming for Multicore Processors (MCP)
➢ Programmers must write parallel programs using threads/processes.
➢ Spread the workload across multiple cores at run time.
➢ OS will map threads/processes to cores at run time.

Assigning Threads to Cores:


➢ Each thread/process has an affinity mask
➢ Affinity mask specifies what cores the thread is allowed to run on.
➢ Different threads can have different masks
➢ Affinities are inherited across fork()
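A minimal, Linux-specific sketch (not from the slides) of setting a thread's affinity mask so that it may run only on core 2:

/* Linux-only sketch: restrict the calling thread to core 2 via its affinity mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);                                       /* allow core 2 only */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {    /* pid 0 = calling thread */
        perror("sched_setaffinity");
        return 1;
    }
    printf("now restricted to core 2\n");
    return 0;
}

(Child processes created with fork() inherit this mask, as noted above.)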

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 20


Shared-Memory Multiprocessors or Multicores
 All processor cores have access to unified physical memory
 They can communicate via the shared memory using loads and stores
 Advantages
 Supports multi-threading (TLP) using multiple cores
 Requires relatively simple changes to the OS for scheduling
 Threads within an app can communicate implicitly without using OS
 Simpler to code for and lower overhead

 App development: first focus on correctness, then on performance


 Disadvantages
 Implicit communication is hard to optimize
 Synchronization can get tricky
 Higher hardware complexity for cache management
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 21
Caches for Multicores (or Multicore Processors)
 Caches are (equally) helpful with multicores
 Reduce access latency, reduce bandwidth requirements
 For both private and shared data across cores
 Advantages of private caches:
 They are closer to core, so faster access
 Reduces contention to cache by cores
 Advantages of shared cache:
 Threads on different cores can share the same cache data
 More cache space available if a single (or a few) high-performance
thread runs on the system
 But multiple private caches introduce the two problems of
 Cache Coherence (covered in this lecture)
 Memory Consistency (beyond this course)
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 22
The Cache Coherence Problem
• Since we have private caches:
How to keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared
by all the cores

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 23


The Cache Coherence Problem
Suppose variable x initially contains 15213

[Figure: multi-core chip with Core 1–Core 4, each with one or more levels of private cache; main memory holds x=15213.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 24
The Cache Coherence Problem
Core 1 reads x

[Figure: Core 1's cache now holds x=15213; main memory holds x=15213.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 25
The Cache Coherence Problem
Core 2 reads x

[Figure: Core 1's and Core 2's caches each hold x=15213; main memory holds x=15213.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 26
The Cache Coherence Problem
Core 1 writes to x, setting it to 21660

[Figure: Core 1's cache holds x=21660 while Core 2's cache still holds x=15213; main memory holds x=21660 (assuming write-through caches).]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 27
The Cache Coherence Problem
Core 2 attempts to read x… gets a stale copy

[Figure: Core 2 reads its cached copy x=15213, which is stale; Core 1's cache and main memory hold x=21660.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 28
Solutions for Cache Coherence Problem
• This is a general problem with shared memory
multiprocessors and multicores with private caches
• Coherence Solution:
• Use HW to ensure that loads from all cores will return the
value of the latest store to that memory location
• Use metadata to track the state for cached data
• There exist two major categories with many specific
coherence protocols.

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 29


Bus Based (“Snooping”) Multicore Processor

[Figure: multi-core chip with Core 1–Core 4, each with one or more levels of private cache, all connected to main memory by an inter-core bus.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 30
Invalidation Protocol with Snooping
• Invalidation:
If a core writes to a data item, all other copies of this
data item in other caches are invalidated
• Snooping:
All cores continuously “snoop” (monitor) the bus
connecting the cores.

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 31


Invalidation Based Cache Coherence Protocol
Revisited: Cores 1 and 2 have both read x

[Figure: Core 1's and Core 2's caches each hold x=15213; main memory holds x=15213.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 32
Invalidation Based Cache Coherence Protocol
Core 1 writes to x, setting it to 21660

[Figure: Core 1 writes x=21660 and sends an invalidation request on the inter-core bus; Core 2's copy of x is INVALIDATED; main memory holds x=21660 (assuming write-through caches).]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 33
Invalidation Based Cache Coherence Protocol
After invalidation:

[Figure: only Core 1's cache holds x=21660; Core 2's copy has been invalidated; main memory holds x=21660.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 34
Invalidation Based Cache Coherence Protocol
Core 2 reads x. Cache misses, and loads the new copy.

[Figure: Core 2 misses and loads the new copy; Core 1's and Core 2's caches both hold x=21660; main memory holds x=21660.]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 35
Update Based Cache Coherence Protocol
Core 1 writes x=21660:

[Figure: Core 1 writes x=21660 and broadcasts the updated value on the inter-core bus; Core 2's copy is UPDATED to x=21660; main memory holds x=21660 (assuming write-through caches).]
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 36
Invalidation vs. Update Protocols
• Multiple writes to the same location
– invalidation: only the first time
– update: must broadcast each write
(which includes new variable value)

• Invalidation generally performs better:


it generates less bus traffic

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 37


Cache Coherence
 Informally, with coherent caches: accesses to a memory
location appear to occur simultaneously in all copies of that
memory location
“copies” = caches + memory
 Cache coherence suggests an absolute time scale -- this is
not necessary
 What is required is the "appearance" of coherence... not
absolute coherence
 E.g. temporary incoherence between memory and a write-back
cache may be OK.
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 38
Write Update vs.
Write Invalidate
 Coherent caches with
Shared Memory
 All cores see the effects
of others’ writes
 How/when writes are
propagated
 Determined by
coherence protocol

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 39


Bus-Based Snoopy Cache Coherence
 All requests broadcast on the bus
 All processors (or private caches) and memory snoop and respond
 Cache blocks writeable at one processor or read-only at several
 Single-writer protocol
 Snoops that hit dirty (i.e. modified) lines?
 Flush modified data out of cache
 Either write back to memory, then satisfy remote miss from memory, or
 Provide dirty (modified) data directly to requestor
 Big problem in shared-memory multicore processor systems
 Dirty/coherence/sharing misses

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 40


Bus-Based Protocols
 Protocol consists of states and actions (state transitions)
 Actions can be invoked from the processor or from the bus to the cache controller
 Coherence is maintained per cache line (block)

[Figure: the processor issues processor actions to its cache controller; the cache controller keeps state, tags, and data per cache line and issues/observes bus actions on the bus to main memory.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 41


Minimal Coherence Protocol (Write-Back Caches)
 Blocks are always private or exclusive: Valid/Modified (M) or Invalid (I)
 State transitions:
   Local read: I->M; fetch, invalidate other copies
   Local write: I->M; fetch, invalidate other copies
   Evict: M->I; write back data
   Remote read: M->I; write back data
   Remote write: M->I; write back data

[Figure: two-state diagram — Invalid (I) moves to Valid (M) on a local read or write; Valid (M) moves to Invalid (I) on a local evict, remote read, or remote write. Example cache: tag A in state M, tag B in state I.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 42


Invalidate Protocol Optimization
 Observation: data often read shared by multiple CPUs
 Add S (shared) state to protocol: MSI
 State transitions:
 Local read: I->S, fetch shared
 Local write: I->M, fetch modified; S->M, invalidate other copies
 Remote read: M->I, write back data
 Remote write: M->I, write back data

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 43


MSI Protocol (with Write Back Cache)
Action and Next State (per cache line):

State I:
  Processor Read:   Cache Read; acquire copy; →S
  Processor Write:  Cache Read&M; acquire copy; →M
  Eviction / snooped Cache Read / Cache Read&M / Cache Upgrade: no action; →I

State S:
  Processor Read:   no action; →S
  Processor Write:  Cache Upgrade; →M
  Eviction:         no action; →I
  Cache Read:       no action; →S
  Cache Read&M:     invalidate frame; →I
  Cache Upgrade:    invalidate frame; →I

State M:
  Processor Read:   no action; →M
  Processor Write:  no action; →M
  Eviction:         cache write-back; →I
  Cache Read:       memory inhibit; supply data; →S
  Cache Read&M:     invalidate frame; memory inhibit; supply data; →I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 44


MSI Example
Thread Event     Bus Action   Data From   Global State   Local States (C0 C1 C2)
0. Initially:                             <0,0,0,1>      I  I  I
1. T0 read→      CR           Memory      <1,0,0,1>      S  I  I
2. T0 write→     CU                       <1,0,0,0>      M  I  I
3. T2 read→      CR           C0          <1,0,1,1>      S  I  S
4. T1 write→     CRM          Memory      <0,1,0,0>      I  M  I

 If the line is in no other cache:
   Read, modify, write still requires 2 bus transactions
   Optimization: add an Exclusive state

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 45


MSI: A Coherence Protocol (Write Back Caches)
Each cache line has an address tag and state bits.   M: Modified, S: Shared, I: Invalid

[Figure: MSI state diagram for the cache state in processor P1 — M stays M on P1 reads or writes; M→S when another processor reads (P1 writes back); M→I when another processor signals intent to write (write miss); S→M on P1's intent to write; S stays S on reads by any processor; S→I on another processor's intent to write; I→S on a P1 read miss; I→M on a P1 write miss.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 46


MSI Coherence Protocol Example with 2 Cores
[Figure: two coupled MSI state diagrams, one for P1's cache line and one for P2's. Each diagram is the MSI machine from the previous slide; the cores' actions cross-couple them: P2's read moves P1's line M→S (P1 writes back), P2's intent to write moves P1's line M→I or S→I, and symmetrically P1's reads and intent-to-write signals drive P2's line.]

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 47


Invalidate Protocol Optimizations
 Observation: data can be write-private (e.g. stack frame)
 Avoid invalidate messages in that case
 Add E (exclusive) state to protocol: MESI
 State transitions:
 Local read: I->E if only copy, I->S if other copies exist
 Local write: E->M silently, S->M, invalidate other copies

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 48


MESI Protocol (most common in industry)
 Variation used in many Intel processors
 4-State Protocol
 Modified: <1,0,0…0>

 Exclusive: <1,0,0,…,1>

 Shared: <1,X,X,…,1>

 Invalid: <0,X,X,…X>

 Bus/Processor Actions
 Same as MSI

 Adds shared signal to indicate if other caches have a copy

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 49


MESI Protocol
Action and Next State (per cache line):

State I:
  Processor Read:   Cache Read; if no sharers →E, if sharers →S
  Processor Write:  Cache Read&M; →M
  Eviction / snooped Cache Read / Cache Read&M / Cache Upgrade: no action; →I

State S:
  Processor Read:   no action; →S
  Processor Write:  Cache Upgrade; →M
  Eviction:         no action; →I
  Cache Read:       respond shared; →S
  Cache Read&M:     no action; →I
  Cache Upgrade:    no action; →I

State E:
  Processor Read:   no action; →E
  Processor Write:  no action; →M (silent upgrade)
  Eviction:         no action; →I
  Cache Read:       respond shared; →S
  Cache Read&M:     no action; →I

State M:
  Processor Read:   no action; →M
  Processor Write:  no action; →M
  Eviction:         cache write-back; →I
  Cache Read:       respond dirty; write back data; →S
  Cache Read&M:     respond dirty; write back data; →I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 50


MESI Example

Thread Event     Bus Action   Data From   Global State   Local States (C0 C1 C2)
0. Initially:                             <0,0,0,1>      I  I  I
1. T0 read→      CR           Memory      <1,0,0,1>      E  I  I
2. T0 write→     none                     <1,0,0,0>      M  I  I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 51


Cache-to-Cache Transfers
 Common in many workloads:
 T0 writes to a block: <1,0,…,0> (block in M state in T0)
 T1 reads from block: T0 must write back, then T1 reads from memory

 In shared-bus system
 T1 can snarf data from the bus during the writeback
 Called cache-to-cache transfer or dirty miss or intervention

 Without shared bus


 Must explicitly send data to requestor and to memory (for writeback)

 Known as the 4th C (cold, capacity, conflict, communication)


10/25/2017 (© J.P. Shen) 18-600 Lecture #17 52
MESI Example 2
Thread Event     Bus Action   Data From   Global State   Local States (C0 C1 C2)
0. Initially:                             <0,0,0,1>      I  I  I
1. T0 read→      CR           Memory      <1,0,0,1>      E  I  I
2. T0 write→     none                     <1,0,0,0>      M  I  I
3. T1 read→      CR           C0          <1,1,0,1>      S  S  I
4. T2 read→      CR           Memory      <1,1,1,1>      S  S  S

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 53


MOESI Optimization (IEEE Standard)
 Observation: shared ownership prevents cache-to-cache
transfer, causes unnecessary memory read
 Add O (owner) state to protocol: MOSI/MOESI
 Last requestor becomes the owner
 Avoid writeback (to memory) of dirty data
 Also called shared-dirty state, since memory is stale

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 54


MOESI Protocol
 Used in AMD Opteron
 5-State Protocol
 Modified: <1,0,0…0>
 Exclusive: <1,0,0,…,1>

 Shared: <1,X,X,…,1>

 Invalid: <0,X,X,…X>

 Owned: <1,X,X,X,0> ; only one owner, memory not up to date

 Owner can supply data, so memory does not have to


 Avoids lengthy memory access
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 55
MOESI Protocol
Action and Next State (per cache line):

State I:
  Processor Read:   Cache Read; if no sharers →E, if sharers →S
  Processor Write:  Cache Read&M; →M
  Eviction / snooped Cache Read / Cache Read&M / Cache Upgrade: no action; →I

State S:
  Processor Read:   no action; →S
  Processor Write:  Cache Upgrade; →M
  Eviction:         no action; →I
  Cache Read:       respond shared; →S
  Cache Read&M:     no action; →I
  Cache Upgrade:    no action; →I

State E:
  Processor Read:   no action; →E
  Processor Write:  no action; →M
  Eviction:         no action; →I
  Cache Read:       respond shared; supply data; →S
  Cache Read&M:     respond shared; supply data; →I

State O:
  Processor Read:   no action; →O
  Processor Write:  Cache Upgrade; →M
  Eviction:         cache write-back; →I
  Cache Read:       respond shared; supply data; →O
  Cache Read&M:     respond shared; supply data; →I

State M:
  Processor Read:   no action; →M
  Processor Write:  no action; →M
  Eviction:         cache write-back; →I
  Cache Read:       respond shared; supply data; →O
  Cache Read&M:     respond shared; supply data; →I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 56


MOESI Example
Thread Event     Bus Action   Data From   Global State   Local States (C0 C1 C2)
0. Initially:                             <0,0,0,1>      I  I  I
1. T0 read→      CR           Memory      <1,0,0,1>      E  I  I
2. T0 write→     none                     <1,0,0,0>      M  I  I
3. T2 read→      CR           C0          <1,0,1,0>      O  I  S
4. T1 write→     CRM          C0          <0,1,0,0>      I  M  I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 57


MOESI Coherence Protocol
 A protocol that tracks validity, ownership, and exclusiveness
 Modified: dirty and private
 Owned: dirty but shared
 Avoid writeback to memory on M->S transitions

 Exclusive: clean but private


 Avoid upgrade misses on private data

 Shared
 Invalid
 There are also some variations (MOSI and MESI)
 What happens when 2 cores read/write different words in a cache
line?
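One well-known answer is “false sharing”: the line ping-pongs between the cores' caches even though no word is actually shared. An illustrative C sketch (not from the slides; a 64-byte line size is assumed):

/* Two threads update different words that likely share one cache line, so the
 * line ping-pongs between the cores (false sharing). Inserting ~64 bytes of
 * padding between 'a' and 'b' (one line, assumed) typically removes the effect. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

static struct { long a; long b; } counters;   /* a and b likely in the same cache line */

static void *bump_a(void *arg) { for (long i = 0; i < ITERS; i++) counters.a++; return NULL; }
static void *bump_b(void *arg) { for (long i = 0; i < ITERS; i++) counters.b++; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", counters.a, counters.b);
    return 0;
}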
10/25/2017 (© J.P. Shen) 18-600 Lecture #17 58
Snooping with Multi-level Caches
 Private L2 caches
 If inclusive, snooping traffic checked at the L2 level first
 Only accesses that refer to data cached in L1 need to be
forwarded
 Saves bandwidth at the L1 cache

 Shared L2 or L3 caches
 Can act as serialization points even if there is no bus
 Track state of cache line and list of sharers (bit mask)
 Essentially the shared cache acts like a coherence directory
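A minimal illustrative sketch (not from the slides) of what such a directory/shared-cache entry might track; the structure and helper names are hypothetical:

/* Hypothetical directory entry: coherence state plus a bit mask of sharer cores. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

enum line_state { LINE_INVALID, LINE_SHARED, LINE_MODIFIED };

struct dir_entry {
    enum line_state state;
    uint64_t sharers;                 /* bit i set => core i may hold a copy */
};

static void dir_add_sharer(struct dir_entry *e, int core) {
    e->sharers |= (1ULL << core);
    e->state = LINE_SHARED;
}

static bool dir_needs_invalidate(const struct dir_entry *e, int writer) {
    /* A write by 'writer' must invalidate every other sharer. */
    return (e->sharers & ~(1ULL << writer)) != 0;
}

int main(void) {
    struct dir_entry e = { LINE_INVALID, 0 };
    dir_add_sharer(&e, 1);            /* core 1 reads the line */
    dir_add_sharer(&e, 3);            /* core 3 reads the line */
    printf("core 1 write needs invalidations: %s\n",
           dir_needs_invalidate(&e, 1) ? "yes" : "no");   /* yes: core 3 holds a copy */
    return 0;
}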

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 59


Scaling Coherence Protocols
 The problem
 Too much broadcast traffic for snooping (probing)
 Solution: probe filters
 Maintain info about which address ranges are definitely not shared (or definitely shared)
 Allows filtering of snoop traffic
 Solution: directory based coherence
 A directory stores all coherence info (e.g., sharers)
 Consult directory before sending coherence messages
 Caching/filtering schemes to avoid latency of 3-hops

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 60


The Memory Consistency Problem: Example
/* Assume initial value of A and flag is 0 */

P1:                 P2:
A = 1;              while (flag == 0);   /* spin idly */
flag = 1;           print A;

 Intuitively, you expect to print A=1


 But can you think of a case where you will print A=0?
 Even if cache coherence is available
 Coherence talks about accesses to a single location
 Consistency is about ordering of accesses to different locations
 Alternatively
 Coherence determines what value is returned by a read
 Consistency determines when a write value becomes visible
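An illustrative C11 sketch (not from the slides) of how the intended ordering can be enforced with release/acquire atomics, so that P2 is guaranteed to print A = 1 once it observes flag == 1:

/* Flag-based publication with C11 atomics. Compile with: cc prog.c -pthread */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int A = 0;
static atomic_int flag = 0;

static void *p1(void *arg) {
    (void)arg;
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);   /* publish A */
    return NULL;
}

static void *p2(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                     /* spin idly */
    printf("A = %d\n", A);                                    /* guaranteed to print 1 */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}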

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 61


18-600 Foundations of Computer Systems
Lecture 18:
“Program Performance Optimizations”
John P. Shen & Gregory Kesden
November 1, 2017

[Figure: 18-600 course layers — SE, PL, OS, CA]

➢ Required Reading Assignment:
• Chapter 5 of CS:APP (3rd edition) by Randy Bryant & Dave O’Hallaron.

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 62
