ARM Multi Core Processing

Multicore processors contain multiple CPU cores on a single chip connected by an interconnect. This allows parallel processing to improve performance. There are challenges in communication between cores and ensuring data consistency. Cores can communicate via either message passing or shared memory. Cache coherence protocols like MESI ensure cores always access the most up-to-date data values when using shared memory. The MESI protocol defines states like Modified, Exclusive, Shared, and Invalid for cache blocks to manage coherence across cores.

Multicore Processors

Module 8

© 2021 Arm Limited


Module Syllabus
• Multicore
• Communication
  • Message passing
  • Shared memory
• Cache-coherence protocol
  • MESI protocol states
  • MESI protocol transitions
  • Visualizing the protocol
  • MESI state transition diagram
• Memory consistency



Motivation
• Moore's law meant that the number of transistors available to an architect kept increasing.
• Historically, these were used to improve the performance of a single-core processor.
  • Usually through increased speculation support, but this gives sublinear performance improvements.
• However, the breakdown of Dennard scaling meant these schemes were no longer viable.
  • They consumed large amounts of power without giving commensurate performance improvements.
• Multicore architectures are an efficient way of using these transistors instead.
  • Performance comes from parallelism, specifically thread-level or process-level parallelism.
  • Note that we had multiprocessor systems long before Dennard scaling failed; this failure just pushed them into the mainstream.
• What are the challenges in providing multiple cores, and how do they communicate?



Multicore

Overview
• In a multicore processor, multiple CPU cores are provided within the chip.
• Cores are connected together through some form of interconnect.
• Cores share some components on chip.
  • For example, the memory interface, or a cache level
[Diagram: an example multicore system with four CPU cores on an interconnect that also links I/O, RAM, a peripheral, and a timer.]



Multicore
• Cores in a multicore processor are connected together and can collaborate.
• There are a number of challenges to consider when creating a system like this.
• How do cores communicate with each other?
• How is data synchronized?
• How do we ensure that cores don’t get stale data when it’s been modified by other cores?
• How do cores see the ordering of events coming from different cores?
• We’ll explore each of these by considering the concepts of
• Shared memory and message passing
• Cache coherence
• Memory consistency



Communication



Communication

Message Passing
• In this paradigm, applications running on each core wrap up the data they want to communicate into messages sent to other cores.
• Explicit communication via send and receive operations
• Synchronization is implicit via blocking on messages.

Shared Memory
• In this paradigm, there is a shared memory address space accessible to all cores, where they can read and write data.
• Implicit communication via memory accesses
• Synchronization is performed using atomic memory operations.



Message Passing
• Cores do not rely on a shared memory space.
• Each processing element (PE) has its own core, data, and I/O.
  • Explicit I/O to communicate with other cores
  • Synchronization via sending and receiving messages
• Advantages
  • Less hardware, easier to design
  • Focuses attention on costly non-local operations
[Diagram: each core has a private cache and private memory; cores communicate only over the interconnection network.]



Shared Memory
• Cores share a memory that they see as a single address space.
• Cores may have caches holding data.
• Communication involves reading and writing to locations in memory.
• Synchronization via atomic operations to modify memory
  – Specific instructions provided in the ISA
  – The hardware guarantees correct operation
• Advantages
  • Matches the programmer's view of the system
  • Hardware handles communication implicitly
[Diagram: cores with private caches connected through an interconnection network to a single shared memory.]



Shared Memory with Caches
• Caching shared-memory data has to be handled carefully.
• The cache stores a copy of some of the data in memory.
• In particular, writes to the data must be dealt with correctly.
  • A core may write to data in its own cache.
  • This makes copies in other caches become stale.
  • The version in memory won't get updated immediately either, if it's a write-back cache.
  • In these situations, there is a danger of one core subsequently reading a stale (old) value.
[Diagram: two cores with private caches above a shared cache and memory; cache coherence applies to the private-cache level of the hierarchy.]



Cache Coherence



Cache-coherence Protocol
• The cache-coherence protocol ensures that cores always read the most up-to-date values.
  • This means a core can always find the most recent value for some data, no matter where it is.
  • It describes the actions to take on seeing certain events from other cores.
• Cache coherence is required when caches share some memory.
• The protocols rely on caches seeing events from other cores.
  • In particular, reads and writes to the shared memory
  • Often this is realized through snooping operations on the interconnect.
• The protocol runs on each block in the cache independently.
  • This is the granularity at which commercial implementations track the state information.
• It runs in the private caches that have a shared ancestor.
[Diagram: two cores with private caches above a shared cache or memory; the protocol runs in the highlighted private caches.]
MESI Cache-coherence Protocol
• The MESI protocol is a write-invalidate cache-coherence protocol.
• When writing to a shared location, the related cache block is invalidated in the caches of all other
cores.
• This protocol can manage cache coherence for a specified memory area.
• In general, uses an allocate-on-write cache policy
• New data are loaded into the cache on both read and write misses.
• In general, uses a write-back cache
• So caches can store data that are more recent than the value in memory.
• The MESI protocol defines states for each cache block and transitions between them.
• Four states, corresponding to M, E, S, and I



Modified State (M)
• The local cache holds the only copy of the block, which is also the most recent version of the data; memory holds old (stale) data.
• Note that memory may actually be a shared cache between the cores' caches and main memory.
[Diagram: the local cache holds the current value in state M; the remote cache has no copy; memory holds a stale value and is not coherent with the local cache.]
Exclusive State (E)
• The local cache holds the only copy of the block, which is identical to memory's version.
[Diagram: the local cache holds the current value in state E; the remote cache has no copy; memory holds the same value, so cache and memory are coherent.]
Shared State (S)
• The local cache holds a copy of the block, which is identical to memory's version; other caches may also hold the block in shared state.
[Diagram: both the local and remote caches hold the current value in state S; memory holds the same value, so all copies are coherent.]
Invalid State (I)
• The local cache does not hold a copy of the block.
• This state may not be marked in the cache for each block; it could be inferred by a cache miss on that block.
[Diagram: the local cache holds no copy of the block.]
Invalid to Modified
• Occurs when the local core attempts to write some data to an address not already in the cache
  1. Read-exclusive request
  2. Data response
[Diagram: before, the local cache holds no copy; the local cache issues a read-exclusive request (1), memory supplies the data (2), and afterwards the local cache holds the block in state M.]
Invalid to Exclusive
• Occurs when the core attempts to read data from an address that is not already in the cache and no other cache has it
  1. Read request
  2. Data response
[Diagram: before, the local cache holds no copy; the local cache issues a read request (1), memory supplies the data (2), and afterwards the local cache holds the block in state E.]
Invalid to Shared
• Occurs when the local core attempts to read data from an address that is not already in the cache, but other caches have a copy
• Data are supplied by another cache.
  1. Read request
  2. Data response
[Diagram: before, the remote cache holds the block in state E; the local cache issues a read request (1), the remote cache supplies the data (2), and afterwards both caches hold the block in state S.]
Shared to Modified
• Occurs when the local core attempts to write some data to an address that is already in the cache, and other caches may have a copy
• Needs to invalidate other copies
  1. Read-exclusive request
  2. Invalidate
[Diagram: before, both caches hold the block in state S; the local cache issues a read-exclusive request (1), the remote copy is invalidated (2), and afterwards the local cache holds the block in state M.]
Shared to Invalid
• Occurs when another core attempts to write some data to an address that is already in the cache
• The local cache snoops the read-exclusive request.
  1. Read-exclusive request
  2. Invalidate
[Diagram: before, both caches hold the block in state S; the remote cache issues a read-exclusive request (1), the local copy is invalidated (2), and afterwards the remote cache holds the block in state M.]
Exclusive to Modified
• Occurs when the local core attempts to write some data to an address that is already in the cache, and that's the only copy
• No need to invalidate other caches because we know they don't have a copy
[Diagram: before, the local cache holds the block in state E; the write completes locally with no bus transaction, and afterwards the state is M.]
Exclusive to Shared
• Occurs when another core attempts to read data from an address that this cache has, and it's the only copy
• Data are supplied by the cache after snooping the read request.
  1. Read request
  2. Data response
[Diagram: before, the local cache holds the block in state E; the remote cache issues a read request (1), the local cache supplies the data (2), and afterwards both caches hold the block in state S.]
Exclusive to Invalid
• Occurs when another core attempts to write some data to an address that this cache has the only copy of
• The local cache snoops the read-exclusive request.
  1. Read-exclusive request
  2. Data response
  3. Invalidate
[Diagram: before, the local cache holds the block in state E; the remote cache issues a read-exclusive request (1), the local cache supplies the data (2), the local copy is invalidated (3), and afterwards the remote cache holds the block in state M.]
Modified to Shared
• Occurs when another core attempts to read data from an address that this cache has written to
• Must flush the data back to memory and the requesting cache
  1. Read request
  2. Data response
[Diagram: before, the local cache holds the block in state M; the remote cache issues a read request (1), the local cache flushes the data to memory and to the requester (2), and afterwards both caches hold the block in state S.]
Modified to Invalid
• Occurs when another core attempts to write some data to an address that this cache has altered
• Must flush the data back to memory and the requesting cache
  1. Read-exclusive request
  2. Data response
  3. Invalidate
[Diagram: before, the local cache holds the block in state M; the remote cache issues a read-exclusive request (1), the local cache flushes the data (2), the local copy is invalidated (3), and afterwards the remote cache holds the block in state M.]
Visualizing the Protocol
• We can visualize the protocol in two ways.
• First, a table showing the valid combinations of states that two caches can have for the same block
  • For example, two caches can have it in shared.
  • But if any cache has it in exclusive or modified, then all other caches are invalid.
• Second, we can draw a state-transition diagram to show all events and actions.

      M  E  S  I
  M   ⨯  ⨯  ⨯  ✓
  E   ⨯  ⨯  ⨯  ✓
  S   ⨯  ⨯  ✓  ✓
  I   ✓  ✓  ✓  ✓



MESI State-transition Diagram
• The diagram shows, for each state, the transition triggered by each local and remote read or write:
  • I → E: local core read, no other copies
  • I → S: local core read, remote core has a copy
  • I → M: local core write
  • E → M: local core write
  • E → S: remote core read
  • E → I: remote core write
  • S → M: local core write
  • S → I: remote core write
  • M → S: remote core read
  • M → I: remote core write
• A block stays in its state on the remaining events: M on local reads and writes, E on local reads, S on local and remote reads, and I on remote reads and writes.
Memory Consistency



Memory Consistency
• The purpose of cache coherence is to ensure data propagation and coherence.
• So when data are written by one core, all other cores can later read the correct value.
• When a core attempts to write, others know that their copies are stale.
• When a core attempts to read, others know they must provide their data, if modified.
• The cache-coherence protocol is run independently on each block of data.
• There is no direct interaction between different blocks, as far as the protocol is concerned.
• So what about the order in which data accesses by one core are seen by others?
• If a core performs certain reads and writes to different data, in what order do other cores see them?
• It is the purpose of the memory-consistency model to define this.
• And the job of the memory hierarchy (and core) to implement it



Memory Consistency
• Modern processors may reorder memory operations.
• Out-of-order processing can allow loads to access the cache ahead of older stores.
• Either because the addresses they access don’t match
• Or because the load has been speculatively executed and will be replayed later if a dependence is
found
• This avoids stalling loads unnecessarily, even though their effects can be seen externally.
• By other cores in the system
• Recall module 5 where this was introduced.



Memory Consistency vs Cache Coherence

Cache coherence (data propagation between cores)
  Core 1 accesses issued: Store B, Load A, Store C, Store A, Load C, Load D
  Core 2 accesses issued: Store C, Load A, Load B, Load C, Store D, Load A

Memory consistency (ordering of one core's accesses as seen by another)
  Core 1 accesses issued:            Store B, Load A, Store C, Store A, Load C, Load D
  Core 1's accesses seen by Core 2:  Load A, Load D, Store B, Store C, Load C, Store A (reordered)
Memory Consistency
• The memory consistency model defines valid outcomes of sequences of accesses of the
different cores.
• Sequential consistency (SC) is the strongest and most intuitive model.
• The operations of each core occur in program order, and these are interleaved (at some granularity)
across all cores.
• This means that no loads or stores can bypass other loads or stores.
• SC is overly strong because it prevents many useful optimizations without being needed by most
programs.
• Total store order (TSO) is widely implemented (e.g., x86 architectures).
• The same as sequential consistency but allows a younger load to observe a state of memory in which
the effects of an older store have not yet become observable
• Forms of relaxed consistency have been adopted (e.g., Arm and PowerPC architectures).
• In more relaxed consistency models, other constraints in SC are removed, such as a younger load
observing a state of memory before an older load does.



Case Study: Cortex-A9 MPCore
• Contains up to four Cortex-A9 processors
• Snoop Control Unit
  • Maintains L1 data cache coherency across processors
  • Arbitrates accesses to the L2 memory system, through one or two external 64-bit AXI manager interfaces



Case Study: Heterogeneous Multicore
• big.LITTLE is a heterogeneous processing architecture with two types of cores.
  • "big" cores for high compute performance
  • "LITTLE" cores for power efficiency
• L1 and L2 memory system in cores
• A DynamIQ system contains big and LITTLE cores and a shared unit containing:
  • L3 memory system
  • Control logic
  • External interfaces



Case Study: Cortex-A55
• The Cortex-A55 is a LITTLE core.
• Optionally contains an L2 cache
• Uses the MESI protocol for coherence
  • M: Modified/UniqueDirty (UD) – the line is in only this cache and is dirty.
  • E: Exclusive/UniqueClean (UC) – the line is in only this cache and is clean.
  • S: Shared/SharedClean (SC) – the line is possibly in other caches and is clean.
  • I: Invalid/Invalid (I) – the line is not in this cache.
• The data-cache unit (DCU) stores the MESI state of each cache line.
[Diagram: a cluster of up to eight Cortex-A55 cores (cores 1-7 optional); each core's L1 memory system contains L1 instruction and data caches plus IFU, DPU, DCU, STB, BIU, MMU, PMU, and an optional NEON unit; an optional per-core L2 cache, ETM, optional ELA, and GIC CPU interface connect through asynchronous bridges to the DSU SCU and L3.]
Conclusions
• Multicore processors provide performance from increasing numbers of transistors.
• Performance comes through thread-level parallelism.
• Shared-memory systems are the most common paradigm.
• Cores share a memory and common address space.
• Data written by one core are read by others when accessing the same location.
• Dealing with shared memory in the presence of caches poses a challenge.
• This is where the cache-coherence protocol comes into play.
• We looked at the MESI protocol, but other protocols exist, both simpler and more complex.
• Memory consistency defines the order that reads/writes to different addresses are seen
by the different cores in the system.

