Lect10 SMPCC

Shared memory architectures allow multiple CPUs to access a shared pool of memory. This can cause cache coherence problems if different CPUs have conflicting copies of data in their caches. Cache coherence protocols like MSI resolve this by maintaining consistency between caches and main memory through snooping or a directory-based approach.


Shared Memory Architectures

 Introduce different shared memory architectures.

 Introduce the cache coherence problem and cache coherence protocols.
Shared memory architectures

 Multiple CPUs (or cores)

 One memory with a global address space
o May have many modules (to increase memory bandwidth)
o All CPUs access all memory through the global address space

 All CPUs can make changes to the shared memory
o Changes made by one processor are visible to all other processors
(see the sketch after this list).

 Data parallelism or function parallelism? MIMD supports both.
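
A minimal sketch of this visibility property, assuming POSIX threads and C11 atomics (the variable names are illustrative): one thread stores to a shared location, and another thread, possibly running on a different CPU, observes the new value through the common address space.

```c
/* Minimal sketch: two threads share one address space, so a store by
 * one is observed by the other. C11 atomics are used so the compiler
 * and hardware are obliged to make the store visible. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int shared_value = 0;
static atomic_int ready = 0;

static void *producer(void *arg) {
    (void)arg;
    atomic_store(&shared_value, 42);   /* write to shared memory */
    atomic_store(&ready, 1);           /* publish the write      */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load(&ready))       /* spin until published   */
        ;
    printf("consumer sees %d\n", atomic_load(&shared_value));
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

The atomics stand in for what the hardware memory system guarantees at the architecture level; without them the compiler would be free to keep ready in a register and never re-read it.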


Major issue: how to connect CPUs and memory

 Ideal effect:
o Large memory bandwidth
o Low memory latency

 When accessing remote objects, bandwidth and latency are always the
key metrics. Think of the user experience when downloading many small
files versus one very large file.
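
The trade-off can be made concrete with a rough microbenchmark sketch: dependent loads pay the full memory latency on every access (like many small downloads), while independent sequential loads let the hardware overlap requests and approach peak bandwidth (like one large download). The array size and methodology here are illustrative assumptions; accurate measurement needs core pinning, warm-up, and repeated runs.

```c
/* Rough sketch contrasting memory latency (dependent loads) with
 * bandwidth (streaming loads). Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)             /* 16M entries, ~128 MB: far beyond cache */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    size_t *next = malloc((size_t)N * sizeof *next);

    /* Sattolo's algorithm: a random single-cycle permutation, so each
     * load depends on the previous one and defeats prefetching. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    double t0 = now_sec();
    size_t idx = 0;
    for (long i = 0; i < N; i++)
        idx = next[idx];        /* serialized: pays full latency each time */
    double t_latency = now_sec() - t0;

    t0 = now_sec();
    size_t sum = 0;
    for (long i = 0; i < N; i++)
        sum += next[i];         /* independent loads: hardware overlaps them */
    double t_stream = now_sec() - t0;

    printf("pointer chase: %.2fs, streaming: %.2fs (sink: %zu %zu)\n",
           t_latency, t_stream, idx, sum);
    free(next);
    return 0;
}
```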
Shared memory architectures: UMA and NUMA

• One large memory
– On the same side of the interconnect (mostly a bus)
– Memory references from different CPUs travel the same distance
(same latency)
– Uniform memory access (UMA)

• Many small memories
– Local and remote memory
– Memory references from different CPUs travel different distances
(latency differs)
– Non-uniform memory access (NUMA)
Bus-based UMA Shared memory architecture
 Many processors and memory modules connect to the bus
o This architecture dominated in the server domain in the past.
 Faster processors began to saturate the bus, as bus technology could
not keep up with CPU processing power.
 The bus interconnect may also be replaced by a crossbar interconnect,
but that is expensive.
NUMA Shared memory architecture

 Identical processors, but the time to access memory differs across
different parts of the memory.
 Memory resides in “NUMA domains.”
 Current-generation SMPs adopt the NUMA architecture; for example, in
the AMD EPYC multi-chip module (MCM) processor, memory is distributed
across the modules (see the sketch below).
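
As a sketch of programming against NUMA domains, assuming Linux with the libnuma library installed (compile with -lnuma): memory is allocated explicitly on one domain so that threads pinned there get local-latency accesses.

```c
/* Sketch of NUMA-aware allocation on Linux with libnuma. Placing data
 * in the local NUMA domain avoids the higher latency of remote access. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    printf("NUMA domains: %d\n", numa_max_node() + 1);

    /* Allocate 1 MB directly on node 0; accesses from CPUs in that
     * domain are local, accesses from other domains are remote. */
    size_t size = 1 << 20;
    void *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL)
        return 1;
    /* ... use buf from threads pinned to node 0 ... */
    numa_free(buf, size);
    return 0;
}
```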
Various SMP hardware organizations
Cache Coherence Problem

Because caches hold copies of memory, different processors may see
different values for the same memory location.
 In the classic example (simulated below), processors see different
values for u after event 3.
 With a write-back cache, memory may also hold the stale data.
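
A toy simulation makes the scenario concrete: per-CPU write-back caches with no coherence protocol, replaying the standard event sequence (u starts at 5; P3 writes 7 at event 3). This is an illustration of the problem, not a model of any real machine.

```c
/* Toy simulation of the classic coherence example: write-back caches
 * with NO coherence protocol. P1/P2/P3 are CPUs 0/1/2. */
#include <stdio.h>
#include <stdbool.h>

#define NCPU 3

static int memory_u = 5;                    /* main memory copy of u  */
static struct { bool valid; int value; } cache[NCPU];

static int cpu_read(int cpu) {
    if (!cache[cpu].valid) {                /* miss: fill from memory */
        cache[cpu].value = memory_u;
        cache[cpu].valid = true;
    }
    return cache[cpu].value;                /* hit: may be stale!     */
}

static void cpu_write(int cpu, int v) {
    cache[cpu].value = v;                   /* write-back: memory and */
    cache[cpu].valid = true;                /* other caches not told  */
}

int main(void) {
    printf("event 1: P1 reads u = %d\n", cpu_read(0));
    printf("event 2: P3 reads u = %d\n", cpu_read(2));
    cpu_write(2, 7);                        /* event 3: P3 writes u=7 */
    printf("event 4: P1 reads u = %d (stale cache copy)\n", cpu_read(0));
    printf("event 5: P2 reads u = %d (stale memory copy)\n", cpu_read(1));
    return 0;
}
```

After event 3, P3 sees 7 while P1 still sees its cached 5, and P2 reads 5 from memory because the write-back cache has not flushed the new value.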
Bus Snoopy Cache Coherence protocols
 Memory: centralized with uniform access time and bus interconnect.
Bus Snooping idea

 Send all requests for data to all processors (through the bus)
 Processors snoop to see if they have a copy and respond accordingly.
o Cache listens to both CPU and BUS.
o The state of a cache line may change due to (1) a CPU memory
operation, or (2) a bus transaction (a remote CPU's memory operation).
 Requires broadcast since caching information is at processors.
o Bus is a natural broadcast medium.
o Bus (centralized medium) also serializes requests.
Types of snoopy bus protocols
 Write-invalidate protocols
o Write to shared data: an invalidate is sent on the bus (all caches
snoop and invalidate their copies).
 Write-broadcast protocols (typically write-through)
o Write to shared data: the new value is broadcast on the bus;
processors snoop and update any copies.
An Example Snoopy Protocol (MSI)
 Invalidation protocol, write-back cache
 Each block of memory is in one state
o Clean in all caches and up-to-date in memory (shared)
o Dirty in exactly one cache (exclusive)
o Not in any cache

 Each cache block is in one state:
o Shared: the block can be read
o Exclusive: this cache has the only copy; it is writable and dirty
o Invalid: the block contains no valid data

 Read misses cause all caches to snoop the bus (bus transaction).

 A write to a shared block is treated as a miss (needs a bus transaction).
MSI protocol state machine for CPU requests
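
A sketch of the MSI transitions in code, covering both sides mentioned earlier: requests from the local CPU and transactions snooped from the bus. Bus-transaction generation and dirty-data flushes appear only as comments; the state and function names are illustrative.

```c
/* Sketch of the MSI state machine: one function per side. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state;
typedef enum { CPU_READ, CPU_WRITE } cpu_op;
typedef enum { BUS_RD, BUS_RDX } bus_op;     /* read / read-exclusive */

/* Transition on a request from the local CPU. */
static msi_state on_cpu(msi_state s, cpu_op op) {
    switch (s) {
    case INVALID:
        return (op == CPU_READ) ? SHARED     /* miss: issue BusRd     */
                                : MODIFIED;  /* miss: issue BusRdX    */
    case SHARED:
        return (op == CPU_READ) ? SHARED     /* hit: no bus traffic   */
                                : MODIFIED;  /* issue BusRdX to
                                                invalidate other copies */
    case MODIFIED:
        return MODIFIED;                     /* hit: no bus traffic   */
    }
    return s;
}

/* Transition on a transaction snooped from the bus (a remote CPU). */
static msi_state on_bus(msi_state s, bus_op op) {
    switch (s) {
    case SHARED:
        return (op == BUS_RD) ? SHARED       /* others may also read  */
                              : INVALID;     /* remote CPU will write */
    case MODIFIED:
        /* We hold the only up-to-date copy: flush it to the bus.     */
        return (op == BUS_RD) ? SHARED : INVALID;
    case INVALID:
        return INVALID;                      /* not our block         */
    }
    return s;
}

int main(void) {
    msi_state s = INVALID;
    s = on_cpu(s, CPU_READ);   /* I -> S: read miss, BusRd            */
    s = on_cpu(s, CPU_WRITE);  /* S -> M: write, BusRdX invalidates   */
    s = on_bus(s, BUS_RD);     /* M -> S: remote read, we flush       */
    printf("final state: %d (SHARED)\n", s);
    return 0;
}
```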
MSI cache coherence protocol variations

 Basic MSI protocol
o Three states: M, S, I.
o Can be optimized by refining the states so as to reduce bus
transactions in some cases.
 Berkeley protocol (five states)
o M is refined into owned exclusive and owned shared.
 MESI protocol (four states)
o M is refined into Modified and Exclusive.
 MESIF: the protocol used in Intel processors
o MESI + S is refined into S and F (the F cache is the responder for a
request).
 MOESI: the protocol used in AMD processors
Multiple levels of caches

 Most processors today have on-chip L1 and L2 caches.


 Transactions on the L1 cache are not visible on the bus (a separate
snooper for L1 coherence would be expensive).
 Typical solution:
o Maintain the inclusion property between the L1 and L2 caches so that
all bus transactions relevant to L1 are also relevant to L2: it is then
sufficient to use only the L2 controller to snoop the bus.
o Propagate relevant transactions through the cache hierarchy to
maintain coherence.
Large shared memory multiprocessors

• The interconnection network is usually not a bus.
• No broadcast medium → snooping is not possible.
• Needs a different kind of cache coherence protocol.
Cache coherence for large SMPs

 Similar idea to the MSI protocol, but the interconnect has no broadcast.
o Use a directory to record where each memory line is cached and who its
owner is (see the sketch after this list).

 A directory entry per memory line tracks the state of every cached copy
of that block.
o Tracking the state of all memory blocks means directory size =
O(memory size).

 Need to use a distributed directory
o A centralized directory becomes the bottleneck.
 Who is the central authority (home node) for a given cache line?

 Such machines are typically called cc-NUMA multiprocessors.

Performance implications of shared memory
architecture
 The NUMA architecture can have a very large impact on performance.
 The cache coherence protocol can also have an impact:
o Memory writes become even more expensive.
o False sharing (see the sketch below).
 One thread's cache behavior can affect other threads' performance.
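
A sketch of false sharing, assuming a 64-byte cache line (a common size, but an assumption here): two threads increment adjacent counters that land in the same line, so the coherence protocol bounces the line between the cores even though the threads never touch each other's data. Padding separates the counters. Compile with -pthread and compare run times of the two layouts.

```c
/* Sketch of false sharing. Both threads write only their own counter,
 * but when the counters share a cache line, each write invalidates the
 * peer's copy of the line. Swap in `padded` to compare. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

static struct { volatile long a; volatile long b; } same_line;
static struct { volatile long a; char pad[64]; volatile long b; } padded;

static void *bump(void *arg) {
    volatile long *p = arg;
    for (long i = 0; i < ITERS; i++)
        (*p)++;                  /* ping-pongs the line when falsely shared */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    (void)padded;                /* alternative layout for comparison */
    pthread_create(&t1, NULL, bump, (void *)&same_line.a);
    pthread_create(&t2, NULL, bump, (void *)&same_line.b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", same_line.a, same_line.b);
    return 0;
}
```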
Summary

 Shared memory architectures
o UMA and NUMA
o Bus-based systems and interconnect-based systems

 Cache coherence problem


 Cache coherence protocols
o Snoopy bus
o Directory based
