

IT6030 Advanced Computer Architecture
Chapter 5: Multiprocessing

Nguyen Kim Khanh
Department of Computer Engineering
School of Information and Communication Technology
Hanoi University of Science and Technology

Flynn’s Classification (1966)

Broad classification of parallel computing systems:
• SISD: Single Instruction, Single Data
– conventional uniprocessor
• SIMD: Single Instruction, Multiple Data
– one instruction stream, multiple data paths
– distributed memory SIMD
– shared memory SIMD (vector computers)
• MISD: Multiple Instruction, Single Data
– not a practical configuration
• MIMD: Multiple Instruction, Multiple Data
– shared memory machines
– message passing machines
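
To make the SISD/SIMD distinction concrete, here is a minimal C sketch (mine, not from the lecture). It assumes GCC/Clang vector extensions, but any SIMD intrinsics would make the same point: one instruction stream driving multiple data paths.

#include <stdio.h>
#include <string.h>

/* 4 floats processed per instruction (GCC/Clang vector extension) */
typedef float v4f __attribute__((vector_size(16)));

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    /* SISD: one instruction stream, one data item per instruction */
    for (int i = 0; i < 8; i++)
        c[i] = a[i] + b[i];

    /* SIMD: one instruction stream, multiple data paths (4 lanes) */
    for (int i = 0; i < 8; i += 4) {
        v4f va, vb, vc;
        memcpy(&va, &a[i], sizeof va);
        memcpy(&vb, &b[i], sizeof vb);
        vc = va + vb;                  /* one add, four elements */
        memcpy(&c[i], &vc, sizeof vc);
    }

    printf("c[0]=%g c[7]=%g\n", c[0], c[7]);
    return 0;
}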

Parallel Architecture

• “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
• Parallel Architecture = Computer Architecture + Communication Architecture
• 2 classes of multiprocessors:
1. Centralized Memory Multiprocessor
• < few dozen processor chips (and < 100 cores) in 2006
• Small enough to share a single, centralized memory
2. Physically Distributed-Memory Multiprocessor
• Larger number of chips and cores than 1.
• BW demands ⇒ memory distributed among processors

Centralized Shared Memory Multiprocessor

(figure: several processors, each with private caches, connected by a shared bus to one centralized main memory and the I/O system)


Centralized Shared Memory Multiprocessor

• Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors
• Large caches ⇒ a single memory can satisfy the memory demands of a small number of processors
• Can scale to a few dozen processors by using a switch and many memory banks
• Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases

Distributed Memory Multiprocessor

(figure: multiple nodes, each containing a processor, caches, local memory, and I/O, connected by an interconnection network)

2 Models for Communication and Memory Architecture

1. Communication occurs by explicitly passing messages among the processors:
message-passing multiprocessors
2. Communication occurs through a shared address space (via loads and stores):
shared memory multiprocessors, either
• UMA (Uniform Memory Access time) for shared address, centralized memory MP
• NUMA (Non-Uniform Memory Access time) for shared address, distributed memory MP
• In the past, there was confusion over whether “sharing” means sharing physical memory (Symmetric MP) or sharing address space

Symmetric Shared-Memory Architectures

• From multiple boards on a shared bus to multiple processors inside a single chip
• Caches both
– private data used by a single processor
– shared data used by multiple processors
• Caching shared data
⇒ reduces latency to shared data, memory bandwidth for shared data, and interconnect bandwidth
⇒ cache coherence problem
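
To make the two communication models concrete, here is a minimal sketch (mine, assuming POSIX threads, pipe, and fork): the shared-memory half communicates through ordinary loads and stores to one address space, while the message-passing half communicates only through an explicit send and receive.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int shared_counter = 0;                      /* lives in the shared address space */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    pthread_mutex_lock(&lock);               /* communicate via loads/stores */
    shared_counter++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    /* Shared memory model: both threads address the same variable. */
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    printf("shared_counter = %d\n", shared_counter);

    /* Message passing model: no shared data, explicit send/receive. */
    int fd[2], msg = 99;
    if (pipe(fd) != 0) return 1;
    if (fork() == 0) {                       /* child acts as a separate node */
        int in;
        read(fd[0], &in, sizeof in);         /* explicit receive */
        printf("received %d\n", in);
        _exit(0);
    }
    write(fd[1], &msg, sizeof msg);          /* explicit send */
    wait(NULL);
    return 0;
}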


Example Cache Coherence Problem

(figure: processors P1, P2, P3 with write-back caches on a bus to memory; events: 1. P1 reads u and caches 5; 2. P3 reads u and caches 5; 3. P3 writes u = 7 in its cache; 4. P1 reads u and still sees 5; 5. P2 reads u from memory and sees 5)

• Processors see different values for u after event 3
• With write-back caches, the value written back to memory depends on happenstance of which cache flushes or writes back its value and when
» Processes accessing main memory may see a very stale value
• Unacceptable for programming, and it’s frequent!

Intuitive Memory Model

(figure: a processor P with an L1 cache holding 100:67, an L2 cache holding 100:35, and memory holding 100:34, plus disk and I/O devices; three different values for the same address)

• Reading an address should return the last value written to that address
– Easy in uniprocessors, except for I/O
• Too vague and simplistic; 2 issues
1. Coherence defines the values returned by a read
2. Consistency determines when a written value will be returned by a read
• Coherence defines behavior to the same location; Consistency defines behavior to other locations
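
The event sequence in the figure can be re-enacted in a few lines of C. This toy model (mine, not the lecture's) gives each processor a one-word write-back cache and no coherence protocol, so the reads in events 4 and 5 return exactly the stale values described above.

#include <stdio.h>

#define NOT_CACHED -1

int memory_u = 5;                       /* main memory copy of u */
int cache_u[3] = {NOT_CACHED, NOT_CACHED, NOT_CACHED};  /* P1..P3 caches */

int proc_read(int p) {
    if (cache_u[p] == NOT_CACHED)
        cache_u[p] = memory_u;          /* miss: fill from memory */
    return cache_u[p];                  /* hit: possibly stale local copy */
}

void proc_write(int p, int v) {
    cache_u[p] = v;                     /* write-back: memory not updated */
}

int main(void) {
    proc_read(0);                       /* event 1: P1 reads u, caches 5 */
    proc_read(2);                       /* event 2: P3 reads u, caches 5 */
    proc_write(2, 7);                   /* event 3: P3 writes u = 7 */
    printf("P1 sees u = %d\n", proc_read(0));  /* event 4: stale 5 from cache */
    printf("P2 sees u = %d\n", proc_read(1));  /* event 5: stale 5 from memory */
    return 0;
}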

Defining Coherent Memory System

1. Preserve Program Order: A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
2. Coherent view of memory: A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
3. Write serialization: 2 writes to the same location by any 2 processors are seen in the same order by all processors
– If not, a processor could keep value 1, since it saw it as the last write
– For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1

Write Consistency

• For now assume
1. A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write
2. The processor does not change the order of any write with respect to any other memory access
⇒ if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A
• These restrictions allow the processor to reorder reads, but force the processor to finish writes in program order
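
The A-followed-by-B guarantee above is exactly what release/acquire synchronization provides in modern C. A minimal sketch (mine, using C11 atomics and POSIX threads): if the reader observes the new value of B, it is guaranteed to observe the new value of A as well.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int A = 0;                 /* plain data: written first (location A) */
atomic_int B = 0;          /* flag: written second (location B) */

void *writer(void *arg) {
    A = 42;                                              /* write A */
    atomic_store_explicit(&B, 1, memory_order_release);  /* then write B */
    return NULL;
}

void *reader(void *arg) {
    while (atomic_load_explicit(&B, memory_order_acquire) == 0)
        ;                  /* spin until the new value of B is visible */
    printf("A = %d\n", A); /* must print 42: seeing new B implies new A */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, reader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}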


Basic Schemes for Enforcing Coherence

• A program on multiple processors will normally have copies of the same data in several caches
– Unlike I/O, where that is rare
• Rather than trying to avoid sharing in SW, SMPs use a HW protocol to maintain coherent caches
– Migration and Replication are key to the performance of shared data
• Migration - data can be moved to a local cache and used there in a transparent fashion
– Reduces both the latency to access shared data that is allocated remotely and the bandwidth demand on the shared memory
• Replication – for shared data being simultaneously read, since caches make a copy of the data in the local cache
– Reduces both latency of access and contention for read-shared data

2 Classes of Cache Coherence Protocols

1. Directory based — Sharing status of a block of physical memory is kept in just one location, the directory
2. Snooping — Every cache with a copy of the data also has a copy of the sharing status of the block, but no centralized state is kept
• All caches are accessible via some broadcast medium (a bus or switch)
• All cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access
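
As a small illustration of "sharing status kept in just one location", a directory entry per memory block might look like the C struct below (a sketch; the field names are mine, not from the lecture). Snooping keeps the equivalent information distributed across every cache instead.

#include <stdint.h>

enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

/* One directory entry per block of physical memory, kept at one place
 * (the block's home node). */
struct dir_entry {
    enum dir_state state;  /* sharing status of this memory block */
    uint64_t presence;     /* bit i set => cache of processor i holds a copy */
    int owner;             /* owning processor when state == DIR_EXCLUSIVE */
};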

Snoopy Cache-Coherence Protocols

(figure: processors P1 … Pn, each with a cache, on a shared bus to memory and I/O devices; each cache controller snoops the state, address, and data lines while ordinary cache-memory transactions use the same bus)

• Cache controller “snoops” all transactions on the shared medium (bus or switch)
– a transaction is relevant if it is for a block the cache contains
– take action to ensure coherence
» invalidate, update, or supply value
– the action depends on the state of the block and the protocol
• Either get exclusive access before a write via write invalidate, or update all copies on a write

Example: Write-thru Invalidate

(figure: the same u example, now with write-through invalidate; when P3 writes u = 7, the bus write updates memory to 7 and invalidates the other cached copies of u = 5, so subsequent reads of u return 7)

• Must invalidate before step 3
• Write update uses more broadcast medium BW
⇒ all recent MPUs use write invalidate


Architectural Building Blocks

• Cache block state transition diagram
– FSM specifying how the disposition of a block changes
» invalid, valid, dirty
• Broadcast Medium Transactions (e.g., bus)
– Fundamental system design abstraction
– Logically a single set of wires connecting several devices
– Protocol: arbitration, command/addr, data
⇒ Every device observes every transaction
• Broadcast medium enforces serialization of read or write accesses ⇒ Write serialization
– 1st processor to get the medium invalidates the others’ copies
– Implies a write cannot complete until it obtains the bus
– All coherence schemes require serializing accesses to the same cache block
• Also need to find the up-to-date copy of a cache block

Locate up-to-date copy of data

• Write-through: get the up-to-date copy from memory
– Write-through is simpler if there is enough memory BW
• Write-back is harder
– The most recent copy can be in a cache
• Can use the same snooping mechanism
1. Snoop every address placed on the bus
2. If a processor has a dirty copy of the requested cache block, it provides it in response to a read request and aborts the memory access
– Complexity comes from retrieving the cache block from a processor cache, which can take longer than retrieving it from memory
• Write-back needs lower memory bandwidth
⇒ Supports larger numbers of faster processors
⇒ Most multiprocessors use write-back

Cache Resources for WB Snooping

• Normal cache tags can be used for snooping
• A valid bit per block makes invalidation easy
• Read misses are easy since they rely on snooping
• Writes ⇒ need to know whether any other copies of the block are cached
– No other copies ⇒ no need to place the write on the bus for WB
– Other copies ⇒ need to place an invalidate on the bus
• To track whether a cache block is shared, add an extra state bit associated with each cache block, like the valid bit and dirty bit
– Write to a Shared block ⇒ need to place an invalidate on the bus and mark the cache block as private (if an option)
– No further invalidations will be sent for that block
– This processor is called the owner of the cache block
– The owner then changes state from shared to unshared (or exclusive)
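
In code, the per-block bookkeeping described above might look like the struct below (a sketch; the field names are mine): the usual valid and dirty bits plus the extra shared bit.

#include <stdint.h>

/* Tag-side state of one cache block for write-back snooping (illustrative). */
struct cache_line {
    uint64_t tag;        /* address tag, checked by the CPU and by snoops */
    unsigned valid : 1;  /* block holds data; clearing this invalidates it */
    unsigned dirty : 1;  /* block modified; must be written back on eviction */
    unsigned shared : 1; /* other caches may hold copies; a write to a shared
                            block places an invalidate on the bus, clears this
                            bit, and makes this cache the owner */
};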


Cache behavior in response to bus

• Every bus transaction must check the cache-address tags
– could potentially interfere with processor cache accesses
• One way to reduce interference is to duplicate the tags
– One set for cache accesses, one set for bus accesses
• Another way to reduce interference is to use the L2 tags
– Since L2 is less heavily used than L1
⇒ Every entry in the L1 cache must be present in the L2 cache, called the inclusion property
– If the snoop gets a hit in the L2 cache, it must arbitrate for the L1 cache to update the state and possibly retrieve the data, which usually requires a stall of the processor

Example Protocol

• A snooping coherence protocol is usually implemented by incorporating a finite-state controller in each node
• Logically, think of a separate controller associated with each cache block
– That is, snooping operations or cache requests for different blocks can proceed independently
• In implementations, a single controller allows multiple operations to distinct blocks to proceed in interleaved fashion
– that is, one operation may be initiated before another is completed, even though only one cache access or one bus access is allowed at a time

Write-through Invalidate Protocol

• 2 states per block in each cache
– as in a uniprocessor
– the state of a block is a p-vector of states
– hardware state bits are associated with blocks that are in the cache
– other blocks can be seen as being in the invalid (not-present) state in that cache
• Writes invalidate all other cache copies
– can have multiple simultaneous readers of a block, but a write invalidates them

(state diagram, states V (valid) and I (invalid): V self-loops on PrRd/-- and PrWr/BusWr; V → I on BusWr/--; I → V on PrRd/BusRd; I self-loops on PrWr/BusWr)

PrRd: Processor Read; PrWr: Processor Write
BusRd: Bus Read; BusWr: Bus Write

Is 2-state Protocol Coherent?

• The processor only observes the state of the memory system by issuing memory operations
• Assume bus transactions and memory operations are atomic, and a one-level cache
– all phases of one bus transaction complete before the next one starts
– the processor waits for a memory operation to complete before issuing the next
– with a one-level cache, assume invalidations are applied during the bus transaction
• All writes go to the bus + atomicity
– Writes are serialized by the order in which they appear on the bus (bus order)
⇒ invalidations are applied to caches in bus order
• How to insert reads in this order?
– Important since processors see writes through reads, so this determines whether write serialization is satisfied
– But read hits may happen independently and do not appear on the bus or enter directly in bus order
• Let’s understand other ordering issues
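
A minimal sketch (mine) of the V/I state machine in the diagram above, assuming write-through with no write-allocate: every processor write generates a bus write, and a snooped bus write invalidates the local copy.

#include <stdio.h>

typedef enum { I, V } State;                 /* Invalid, Valid */
typedef enum { PrRd, PrWr, BusWr } Event;    /* processor / snooped */

/* Next state for one block; *bus is set when the transition's action
 * (the part after the '/') places a transaction on the bus. */
State next_state(State s, Event e, int *bus) {
    *bus = 0;
    switch (e) {
    case PrRd:
        if (s == I) *bus = 1;   /* PrRd/BusRd: miss, fill from memory */
        return V;               /* PrRd/-- on a hit */
    case PrWr:
        *bus = 1;               /* PrWr/BusWr: write-through to bus */
        return s;               /* no write-allocate: state unchanged */
    case BusWr:
        return I;               /* BusWr/--: another cache wrote */
    }
    return s;
}

int main(void) {
    int bus;
    State s = I;
    s = next_state(s, PrRd, &bus);   /* I -> V via BusRd */
    s = next_state(s, BusWr, &bus);  /* remote write invalidates: V -> I */
    printf("final state: %s\n", s == V ? "V" : "I");
    return 0;
}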


Ordering

(figure: operations of three processors interleaved in bus order:
P0: R R R W R R
P1: R R R R R W
P2: R R R R R R)

• Writes establish a partial order
• Doesn’t constrain the ordering of reads, though the shared medium (bus) will order read misses too
– any order among reads between writes is fine, as long as it is in program order

Example Write Back Snoopy Protocol

• Invalidation protocol, write-back cache
– Snoops every address on the bus
– If it has a dirty copy of the requested block, provides that block in response to the read request and aborts the memory access
• Each memory block is in one state:
– Clean in all caches and up-to-date in memory (Shared)
– OR Dirty in exactly one cache (Exclusive)
– OR Not in any caches
• Each cache block is in one state (track these):
– Shared: block can be read
– OR Exclusive: cache has the only copy, it is writeable, and dirty
– OR Invalid: block contains no data (as in a uniprocessor cache)
• Read misses: cause all caches to snoop the bus
• Writes to clean blocks are treated as misses

Write-Back State Machine - CPU Requests

• State machine for CPU requests for each cache block
• Non-resident blocks are invalid

(state diagram, states Invalid, Shared (read/only), Exclusive (read/write):
– Invalid → Shared on CPU Read: place read miss on bus
– Invalid → Exclusive on CPU Write: place write miss on bus
– Shared → Shared on CPU Read hit
– Shared → Exclusive on CPU Write: place write miss on bus
– Exclusive → Exclusive on CPU read hit or CPU write hit
– Exclusive → Exclusive on CPU Write Miss: write back cache block, place write miss on bus)

Write-Back State Machine - Bus Requests

• State machine for bus requests for each cache block

(state diagram:
– Shared → Invalid on Write miss for this block
– Exclusive → Invalid on Write miss for this block: write back block (abort memory access)
– Exclusive → Shared on Read miss for this block: write back block (abort memory access))


Block-replacement

• State machine for CPU requests for each cache block, now including block replacement

(additional transitions:
– Shared → Shared on CPU Read miss (replacing the block): place read miss on bus
– Exclusive → Shared on CPU Read miss: write back block, place read miss on bus)

Write-back State Machine - III

• Combined state machine for CPU requests and bus requests for each cache block:
– Invalid → Shared on CPU Read: place read miss on bus
– Invalid → Exclusive on CPU Write: place write miss on bus
– Shared → Shared on CPU Read hit; on CPU Read miss, place read miss on bus
– Shared → Exclusive on CPU Write: place write miss on bus
– Shared → Invalid on Write miss for this block
– Exclusive → Exclusive on CPU read hit or CPU write hit; on CPU Write Miss, write back cache block and place write miss on bus
– Exclusive → Shared on CPU Read miss: write back block, place read miss on bus
– Exclusive → Shared on Read miss for this block: write back block (abort memory access)
– Exclusive → Invalid on Write miss for this block: write back block (abort memory access)
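
The combined machine above can be sketched as one controller per cache block. This C sketch is my own reconstruction (single block; the replacement transitions are omitted for brevity): CPU events come from the local processor, bus events from snooping.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } State;

typedef enum {
    CpuRead, CpuWrite,              /* requests from the local CPU */
    BusReadMiss, BusWriteMiss       /* snooped from other processors */
} Event;

State next_state(State s, Event e) {
    switch (e) {
    case CpuRead:
        if (s == INVALID) {
            printf("place read miss on bus\n");
            return SHARED;
        }
        return s;                   /* read hit in Shared or Exclusive */
    case CpuWrite:
        if (s != EXCLUSIVE)         /* Invalid or Shared */
            printf("place write miss on bus\n");
        return EXCLUSIVE;           /* write hit stays Exclusive */
    case BusReadMiss:
        if (s == EXCLUSIVE) {
            printf("write back block; abort memory access\n");
            return SHARED;          /* another reader: demote to Shared */
        }
        return s;
    case BusWriteMiss:
        if (s == EXCLUSIVE)
            printf("write back block; abort memory access\n");
        return INVALID;             /* another writer: invalidate */
    }
    return s;
}

int main(void) {
    State s = INVALID;
    s = next_state(s, CpuRead);     /* Invalid -> Shared */
    s = next_state(s, CpuWrite);    /* Shared -> Exclusive */
    s = next_state(s, BusReadMiss); /* Exclusive -> Shared, write back */
    printf("final state: %d\n", s);
    return 0;
}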
