Parallel Computer Architecture A Hardware-Software
Homework
Write two procedures -- Put_Task() and Get_Task() -- in any language, e.g. C, using Fetch&Add for synchronization, as follows:
• Define TD[1..n], the ToDo array; n = c * Processors for some constant c
• Define FF, for "first free," pointing to the first free cell in TD
• Define NA, for "next available," pointing to the next task in TD
• Put_Task(a) takes a task as input and places it in TD; Get_Task() returns the next task from TD if there is one
• The management of TD is completely decentralized
Homework
Strategy: TD[1..n] is several times larger than the number of processors; the system assumes at least 1 producer and 1 consumer
Put_Task(a) {
    slot = Fetch&Add(FF, 1);
    if slot == n then Fetch&Add(FF, -n);
    if slot > n then slot = slot - n;
    while TD[slot] != 0 do wait(rand());
    TD[slot] = a;
}

Get_Task() {
    var temp;
    slot = Fetch&Add(NA, 1);
    if slot == n then Fetch&Add(NA, -n);
    if slot > n then slot = slot - n;
    while TD[slot] == 0 do wait(rand());
    temp = TD[slot]; TD[slot] = 0;
    return temp;
}

Put waits if the slot is occupied, to avoid overrun; Get waits if the slot is empty. No task id is 0, since 0 marks an empty slot.
Shared Memory
Shared memory was claimed to be a poor model
because it does not scale
– Many vendors have sold small shared memory
machines
• Some like SMPs work well (but modeled poorly by PRAM)
• Some never worked -- KSR
• Some worked because of a technology opportunity -- slow
processors with a “fast interconnect”
• Some work on small scale, but not beyond 64 processors
and everyone tries to ignore that fact -- Origin-2000
– Many researchers have come up with great ideas,
but they still remain unproved
Architecture of an SMP
• A symmetric multiprocessor (SMP) is a set of
processor/cache pairs connected to a bus
• The bus is both good news and bad news
• The (memory) bus is a point at which all processors can
“see” memory activity, and can know what is happening
• A bus is used “serially,” and becomes a “bottleneck,”
limiting scaling
[Figure: processors P0..P3, each with a cache, connected by a bus to memory]
Recall Caches
• Cache blocks (lines) contain several words
• Blocks have state
  – Valid
  – Invalid
  – Dirty = differs from memory
• Cache writing
  – Write-through means update memory on all writes
  – Write-back means wait and update memory when the block is replaced or invalidated
  – “allocate” vs “no-allocate” on a write miss
Cache Coherence -- The Problem
• Processors can modify shared locations
without other processors being aware of it
unless special hardware is added
Cache Coherency -- The Goal
A multiprocessor memory system is coherent if
for every location there exists a serial order
for the operations on that location consistent
with the results of the execution such that
• The subsequence of operations for any processor is in
the order issued
• The value returned by each read is the value written by
the last write in serial order
Example serial order for one location: p1:i, p3:j, p2:k, p1:i+1, p3:j+1, ...
[Figure: P1, P2, P3 caching location a (value 4), issuing writes (w:a) and reads (r:a) to memory]
Write Serialization
Implied property of Cache Coherency: Write Serialization
… all writes to a location are seen in the same order by all processors
Snooping To Solve Coherency
• The cache controllers can “snoop” on the
bus, meaning that they watch the events on
the bus even if they do not issue them, noting
any action relevant to cache lines they hold
• There are two possible actions when a
location held by processor A is changed by
processor B
• Invalidate -- mark the local copy as invalid
• Update -- make the same change B made
The unit of cache coherency is a cache line or block
Snooping
When the cache controller “snoops,” it sees requests
by its own processor as well as bus activity by
other processors that is not local to it
[Figure: P1, P2, P3 on the bus above memory; arrows distinguish activity from the local processor and activity from others]
Snooping At Work I
By snooping, the cache controller for processor
P3 can take action in response to P1’s write
• P1 reads a into its cache
• P3 reads a into its cache
• P1 changes a to 5 and writes through to main memory; P3 sees the action and invalidates the location
[Figure: P1’s cache holds a = 5; P3’s stale copy of a = 4 is marked invalid; memory goes from 4 to 5]
Snooping At Work II
By snooping, the cache controller for processor
P3 can take action in response to P1’s write
• P1 reads a into its cache
• P3 reads a into its cache
• P1 changes a to 5 and writes through to main memory; P3 sees the action and invalidates the location or updates it
[Figure: same sequence; under update, P3’s copy becomes 5 instead of being invalidated]
Write-through Coherency
• State diagrams show the protocol
Partial Order On Memory Operations
[Figure: per-processor sequences of reads (R) and writes (W) to a location, forming a partial order across processors]
Memory Consistency
• What should it mean for processors to see a
consistent view of memory?
• Coherency is too weak because it only
requires ordering with respect to individual
locations, but there are other ways of binding
values together

P0: [a, flag initially 0]
    a := 1;
    flag := 1;

P1:
    while (flag == 0) { };   (then uses a)

Coherency requires only that the 0 --> 1 transition on flag is seen consistently; it does not by itself order the write of a before the write of flag
Basic Write-back Snoopy Cache Design
• Write-back protocols are more complex than
write-through because modified data remains
in the cache
• Introduce more cache states to handle that:
• Modified, or dirty -- the value differs from memory
• Exclusive -- no other cache has this location
• Consider an MSI protocol with three states:
• Modified -- data is correct locally, different from memory
• Shared (Valid) -- data at this location is correct
• Invalid -- data at this location is not correct
MSI Protocol
• Rdx means that the cache holds a modified value of
the location and asks for exclusive permission to read
• Reply means put the value on the bus for another
processor to read

Transitions (processor or snooped bus event / resulting bus action):
  M: PrRd/--; PrWr/--; snooped BusRd/Reply -> S; snooped BusRdx/Reply -> I
  S: PrRd/--; PrWr/BusRdx -> M; snooped BusRd/--; snooped BusRdx/-- -> I
  I: PrRd/BusRd -> S; PrWr/BusRdx -> M
MSI Protocol In Action
Action    P0   P1   P2   Bus    Data From
P0:r a    S    -    -    BRd    Mem
P2:r a    S    -    S    BRd    Mem
P2:w a    I    -    M    BRdx   --
P0:r a    S    -    S    BRd    P2
P1:r a    S    S    S    BRd    Mem
Critique of MSI
Bad: 2 bus ops to load and then update a value, even
without any sharing

Action    P0   Pi   Bus    Data From
P0:r a    S    -    BRd    Mem
P0:w a    M    -    BRdx   --
Break
Illinois Protocol

Notation: P = processor, B = bus, Rd = Read, Rdx = Read Exclusive,
RdS = Read (shared line asserted), RdS' = Read (shared line not asserted),
Rp = Reply, Rp' = Reply by some other cache (“someone”)

Adds an Exclusive (E) state to MSI: a read miss that no other cache holds
fills the line in E, and a later local write moves E -> M silently, with no
bus operation

Action    P0   Pi   Bus    Data From
P0:r a    E    -    BRd    Mem
P0:w a    M    -    --     --
Alternative … Updating
• One caching issue is “invalidation” vs “update”:
Dragon

States: E (Exclusive), Sc (Shared Clean), Sm (Shared Modified), M (Modified)

Action    P1   P2   P3   Bus    Data From
P1:r a    E    -    -    BRd    Mem
P3:r a    Sc   -    Sc   BRd    Mem
P3:w a    Sc   -    Sm   BUpd   P3
P1:r a    Sc   -    Sm   null   --
P2:r a    Sc   Sc   Sm   BRd    P3
Invalidation vs Update
1 Repeat k times: P1 writes V, P2..Pp read V
  … perhaps representing work allocation
2 Repeat k times: P1 writes V M times, P2 reads
  … perhaps representing a sharing pair

Costs: invalidate = 6B, update = 14B, miss = 70B; P = 16, M = 10, k = 10

Scenario 1: Update = 1,260B; Invalidate = 10,624B
Scenario 2: Update = 1,400B; Invalidate = 824B
Implications of Blocksize
True/False Sharing
• If two processors reference the same cache
line and the same word, they are “truly”
sharing
• If two processors reference the same cache
line but a different word, they are “falsely”
sharing
[Figure: one cache line of several words; P0 references one word, P1 a different word in the same line]
Discussion
[Figure: P0..P7 connected to memory holding locations A, B, C]
Summary
• SMPs solve shared memory by snooping
• Key to SMP’s success is the bus, a site for
serializing memory references
• Buses work, but only for a small number of
processors (64 is an upper limit, and fewer is
better)
• Relative to the two requirements of shared
memory -- acceptable costs, coherency -- the
SMP meets both