Multiprocessors and Thread-Level Parallelism

The document discusses multiprocessor architectures, focusing on the challenges and advantages of various designs, including shared and distributed memory systems. It highlights the importance of cache coherence protocols and the role of thread-level parallelism in improving performance for data-intensive applications. Additionally, it covers the evolution of multicore processors and the complexities of communication among processors in multiprocessor environments.


• Diminishing returns in exploiting ILP

• Concern over power

• Interest in servers and their performance
• Data-intensive applications
• Improved understanding of how to use multiprocessors effectively
 Problems:
1. Multiprocessor architecture is a large and diverse field, much of it
   still in its youth, with more approaches failing than succeeding.
2. Broad coverage would necessarily entail discussing approaches that
   may not stand the test of time.

Focus: the mainstream of multiprocessor design – machines with a small
to medium number of processors (4 to 32).
Multiprocessors
 Computers consisting of tightly coupled processors whose coordination
   and usage are typically controlled by a single OS and that share
   memory through a shared address space.
 Exploit thread-level parallelism through two different software models:
◦ Parallel processing – execution of a tightly coupled set of threads
   collaborating on a single task.
◦ Request-level parallelism – execution of multiple, independent
   processes that may originate from one or more users.
 RLP may be exploited by a single application running on multiple
   processors, such as a database responding to queries, or by multiple
   applications running independently, often called multiprogramming.
1. Single instruction stream, single data stream (SISD) – uniprocessor.

2. Single instruction stream, multiple data streams (SIMD) – the same
   instruction is executed by multiple processors on different data
   streams – data-level parallelism. Each processor has its own data
   memory, but there is a single instruction memory and control
   processor.

3. Multiple instruction streams, single data stream (MISD) – no
   commercial machine of this type has been built.

4. Multiple instruction streams, multiple data streams (MIMD) – each
   processor fetches its own instructions and operates on its own data.
   Exploits thread-level parallelism – flexible.
1. Uses thread-level parallelism – the architecture of choice for
   general-purpose multiprocessors.
2. Offers flexibility – can serve as a single-user multiprocessor or a
   multi-user multiprocessor.
3. Can build on the cost-performance advantages of off-the-shelf
   processors.
 A popular class of MIMD computers are clusters – built from standard
   nodes and standard technology.
 A computer cluster is a group of linked computers working together
   closely, in effect forming a single computer.

 The individual nodes are either commodities or customized:

 Commodity clusters
 Custom clusters

 A commodity cluster is a class of clusters where the nodes are truly
   commodities:
◦ headless workstations, motherboards, or blade servers
◦ connected with a SAN or LAN, usually accessible via an I/O bus.
 Focus on throughput

 No communication among threads

 Assembled by users or computer-centre directors.
 A custom cluster is an architecture where the nodes and the
   interconnect are customized and more tightly integrated than in a
   commodity cluster –
 distributed-memory or message-passing multiprocessors.

 Parallel applications exploit large amounts of parallelism in a
   single problem.

 Communication is required among threads.


 In the 1990s, the increasing capacity of a single chip made it
   possible to place multiple processors on a die – on-chip or
   single-chip multiprocessing – multicore.

 Multicore – multiple processor cores on a single die.

 The cores typically share some resources – a second- or third-level
   cache, memory, and I/O buses.
 MIMD – each processor has its own instruction stream, and each
   processor typically executes a different process.

 A process is a segment of code that runs independently.

 It is also useful for multiple processors to execute a single
   program, sharing code and address space – a THREAD.

 To take advantage of an MIMD processor with n processors, we need at
   least n threads or processes to execute.

 Threads may be independent processes or a few tens of iterations of
   a loop generated by a parallel compiler.

 To exploit TLP effectively, the grain size of a thread (the amount of
   computation assigned to it) must be substantial.

 Threads can also be used to exploit data-level parallelism, but the
   overhead is much higher than in a SIMD computer.
 MIMD falls into 2 categories:
1. Centralized shared-memory multiprocessors
2. Distributed-memory multiprocessors

 The number of processors involved dictates the memory organization
   and interconnect strategy.
[Figure: Centralized shared-memory multiprocessor – each of several
processors has one or more levels of cache, and all share a single
main memory and I/O system over a common bus.]

 A few dozen processor chips (<100)

 Single centralized memory

 Large caches

 Becomes less attractive as the number of processors increases.

 The single main memory has a symmetric relationship to all processors
   and a uniform access time from any processor – hence symmetric
   multiprocessors (SMPs).

 The architecture is called uniform memory access (UMA) – uniform
   latency from memory.
[Figure: Distributed-memory multiprocessor – each node contains a
processor with caches (P+C), local memory, and I/O, and the nodes are
connected by an interconnection network.]

 Large number of processors.

 Memory is distributed rather than centralized.

 Each processor has some local memory, I/O, and an interface to the
   interconnection network.
• Advantages:
• A cost-effective way to scale memory bandwidth, since most accesses
  are to the local memory in the node.
• Reduces latency for accesses to local memory.

• Disadvantages:
• Communication between processors is more complex.
• Requires more effort in software to take advantage of the increased
  memory bandwidth.
1. Communication through a shared address space.
2. An address space consisting of multiple private address spaces.
 Physically separate memories can be addressed as one logical address
   space.

 Called distributed shared-memory (DSM) architectures.
 Shared memory – the address space is shared, i.e. the same physical
   address on 2 processors refers to the same location in memory.

 Non-uniform memory access (NUMA) – the access time depends on the
   location of a data word in memory.
 Alternatively, the address space can consist of multiple private
   address spaces –

 logically disjoint and not addressable by a remote processor.

 In such machines, the same physical address on 2 different processors
   refers to 2 different locations in 2 different memories.
1. Limited parallelism

2. High cost of communication.


 Large multilevel caches reduce the memory bandwidth demands,
 so multiple processors are able to share the same memory.
 Small-scale multiprocessors: several processors share a single
   physical memory connected by a shared bus.
 Cost-effective.
 Earlier designs placed a processor and cache on a board.

 Later versions placed 2 to 4 processors per board, with multiple
   buses and interleaved memories to support faster processors.

 IBM – 1, AMD & Intel – 2, Sun T1 – 8.

 Symmetric shared-memory machines usually support the caching of both
   shared and private data.
 Private data:
◦ used by a single processor.
◦ When cached, the location migrates to that cache.
◦ Reduces average access time and the memory bandwidth required.
◦ No other processor uses it – behaves as in a uniprocessor.

 Shared data:
◦ used by multiple processors; communication among them happens
   through reads and writes of the shared data.
◦ When cached, the value may be replicated in multiple caches.
◦ Reduces access latency, the memory bandwidth required, and
   contention.
◦ New problem: cache coherence.
Time  Event                   Cache for CPU A  Cache for CPU B  Mem loc X
0                                                               1
1     CPU A reads X           1                                 1
2     CPU B reads X           1                1                1
3     CPU A stores 0 into X   0                1                0

• A memory system is coherent if any read of a data item returns the
  most recently written value of that data item.
 2 different aspects: a) coherence, b) consistency.

 Coherence defines what values can be returned by a read.

 Consistency determines when a written value will be returned by a
   read.
• P1: A read by a processor P to location X that follows a write by P
  to X, with no writes of X by another processor between the write and
  the read by P, always returns the value written by P.
• P2: A read by a processor to location X that follows a write by
  another processor to X returns the written value if the read and
  write are sufficiently separated in time and no other writes to X
  occur between the 2 accesses.
 P3: Writes to the same location are serialized: 2 writes to the same
   location by any 2 processors are seen in the same order by all
   processors.
 P1 preserves program order.
 P2 defines a coherent view of memory – if a processor could
   continuously read old data, the memory would be incoherent.
 P3 ensures the write serialization property.
 When is a written value seen?
 Example: a write of X on one processor precedes a read of X on
   another processor by a very small time – it may be impossible to
   ensure that the read returns the written value.
 These issues are dealt with in the memory consistency model.
 The two are complementary:
 Coherence defines the behavior of reads and writes to the same
   memory location.
 Consistency defines the behavior of reads and writes with respect to
   accesses to other memory locations.
1. A write does not complete until all processors have seen the effect
   of that write.
2. A processor does not change the order of any write with respect to
   any other memory access.

Summary: if a processor writes location A followed by location B, any
processor that sees the new value of B must also see the new value of
A – i.e., reads may be reordered, but writes complete in program order.


 Migration – the data item is moved to a local cache and used there.

 Replication – a cache makes a copy of the data item in the local
   cache.

 Support for migration and replication is critical to performance in
   accessing shared data.

 Rather than trying to solve the problem in software, we adopt a
   hardware solution – a protocol to maintain cache coherence.
 Cache coherence protocols:
◦ track the state of any sharing of a data block.
 Directory based:
 the sharing status of a block of physical memory is kept in just one
   location – the directory
 higher implementation overhead, but
 scales to larger processor counts.
• Snooping: every cache that has a copy of the data from a block of
  physical memory also has a copy of the sharing status of the block.

• No centralized state is kept.

• All caches are accessible via some broadcast medium (bus or switch).

• All cache controllers snoop on the medium to determine whether or
  not they have a copy of the block that is requested on a bus or
  switch access.
1. Write invalidate protocol

2. Write update / write broadcast protocol


 Write invalidate ensures a processor has exclusive access to a data
   item before it writes that item:
 it invalidates other copies on a write.
 It is the common protocol for both snooping and directory schemes.
Processor activity   Bus activity        CPU A cache  CPU B cache  Mem loc X
CPU A reads X        Cache miss for X    0                         0
CPU B reads X        Cache miss for X    0            0            0
CPU A writes 1 to X  Invalidation for X  1                         0
CPU B reads X        Cache miss for X    1            1            1
 A write requires exclusive access, and any copy held by the reading
   processor is invalidated.
 When that processor next reads, it misses in the cache and is forced
   to fetch a new copy of the data.
 The exclusive access on a write prevents any other processor from
   writing simultaneously.
 If 2 processors do try to write simultaneously, one of them wins.
 Write update: update all cached copies of a data item when that item
   is written.
 Broadcasts all writes to shared cache lines.
 Consumes more bandwidth.
 Symmetric shared memory architecture

 Distributed shared memory architecture


 Cache coherence
 Directory based & snooping protocols
 Different state transitions in cache

coherence
 Snooping protocol
– requires communication with all caches on every cache miss.
– The absence of a centralized data structure makes it inexpensive,
  but it does not scale.

 Alternative – the directory protocol:

 the state of every cache block is kept in a directory,
 together with a copy of the dirty bit;
 bandwidth demands are reduced.
 With a larger number of processors, a single directory becomes a
   bottleneck, because the directory maintains information for all
   processors.

 Solution: the directory is distributed along with the memory, but the
   sharing status of each block remains in a single known location.


 The protocol must implement 2 operations:
◦ 1. Handle a read miss.
◦ 2. Handle a write to a shared cache block.
◦ (Handling a write miss is a combination of 1 and 2.)

 To do the above operations, it must track the state of each cache
   block.
 A cache miss refers to a failed attempt to
read or write a piece of data in the cache,
which results in a main memory access
with much longer latency.

 There are three kinds of cache misses:


instruction read miss, data read miss, and
data write miss.
 A cache read miss from an instruction cache
generally causes the most delay, because the
processor, or at least the thread of execution, has
to wait (stall) until the instruction is fetched from
main memory.
 A cache read miss from a data cache
usually causes less delay, because
instructions not dependent on the cache
read can be issued and continue execution
until the data is returned from main
memory, and the dependent instructions
can resume execution.
 A cache write miss to a data cache generally
causes the least delay, because the write can be
queued and there are few limitations on the
execution of subsequent instructions. The
processor can continue until the queue is full.
 In a simple snooping coherence protocol the states are:

a) Shared – 1 or more processors have the block cached, and the value
   in memory is up to date.

b) Uncached/Invalid – no processor has a copy of the cache block.

c) Modified – exactly one processor has a copy and has written into
   it, so the memory copy is out of date. That processor is the owner
   of the block. These states and transitions are sketched in code
   below.
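
A minimal sketch of the three states in C, assuming a simple MSI-style
protocol; the type and event names are illustrative, not from any
particular machine:

    #include <stdio.h>

    /* Illustrative sketch: block states in a simple MSI snooping
       protocol and how one cache's copy reacts to local and snooped
       events. */
    typedef enum { INVALID, SHARED, MODIFIED } BlockState;

    typedef enum {
        CPU_READ, CPU_WRITE,           /* requests from the local CPU        */
        BUS_READ_MISS, BUS_WRITE_MISS  /* requests snooped from other caches */
    } Event;

    BlockState next_state(BlockState s, Event e) {
        switch (e) {
        case CPU_READ:                      /* a miss fetches the block      */
            return (s == INVALID) ? SHARED : s;
        case CPU_WRITE:                     /* gain exclusive ownership      */
            return MODIFIED;
        case BUS_READ_MISS:                 /* owner supplies data, demotes  */
            return (s == MODIFIED) ? SHARED : s;
        case BUS_WRITE_MISS:                /* another cache wants to write  */
            return INVALID;
        }
        return s;
    }

    int main(void) {
        /* a write by this CPU takes a Shared block to Modified */
        printf("%d\n", next_state(SHARED, CPU_WRITE));  /* prints 2 */
        return 0;
    }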
[Figure: Distributed-memory multiprocessor with a directory added to
each node – each node contains a processor with caches (P+C), memory,
I/O, and a directory, and the nodes are connected by an
interconnection network.]

 Track the state of each cache block.
 Also track which processors have a copy of that block, so they can
   be invalidated on a write.
 The simplest way to do this is to keep a bit vector for each block:
-- whether each processor has a copy of that block (shared state)
-- who the owner of the block is (exclusive state)
 The state of each cache block is also tracked at the individual
   caches.
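
A directory entry can then be as small as a state field plus the bit
vector. A sketch in C, assuming at most 64 processors so the sharer
set fits in one 64-bit word (all names are illustrative):

    #include <stdint.h>

    /* Illustrative directory entry for one memory block, assuming at
       most 64 processors so the sharer set fits in one bit vector.   */
    typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE } DirState;

    typedef struct {
        DirState state;
        uint64_t sharers;  /* bit i set => processor i holds a copy;     */
                           /* in EXCLUSIVE exactly one bit is set: owner */
    } DirEntry;

    static inline void add_sharer(DirEntry *e, int p)  { e->sharers |=  (1ULL << p); }
    static inline void drop_sharer(DirEntry *e, int p) { e->sharers &= ~(1ULL << p); }
    static inline int  is_sharer(const DirEntry *e, int p) { return (e->sharers >> p) & 1; }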
 The states and transitions are the same as in snooping; the actions
   are slightly different.
 Invalidating a copy and locating an exclusive copy of a data item
   are handled differently:
 both involve communication between the requesting node and the
   directory, and between the directory and 1 or more other nodes.
 In a snooping protocol, both steps are combined by the use of a
   broadcast to all nodes.
 Before going to the protocol diagrams, let us look at the different
   types of messages sent between processors and directories for the
   purpose of handling misses and maintaining coherence.
 Local node – the node where a request originates.
 Home node – the node where the memory location and the directory
   entry of an address reside.
 Remote node – a node that has a copy of the cache block, either
   exclusive or shared.

 A remote node may be the same as either the local or the home node –
   the basic protocol does not change, but interprocessor messages are
   replaced with intraprocessor messages.

 P = requesting processor number; A = requested address; D = data
   contents.
Msg type          Source        Dest          Contents  Function of message
Read miss         Local cache   Home dir      P, A      Processor P has a read miss at address A;
                                                        request the data and make P a read sharer.
Write miss        Local cache   Home dir      P, A      Processor P has a write miss at address A;
                                                        request the data and make P the exclusive
                                                        owner.
Invalidate        Local cache   Home dir      A         Request that invalidates be sent to all
                                                        remote caches that are caching the block
                                                        at address A.
Invalidate        Home dir      Remote cache  A         Invalidate a shared copy of the data at
                                                        address A.
Fetch             Home dir      Remote cache  A         Fetch the block at address A and send it
                                                        to its home directory; change the state of
                                                        A in the remote cache to shared.
Fetch/invalidate  Home dir      Remote cache  A         Fetch the block at address A and send it
                                                        to its home directory; invalidate the
                                                        block in the cache.
Data value reply  Home dir      Local cache   D         Return a data value from the home memory.
Data write back   Remote cache  Home dir      A, D      Write back a data value for address A.
 Assume a simple model of memory consistency.
 For minimum complexity, make these assumptions:
 messages will be received and acted on in the same order they are
   sent;
 to ensure this, invalidates sent by a processor are honored before
   new messages are transmitted (just as in snooping).
 The basic states are the same as in the snooping protocol: invalid,
   shared, modified.
 The state transition diagram does not represent all the details of
   the coherence protocol.

 The actual controller is highly dependent on a number of details,
   such as message delivery properties and buffering structure.
[State transition diagram for an individual cache block in the
directory-based protocol. States: Invalid, Shared (read only),
Modified (read/write). CPU read hits and write hits stay in the
current state. A CPU read miss in Invalid sends a read miss message
and moves to Shared; a CPU write sends a write miss message and moves
to Modified; a CPU write miss in Modified performs a data write back
and sends a write miss message. Requests arriving from the home
directory demote or evict the block: a fetch forces a data write back
and moves Modified to Shared; an invalidate moves the block to
Invalid; a fetch/invalidate forces a data write back and moves
Modified to Invalid.]
 The write miss operation, which was broadcast on the bus in the
   snooping protocol, is replaced by data fetch and invalidate
   operations that are selectively sent by the directory controller.
 The directory implements the other half of the coherence protocol.

 A message sent to the directory causes 2 actions:

i) update the directory state;

ii) send additional messages to satisfy the request.
 States of a memory block:
a) uncached – not cached by any node
b) shared – cached in multiple nodes and readable
c) exclusive – cached exclusively and writable in exactly 1 node

 Also, the directory must keep track of the processors that have a
   copy of the cache block – the 'sharers'.
[State transition diagram for the directory entry of a block. States:
Uncached, Shared (read only), Exclusive (read/write). A read miss in
Uncached sends a data value reply and sets sharers = {P}, moving to
Shared; a write miss in Uncached sends a data value reply and sets
sharers = {P}, moving to Exclusive. A read miss in Shared sends a data
value reply and adds the requester (sharers = sharers + {P}); a write
miss in Shared sends invalidates to the sharers, replies with the
data, sets sharers = {P}, and moves to Exclusive. A read miss in
Exclusive sends a fetch to the owner, replies with the data, adds P to
the sharers, and moves to Shared; a write miss in Exclusive sends
fetch/invalidate to the owner, replies with the data, and sets
sharers = {P}; a data write back empties the sharer set
(sharers = { }) and moves to Uncached.]
Block state  Status              Req  Processing
Uncached     Copy in memory is   RM   Requesting processor gets the data from memory;
             the current value        the requestor is the only sharing node;
                                      state -> shared.
                                 WM   Requesting processor gets the value and becomes
                                      the sharing node; state -> exclusive; P is the
                                      owner.
Shared       Memory value is     RM   Requesting processor gets the data from memory;
             up to date               the processor is added to the sharing set.
                                 WM   Requesting processor gets the data from memory;
                                      all sharers are sent invalidate messages;
                                      sharers = {P}; state -> exclusive.
Exclusive    Current value is    RM   The owner is sent a data fetch message; the
             with the owner           owner's state goes to shared; the owner sends
                                      the data to the directory, where it is written
                                      to memory and sent to P;
                                      sharers = sharers + {P}.
                                 WM   The block gets a new owner: the old owner is
                                      invalidated and sends the value to the
                                      directory, from where it is sent to the
                                      requesting processor P; sharers = {P}.
                                 DWB  The owner is replacing the block and must write
                                      it back; this makes the memory copy up to date;
                                      state -> uncached; sharers = { }.
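
The table maps almost directly onto code. A minimal sketch of the
directory controller's decision logic, reusing the DirEntry shape
sketched earlier, with print statements as stand-ins for the
interconnect messages and the GCC builtin __builtin_ctzll used to find
the owner (all names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE } DirState;
    typedef enum { READ_MISS, WRITE_MISS, DATA_WRITE_BACK } Request;
    typedef struct { DirState state; uint64_t sharers; } DirEntry;

    static uint64_t bit(int p)             { return 1ULL << p; }
    static int owner_of(const DirEntry *e) { return __builtin_ctzll(e->sharers); }

    /* stand-ins for the interconnect messages of the table above */
    static void data_value_reply(int p) { printf("data value reply -> P%d\n", p); }
    static void invalidate(int p)       { printf("invalidate -> P%d\n", p); }
    static void fetch(int p)            { printf("fetch -> P%d\n", p); }
    static void fetch_invalidate(int p) { printf("fetch/invalidate -> P%d\n", p); }

    void directory_handle(DirEntry *e, Request req, int P) {
        switch (e->state) {
        case UNCACHED:                      /* memory holds the current value */
            data_value_reply(P);
            e->sharers = bit(P);
            e->state   = (req == READ_MISS) ? SHARED_ST : EXCLUSIVE;
            break;
        case SHARED_ST:                     /* memory is up to date */
            if (req == READ_MISS) {
                data_value_reply(P);
                e->sharers |= bit(P);       /* add P to the sharing set */
            } else {                        /* WRITE_MISS */
                for (int p = 0; p < 64; p++)
                    if ((e->sharers & bit(p)) && p != P)
                        invalidate(p);      /* all sharers sent invalidates */
                data_value_reply(P);
                e->sharers = bit(P);
                e->state   = EXCLUSIVE;
            }
            break;
        case EXCLUSIVE:                     /* current value is with the owner */
            if (req == READ_MISS) {
                fetch(owner_of(e));         /* owner supplies data, demoted   */
                data_value_reply(P);
                e->sharers |= bit(P);
                e->state    = SHARED_ST;
            } else if (req == WRITE_MISS) {
                fetch_invalidate(owner_of(e));  /* old owner invalidated      */
                data_value_reply(P);
                e->sharers = bit(P);            /* block has a new owner      */
            } else {                            /* DATA_WRITE_BACK by owner   */
                e->sharers = 0;                 /* memory copy now up to date */
                e->state   = UNCACHED;
            }
            break;
        }
    }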
 Mechanism: instruction sequences that can atomically retrieve (read)
   and change (modify) a value.

 Synchronization operations:
1. Atomic exchange
2. Test & set
3. Fetch & increment
 Interchanges a value in a register for a value in memory.

 We can build a simple lock where the exchange operation is atomic:

◦ lock = 0 if the lock is free

◦ lock = 1 if the lock is unavailable

 If two processors try to do the exchange simultaneously, the first
   returns 0 and the second returns 1 (see the C sketch below).
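
A minimal sketch of this lock using C11 atomics, assuming a C11
compiler (atomic_exchange is the standard library's atomic exchange):

    #include <stdatomic.h>

    atomic_int lock = 0;   /* 0 = free, 1 = unavailable */

    void acquire(void) {
        /* whoever swaps in a 1 and gets 0 back wins the lock */
        while (atomic_exchange(&lock, 1) == 1)
            ;              /* someone else holds it: retry */
    }

    void release(void) {
        atomic_store(&lock, 0);   /* hand the lock back */
    }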
 Similar to atomic exchange.
 Tests a value and sets it if the value passes the test – e.g., test
   whether the value is 0 and, if so, set it to 1 (sketch below).
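
C11 provides this primitive directly as atomic_flag_test_and_set; a
sketch:

    #include <stdatomic.h>

    atomic_flag flag = ATOMIC_FLAG_INIT;

    void ts_acquire(void) {
        /* returns the old value: false means the flag was clear and
           this caller is the one that set it, i.e. the test passed  */
        while (atomic_flag_test_and_set(&flag))
            ;
    }

    void ts_release(void) {
        atomic_flag_clear(&flag);
    }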
 Returns the value of a memory location and atomically increments it.

 A value of 0 is used to indicate that the synchronization variable is
   unclaimed (sketch below).
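
In C11 this maps directly to atomic_fetch_add; a sketch:

    #include <stdatomic.h>

    atomic_uint counter = 0;

    unsigned fetch_and_increment(void) {
        /* returns the old value and bumps the location in one atomic step */
        return atomic_fetch_add(&counter, 1);
    }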
 Implementing a single atomic memory operation poses some challenges,

 since it requires both a memory read and a write in a single
   uninterruptible (atomic) instruction.

 This complicates the implementation of coherence:

 the hardware can allow no other operations between the read and the
   write, and yet must not deadlock.
 Alternative solution:
◦ have a pair of instructions where the 2nd instruction returns a
   value from which it can be deduced whether the pair was executed as
   if the instructions were atomic.
 When an instruction pair is effectively atomic, no other processor
   can change the value between the two instructions.
◦ Any other access is either before the pair or after the pair.

 The pair of instructions:
1. load linked / load locked
2. store conditional
 The store conditional fails:
 if the contents of the memory location specified by the load linked
   are changed before the store conditional to the same address
   occurs, or
 if the processor does a context switch between the 2 instructions.
 Store conditional is defined to return 1 if it succeeds and 0
   otherwise (sketch below).
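
C11 does not expose load linked/store conditional directly, but on
machines that have them (e.g. ARM, RISC-V, POWER)
atomic_compare_exchange_weak is normally compiled to an LL/SC pair,
and its permitted spurious failure corresponds to a failed store
conditional. A sketch of atomic exchange built on that retry loop:

    #include <stdatomic.h>

    int ll_sc_style_exchange(atomic_int *loc, int new_value) {
        int old = atomic_load(loc);   /* plays the role of load linked */
        /* on failure, 'old' is refreshed with the current value and we
           retry, just like re-executing the LL/SC pair                */
        while (!atomic_compare_exchange_weak(loc, &old, new_value))
            ;
        return old;
    }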
 Spin locks – locks that a processor continuously tries to acquire,
   spinning around a loop until it succeeds.
 When are they used?
- when programmers expect the lock to be held for a very short amount
  of time
- when they want the process of locking to be low latency.
 Keep the lock variables in memory (no cache coherence).

 A processor could continually try to acquire the lock using an atomic
   operation (exchange),

 testing whether the exchange returned the lock as free.

 To release the lock, the processor simply stores the value 0 to the
   lock.
        DADDUI R2,R0,#1      ; R2 = 1 (the "locked" value)
lockit: EXCH   R2,0(R1)      ; atomic exchange with the lock at 0(R1)
        BNEZ   R2,lockit     ; already locked? spin
 If our multiprocessor supports cache coherence, we can cache the
   locks, using the coherence mechanism to maintain the lock value
   coherently – spinning on a local cached copy, as sketched below.
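
A sketch of the cached spin lock in C11 atomics: spin on an ordinary
read of the locally cached copy (which generates no bus traffic once
the block is in the cache), and attempt the expensive atomic exchange
only when the lock is observed free:

    #include <stdatomic.h>

    atomic_int lock2 = 0;   /* 0 = free, 1 = held */

    void acquire_cached(void) {
        for (;;) {
            while (atomic_load(&lock2) != 0)
                ;                          /* spin on the local cached copy */
            if (atomic_exchange(&lock2, 1) == 0)
                return;                    /* won the race for the lock */
            /* lost the race: another processor grabbed it first, spin again */
        }
    }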
 Coherence ensures that multiprocessors see a consistent view of
   memory.
◦ But how consistent must they be?

 Consistent – when must a processor see a value that has been updated
   by another processor?

 The strictest answer is the sequential consistency model.
 Sequential consistency: the result of any execution is the same as
   if:
◦ the accesses executed by each processor were kept in order, and
◦ the accesses among different processors were interleaved.

 That is, there exists some interleaving that would lead to the same
   result on a uniprocessor
◦ or on a multiprocessor with no caches, no write buffers, and only a
   single centralized memory.
 We need to guarantee that a write/read completes before any other
   access by the same processor.
 A write completes only when all invalidations have reached their
   destinations.
 This implies that write buffers cannot be used (writes cannot, in
   general, be delayed); the litmus test below shows why.
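
The classic litmus test makes this concrete. Under sequential
consistency, the outcome r1 == 0 and r2 == 0 is impossible, because
some interleaving must execute one of the stores first; a write buffer
that lets each load bypass the processor's own buffered store allows
exactly that forbidden outcome. A sketch in C11 atomics, where relaxed
ordering stands in for the reordering such hardware might do:

    #include <stdatomic.h>

    atomic_int A = 0, B = 0;
    int r1, r2;

    void p1(void) {                  /* runs on processor 1 */
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&B, memory_order_relaxed);
    }

    void p2(void) {                  /* runs on processor 2 */
        atomic_store_explicit(&B, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&A, memory_order_relaxed);
    }
    /* with memory_order_seq_cst instead of relaxed, (r1,r2) == (0,0)
       cannot occur -- that is sequential consistency                 */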
1. Develop ambitious implementations that preserve sequential
   consistency but use latency-hiding techniques to reduce the
   penalty.

2. Use less restrictive consistency models that allow for faster
   hardware.
 Sequential consistency:
 Disadvantage: performance.
 Advantage: simplicity for the programmer.

 An efficient implementation of sequential consistency can assume
   that programs are synchronized.
 Synchronized programs protect all access to shared locations through
   synchronization operations.

 More formally:
◦ in every possible execution, for every shared data item,
◦ a write by one processor and an access (read or write) by another
   processor
◦ are separated by a pair of synchronization operations,
◦ one executed after the write and one before the access by the 2nd
   processor.

 That is, the program is data-race-free (example below).

 A data race is a case where variables may be updated without
   ordering by synchronization; the outcome is unpredictable, since it
   depends on the relative speed of the processors.
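
A minimal sketch of a synchronized (data-race-free) program in C11:
every access to the shared counter is bracketed by a lock acquire and
release, so any write by one thread and access by another are
separated by a pair of synchronization operations. Removing the lock
calls would make it a data race:

    #include <threads.h>

    mtx_t m;
    long count = 0;

    int worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            mtx_lock(&m);      /* synchronization op before the access */
            count++;           /* ordinary access, now race-free       */
            mtx_unlock(&m);    /* synchronization op after the write   */
        }
        return 0;
    }

    int main(void) {
        thrd_t t1, t2;
        mtx_init(&m, mtx_plain);
        thrd_create(&t1, worker, NULL);
        thrd_create(&t2, worker, NULL);
        thrd_join(t1, NULL);
        thrd_join(t2, NULL);
        /* count is exactly 200000 in every execution */
        return 0;
    }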

 A broadly accepted observation: most programs are synchronized.
 Sequential consistency guarantees uniprocessor-like behavior for any
   program.
◦ This is true for synchronized programs too.
 But sequential consistency is not necessary for uniprocessor-like
   behavior of synchronized programs.

 So we can define looser consistency models
◦ that can be implemented more efficiently than sequential
   consistency.
 Key idea:
◦ allow reads and writes to complete out of order;
◦ use synchronization operations to enforce ordering;
◦ so the program still behaves as if sequentially consistent.

 X --> Y
◦ X must complete before Y is done.
 Four possible orderings:
◦ R --> R, R --> W, W --> W, W --> R

 Sequential consistency guarantees that all four orderings are
   preserved.
1. Total store ordering / processor consistency (relaxes W --> R)
2. Partial store order (also relaxes W --> W)
3. Weak ordering (also relaxes R --> W and R --> R)

A release/acquire sketch of the key idea follows.
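
The release/acquire pair of C11 shows the key idea in miniature:
ordinary accesses may be reordered, but the synchronization operations
impose the one ordering the program needs (names are illustrative):

    #include <stdatomic.h>

    int payload;             /* ordinary shared data         */
    atomic_int ready = 0;    /* the synchronization variable */

    void producer(void) {
        payload = 42;        /* may be reordered freely...         */
        atomic_store_explicit(&ready, 1, memory_order_release);
                             /* ...but not past this release store */
    }

    void consumer(void) {
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;                /* acquire load: once 1 is seen,           */
        int v = payload;     /* the write to payload is visible; v==42 */
        (void)v;
    }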
Fallacies:
1. Amdahl's law doesn't apply to parallel computers.
2. Linear speedups are needed to make multiprocessors cost-effective.
3. Scalability is almost free.
Pitfalls:
1. Measuring the performance of multiprocessors by linear speedup
   rather than execution time.
2. Not developing the software to take advantage of, or optimize for,
   a multiprocessor architecture.
[State transition diagram for the CPU side of a snooping write
invalidate protocol. States: Invalid, Shared (read only), Exclusive
(read/write). A CPU read in Invalid places a read miss on the bus and
moves to Shared; a CPU write places a write miss on the bus and moves
to Exclusive; a CPU write to a Shared block places an invalidate on
the bus and moves to Exclusive; a CPU read miss in Exclusive writes
the block back and places a read miss on the bus; CPU read hits and
write hits stay in the current state.]
[State transition diagram for the bus side of the snooping protocol.
A write miss or invalidate snooped for this block moves it to Invalid;
a read miss snooped while the block is Exclusive forces a write back
of the block, aborting the memory access, and moves it to Shared; a
write miss snooped while the block is Exclusive forces a write back,
aborting the memory access, and moves the block to Invalid.]