V. Models of Parallel Computers - After PRAM and Early Models
Content
dataflow architectures
systolic architectures
circuit model
graph model
LogP, LogGP
message-passing paradigm
levels of parallelism
Dataflow architectures
Aim:
make the aspects of a parallel computation explicit at the machine level
The idea:
The program is represented by a graph of essential data dependences.
When an operand token arrives, the token store is checked for its partner:
if present, the matching token is extracted and the instruction is issued for execution;
if not, the token is placed in the store to await its partner.
Systolic Architectures
Circuit Model
communication step time much larger than that of the mesh- and torus-based architectures.
Circuit Model
In the case of 2D meshes, the area lower bound will be linear in the
number p of processors.
Power consumption of digital circuits is another limiting factor:
Power dissipation in modern microprocessors grows ~ linearly with the product of die area & clock frequency (both rising)
and today stands at a few tens of watts in high-performance designs.
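One way to see the scaling claim (added here purely for illustration): the standard first-order model of dynamic (switching) power in CMOS circuits is

P_dyn ≈ a · C · V² · f

where a is the switching activity factor, C the total switched capacitance (which grows with die area), V the supply voltage, and f the clock frequency; at fixed a and V, power therefore grows roughly with the product of die area and clock frequency.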
Graph Models
A distributed-memory architecture is characterized primarily by the network.
The network is usually represented as a graph
vertices corresponding to processor-memory nodes and
edges corresponding to communication links.
2.
Bisection (band)width:
3.
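As an illustrative aside (not from the slides; the topologies and sizes are chosen arbitrarily), the standard closed-form diameter and bisection width of a few common networks can be tabulated in a few lines of C:

/* Sketch: diameter and bisection width of common p-processor networks,
   using the standard closed-form results for a k x k mesh, a k x k torus
   (p = k*k), and a d-dimensional hypercube (p = 2^d). */
#include <stdio.h>

int main(void) {
    int k = 8;    /* 8 x 8 mesh/torus, p = 64 */
    int d = 6;    /* 6-dimensional hypercube, p = 64 */

    printf("2D mesh  : diameter = %d, bisection width = %d\n", 2 * (k - 1), k);
    printf("2D torus : diameter = %d, bisection width = %d\n", 2 * (k / 2), 2 * k);
    printf("hypercube: diameter = %d, bisection width = %d\n", d, 1 << (d - 1));
    return 0;
}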
Algorithm design must be done virtually from scratch for each new architecture.
Alternative: abstract away the effects of the interconnection topology (as PRAM does for global-memory machines) in order to free the algorithm designer from a lot of machine-specific details.
Such models have been shown to capture the effect of interconnection topology fairly accurately.
The LogP model uses four parameters:
L: an upper bound on the latency, or delay, incurred in sending a message from its
source processor to its target processor.
o: the overhead: the length of time that a processor is engaged in the transmission or
reception of each message - during this time the processor cannot perform other operations.
g: the gap between messages: the minimum time interval between consecutive
message transmissions or consecutive message receptions at a processor.
P: the number of processors.
LogP model
At most ⌈L/g⌉ messages can be in transit from any processor or to any processor at any time.
If a processor attempts to transmit a message that would exceed this limit, it stalls until
the message can be sent without exceeding the capacity limit.
processors work asynchronously, and the latency experienced by any message is
unpredictable but is bounded above by L in the absence of stalls.
Algorithms that communicate data infrequently can ignore the bandwidth and capacity limits.
If messages are sent in long streams pipelined through the network (transmission time
is dominated by the inter-message gaps) the latency may be disregarded.
In some MPPs the overhead dominates the gap, so g can be eliminated.
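A small illustrative calculation under these assumptions (g >= o, no capacity stalls; the parameter values are the ones used in the broadcast figure below): a single message costs L + 2o end to end, while the last of n back-to-back messages arrives at roughly (n-1)·g + L + 2o, which is why the gap dominates long streams and the latency dominates isolated messages.

/* Sketch: LogP cost of one message vs. a pipelined stream of n messages. */
#include <stdio.h>

int main(void) {
    int L = 6, o = 2, g = 4;   /* example LogP parameters */
    int n = 100;

    int single = L + 2 * o;                /* send overhead + latency + receive overhead */
    int stream = (n - 1) * g + L + 2 * o;  /* last of n back-to-back messages completes here */

    printf("single message        : %d cycles\n", single);
    printf("stream of %d messages: %d cycles (about %d per message)\n", n, stream, stream / n);
    return 0;
}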
Idea: all processors that have received the data unit transmit it as quickly as
possible, while ensuring that no processor receives more than one message.
The source of the broadcast begins transmitting the data unit at time 0.
The first data unit enters the network at time o, takes L cycles to arrive at the
destination, and is received by the processor at time L + 2o.
Meanwhile the source will initiate transmissions to other processors at times g, 2g, ...
(assuming g ≥ o), each of which acts as the root of a smaller broadcast tree.
The optimal broadcast tree for p processors is unbalanced with the fan-out at
each node determined by the relative values of L, o, and g.
Figure: optimal broadcast tree for P = 8, L = 6, g = 4, and o = 2.
The number at each node: the time at which it has received the data unit and can begin sending it on.
The processing overhead of successive transmissions overlaps the delivery of previous messages.
Processors may experience idle cycles at the end of the algorithm while the last few
messages are in transit.
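The schedule described above can be sketched as a greedy simulation (an illustration, not course code): every informed processor keeps transmitting, and each step picks the sender whose next message would arrive earliest. With the parameters of the figure it finishes the broadcast at time 24, though the exact tree shape may differ from the one drawn.

/* Sketch: greedy construction of a LogP broadcast schedule.
   Parameters match the figure above: P = 8, L = 6, g = 4, o = 2. */
#include <stdio.h>
#include <limits.h>

#define NPROC 8

int main(void) {
    const int L = 6, o = 2, g = 4;
    int ready[NPROC];      /* time at which processor i holds the datum and may forward it */
    int next_send[NPROC];  /* earliest time processor i may start its next transmission */
    int have = 1;          /* processors 0..have-1 already hold the datum */

    ready[0] = 0;
    next_send[0] = 0;

    while (have < NPROC) {
        /* pick the sender whose next message would arrive earliest */
        int best = 0, best_arrival = INT_MAX;
        for (int i = 0; i < have; i++) {
            int start = next_send[i] > ready[i] ? next_send[i] : ready[i];
            int arrival = start + o + L + o;   /* send overhead + latency + receive overhead */
            if (arrival < best_arrival) { best_arrival = arrival; best = i; }
        }
        int start = next_send[best] > ready[best] ? next_send[best] : ready[best];
        next_send[best] = start + g;           /* sender must respect the gap g before its next send */
        ready[have] = best_arrival;
        next_send[have] = best_arrival;
        printf("processor %d receives the datum at time %d\n", have, best_arrival);
        have++;
    }
    return 0;
}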
1.
2.
Otherwise, the last step performed by the root processor (at time T - 1) is to add a value it has
computed locally to a value it just received from another processor.
The remote processor must have sent the value at time T - 1 - L - 2o, and we assume
recursively that it forms the root of an optimal summation tree with this time bound.
(see the textbook)
Principles
The message-passing paradigm is one of the oldest and most widely used approaches for
programming parallel computers.
2. Each interaction requires cooperation of two processes:
the process that has the data and
the process that wants to access the data.
Each data element must belong to one of the partitions of the space
its roots can be traced back to the early days of parallel processing
it has been widely adopted
The programmer is fully aware of all the costs of nonlocal interactions, and is
more likely to think about algorithms (and mappings) that minimize interactions.
The paradigm can be efficiently implemented on a wide variety of architectures.
Disadvantage:
For dynamic and/or unstructured interactions, the complexity of the code written
in this paradigm can be very high.
Programming issues
The paradigm supports execution of a different program on each of the processes.
ocesses
In SPMD programs the code executed by different processes is identical except for a small
number of processes (e.g., the "root" process).
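A minimal SPMD sketch, assuming MPI as the concrete message-passing library (the slides use generic operations): every process runs the same program, and only the root (rank 0) takes an extra branch.

/* Minimal SPMD sketch: identical code on every process, with a small
   root-specific branch (rank 0). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("root: running with %d processes\n", size);   /* root-only code */
    printf("hello from process %d\n", rank);                  /* common code */

    MPI_Finalize();
    return 0;
}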
In their simplest form, the prototypes of the basic send and receive operations are:
send(void *sendbuf, int nelems, int dest)
receive(void *recvbuf, int nelems, int source)
P0:
    a = 100;
    send(&a, 1, 1);
    a = 0;
P1:
    receive(&a, 1, 0);
    printf("%d\n", a);
Hardware support:
network interfaces allow the transfer from buffer memory to the desired location without CPU intervention
DMA allows copying of data from one memory location to another without CPU support
If send returns before the communication operation has been accomplished, P1
might receive the value 0 in a instead of 100!
1.
The sending operation blocks until it can guarantee that the semantics
will not be violated on return irrespective of what happens in the
program subsequently.
There are two mechanisms by which this can be achieved:
1. Blocking Non-Buffered Send/Receive.
2. Blocking Buffered Send/Receive.
Blocking Non-Buffered Send/Receive
The send operation does not return until the matching receive has
been encountered at the receiving process.
Then the message is sent and the send operation returns upon
completion of the communication operation.
Involves a handshake between the sending and receiving processes.
In an asynchronous environment, this may be impossible to predict.
Blocking Buffered Send/Receive
The sender:
has a buffer preallocated for communicating messages
copies the data into the designated buffer
returns after the copy operation has been completed
can continue with the program knowing that any changes to the data will
not impact program semantics.
The actual communication can be accomplished in many ways
depending on the available hardware resources.
If the hardware supports asynchronous communication (independent of the
CPU), then a network transfer can be initiated after the message has
been copied into the buffer.
Receiving end:
the data is copied into a buffer at the receiver as well.
When the receiving process encounters a receive operation, it checks
to see if the message is available in its receive buffer.
If so, the data is copied into the target location.
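As one concrete realization of this buffered protocol (a sketch assuming MPI; the buffer sizing is illustrative), MPI_Bsend copies the message into a user-attached buffer and returns, so the sender may immediately reuse its variable without changing what the receiver sees:

/* Sketch of a buffered blocking send with MPI (run with at least two
   processes, e.g. mpirun -np 2). The send copies the data into the
   attached buffer and returns; the matching receive may be posted later. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
    char *buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);          /* preallocated communication buffer */

    if (rank == 0) {
        int a = 100;
        MPI_Bsend(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* returns after the copy */
        a = 0;                                /* safe: the buffered copy is what gets sent */
    } else if (rank == 1) {
        int a;
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%d\n", a);                    /* prints 100 */
    }

    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);
    MPI_Finalize();
    return 0;
}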
Non-Blocking Send/Receive
A process wishing to send data to another simply posts a pending message and returns to
the user program; the program can then do other useful work.
When the corresponding receive is posted, the communication operation is initiated.
When this operation is completed, the check-status operation indicates that it is safe for the
programmer to touch this data.
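A corresponding non-blocking sketch, again assuming MPI: the send is posted and returns at once, other work may proceed, and MPI_Wait (or MPI_Test) plays the role of the check-status operation described above.

/* Sketch of non-blocking message passing with MPI (run with at least two
   processes). MPI_Isend posts the send and returns; MPI_Wait / MPI_Test
   report when the buffer may be reused. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int a = 100;
        MPI_Request req;
        MPI_Isend(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);  /* post and return */
        /* ... other useful work that does not touch 'a' ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* now it is safe to modify 'a' again */
        a = 0;
    } else if (rank == 1) {
        int a;
        MPI_Request req;
        MPI_Irecv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* data is valid only after completion */
        printf("%d\n", a);                   /* prints 100 */
    }

    MPI_Finalize();
    return 0;
}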
Levels of parallelism
Levels of parallelism that are possible within a single computer program
Levels of parallelism combined with the basic parallel processor configurations
Parallelism levels
1. Microparallelization
2. Medium-grain parallelization
3. Coarse-grain parallelization
4. Grid-level parallelization
Microparallelization
takes place inside a single processor
does not require the intervention of the programmer to implement.
Medium-grain parallelization
associated with language-supported or loop-level parallelization.
While some headway has been made in automating this level of
parallelization with optimizing compilers, the results of these attempts
are only moderately satisfactory.
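A typical instance of this loop-level style, assuming OpenMP on a shared-memory machine (one common form of language-supported parallelization; the loop and arrays are illustrative):

/* Loop-level (medium-grain) parallelization sketch using an OpenMP directive. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    double sum = 0.0;

    /* the compiler/runtime distributes loop iterations across threads */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
        sum += c[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}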
Coarse-grain parallelization
associated with distributed memory parallel computers
is almost exclusively introduced by the programmer.
Grid-level parallelization
currently the focus of intensive research
a very promising model for solving large problems,
but its applicability is limited to certain classes of computational problems
belonging to the large-scale embarrassingly parallel category.
Microparallelism
Medium-grain Parallelism
Coarse-level parallelism
Grid Parallelism
Extremely heterogeneous system
Requires the coarsest level of parallelization:
Examples of successfully tested tasks: