PC 02 Parallel Algorithms
A. Legrand
October 2, 2011
Part I: Network Models

Outline
- P2P Communication: Hockney, LogP and Friends, TCP
- Modeling Concurrency: Multi-port, Single-port (Pure and Full Duplex), Flows
- Imperfection
- Topology: A Few Examples, Virtual Topologies
Motivation
- Scientific computing: large needs in computation or storage resources.
- Need to use systems with "several processors":
  - Parallel computers with shared/distributed memory
  - Clusters
  - Heterogeneous clusters
  - Clusters of clusters
  - Networks of workstations
  - The Grid
  - Desktop Grids
- When modeling a platform, communication modeling seems to be the most controversial part.
- Two kinds of people produce communication models: those who are concerned with scheduling and those who are concerned with performance evaluation.
- All these models are imperfect and intractable.
UET-UCT
Unit Execution Time - Unit Communication Time.
Hem... This one is mainly used by scheduling theoreticians to prove that their problem is hard and to know whether there is some hope of proving some clever result or not.
Some people have introduced a model with a cost of ε for local communications and 1 for communications with the outside.
"Hockney" Model
Hockney [Hoc94] proposed the following model for performance evaluation of the Paragon. A message of size m from Pi to Pj requires:

    t_{i,j}(m) = L_{i,j} + m / B_{i,j}

In scheduling, there are three types of "corresponding" models:
- Communications are not "splittable" and each communication k is associated with a communication time t_k (accounting for message size, latency, bandwidth, middleware, ...).
- Communications are "splittable" but latency is considered to be negligible (linear divisible model):

    t_{i,j}(m) = m / B_{i,j}

- Communications are "splittable" and latency cannot be neglected (affine divisible model):

    t_{i,j}(m) = L_{i,j} + m / B_{i,j}
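As a quick illustration of these cost functions, here is a minimal C sketch that tabulates the affine (Hockney) and purely linear costs; the latency and bandwidth values are made-up placeholders, not measurements from the slides.

    #include <stdio.h>

    /* Illustrative sketch of the Hockney-style cost functions above.
     * L and B below are arbitrary placeholder values, not measured ones. */
    static double t_affine(double L, double B, double m) { return L + m / B; } /* latency + m/bandwidth */
    static double t_linear(double B, double m)           { return m / B; }     /* latency neglected     */

    int main(void) {
        double L = 50e-6;        /* assumed 50 us latency           */
        double B = 125e6;        /* assumed 1 Gb/s = 125 MB/s       */
        for (double m = 1; m <= 1e7; m *= 100)
            printf("m = %8.0f B   affine = %.6f s   linear = %.6f s\n",
                   m, t_affine(L, B, m), t_linear(B, m));
        return 0;
    }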
LogP
The LogP model [CKP+96] is defined by 4 parameters:
- L is the network latency
- o is the middleware overhead (message splitting and packing, buffer management, connection, ...) for a message of size w
- g is the gap (the minimum time between two consecutive packets) between two messages of size w
- P is the number of processors/modules

[Figure: time diagram of a transfer; the sender pays o per packet, the sender's card injects packets every g, the network adds L, and the receiver's card and the receiver pay g and o per packet.]

- Sending m bytes with packets of size w: 2o + L + ⌈m/w⌉ · max(o, g)
- Occupation on the sender and on the receiver: o + L + (⌈m/w⌉ − 1) · max(o, g)
LogGP & pLogP
The previous model works fine for short messages. However, many parallel machines have special support for long messages, hence a higher bandwidth. LogGP [AISS97] is an extension of LogP where G captures the bandwidth for long messages:
- short messages: 2o + L + ⌈m/w⌉ · max(o, g)
- long messages: 2o + L + (m − 1)G
There is no fundamental difference...
OK, it works for small and large messages. Does it work for average-size messages? pLogP [KBV00] is an extension of LogP where L, o and g depend on the message size m. It also introduces a distinction between o_s and o_r (send and receive overheads). This is more and more precise, but concurrency is still not taken into account.
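The following small C sketch just evaluates the two send-time expressions quoted above so they can be compared; all parameter values (L, o, g, G, w) are invented for illustration and are not taken from the referenced papers.

    #include <math.h>
    #include <stdio.h>

    /* Sketch: LogP vs. LogGP send-time formulas from the slide.
     * All parameter values are illustrative assumptions.         */
    static double logp_send(double L, double o, double g, double m, double w) {
        return 2*o + L + ceil(m / w) * fmax(o, g);      /* short-message formula */
    }
    static double loggp_send(double L, double o, double G, double m) {
        return 2*o + L + (m - 1) * G;                   /* long-message formula  */
    }

    int main(void) {
        double L = 5e-6, o = 2e-6, g = 4e-6, G = 1e-8, w = 1024;
        for (double m = 64; m <= 1048576; m *= 16)
            printf("m = %8.0f   LogP = %.6f s   LogGP = %.6f s\n",
                   m, logp_send(L, o, g, m, w), loggp_send(L, o, G, m));
        return 0;
    }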
Bandwidth as a Function of Message Size
With the Hockney model, the effective bandwidth is m / (L + m/B).

[Figure: measured bandwidth (Mbit/s) vs. message size (1 B to 16 MB) for MPICH 1.2.6 with and without optimization.]
What About TCP-based Networks?
The previous models work fine for parallel machines. Most networks use TCP, which has fancy flow-control mechanisms and slow start. Is it valid to use an affine model for such networks?
The answer seems to be yes, but the latency and bandwidth parameters have to be carefully measured [LQDB05].
- Probing for m = 1 b and m = 1 Mb leads to bad results.
- The whole middleware layers should be benchmarked (theoretical latency is useless because of middleware, theoretical bandwidth is useless because of middleware and latency).
The slow start does not seem to be too harmful.
Most people forget that the round-trip time has a huge impact on the bandwidth.
Multi-port
- A given processor can communicate with as many other processors as it wishes without any degradation.
- This model is widely used by scheduling theoreticians (think about all DAG-with-communications scheduling problems) to prove that their problem is hard and to know whether there is some hope of proving some clever result or not.
  This model is borderline, especially when allowing duplication, when one communicates with everybody at the same time, or when trying to design algorithms with super tight approximation ratios. Frankly, such a model is totally unrealistic.
- Using MPI and synchronous communications, it may not be an issue. However, with multi-core, multi-processor machines, it cannot be ignored...

[Figure: triangle of nodes A, B, C; under the multi-port model each pairwise communication takes 1 s (numbers in s).]
Bounded Multi-port
- Assume now that we have threads or multi-core processors. We can bound the sum of the throughputs of all communications (incoming and outgoing). Such a model is OK for wide-area communications [HP04].
- Remember, the bounds due to the round-trip time must not be forgotten!

[Figure: triangle of nodes A, B, C; under the bounded multi-port model (bound β) each communication gets throughput β/2 (numbers in Mb/s).]
Single-port (Pure)
- A process can communicate with only one other process at a time. This constraint is generally written as a constraint on the sum of communication times and is thus rather easy to use in a scheduling context (even though it makes problems more complex).
- This model makes sense when using non-threaded versions of communication libraries (e.g., MPI). As soon as you're allowed to use threads, bounded multi-port seems a more reasonable option (both for performance and scheduling complexity).

[Figure: triangle of nodes A, B, C; under the pure 1-port model each of the three communications gets 1/3 of the time (numbers in s).]
Single-port (Full-Duplex)
At a given time, a process can be engaged in at most one emission and one reception. This constraint is generally written as two constraints: one on the sum of incoming communication times and one on the sum of outgoing communication times.

[Figure: triangle of nodes A, B, C; under the full-duplex 1-port model each communication gets throughput 1/2 (numbers in Mb/s).]
Single-port (Full-Duplex)
This model somehow makes sense when using networks like Myrinet that have few multiplexing units and with protocols without flow control [Mar07].
Even if it does not model complex situations well, such a model is not harmful.
Fluid Modeling
When using TCP-based networks, it is generally reasonable to use flows to model bandwidth sharing [MR99, Low03]. The rates ρ_r of the flows r ∈ R must respect the link capacities:

    ∀l ∈ L,  Σ_{r ∈ R : l ∈ r} ρ_r ≤ c_l

Different objectives correspond to different sharing policies (and protocols):
- Income Maximization: maximize Σ_{r ∈ R} ρ_r
- Max-Min Fairness: maximize min_{r ∈ R} ρ_r   (ATM)
- Proportional Fairness: maximize Σ_{r ∈ R} log(ρ_r)   (TCP Vegas)
- Potential Delay Minimization: minimize Σ_{r ∈ R} 1/ρ_r
- Some weird function: minimize Σ_{r ∈ R} arctan(ρ_r)   (TCP Reno)
Flows Extensions
- Note that this model is a multi-port model with capacity constraints (like the previous bounded multi-port).
- When latencies are large, using multiple connections makes it possible to get more bandwidth. As a matter of fact, there is very little to lose in using multiple connections...
- Therefore many people enforce a sometimes artificial (but less intrusive) bound on the maximum number of connections per link [Wag05, MYCR06].
Remember: This is a Model, Hence Imperfect
- The previous sharing models are nice, but you generally do not know the other flows...
- Communications use the memory bus and hence interfere with computations. Taking such interferences into account may become more and more important with multi-core architectures.
- Interferences between communications are sometimes... surprising.
Modeling is an art. You have to know your platform and your application to know what is negligible and what is important. Even if your model is imperfect, you may still derive interesting results.
Various Topologies Used in the Literature
[Figure: examples of topologies used in the literature.]
Beyond MPI_Comm_rank()?
- So far, MPI gives us a unique number for each processor.
- With this one can do anything.
- But it's pretty inconvenient, precisely because one can do anything with it.
- Typically, one likes to impose constraints about which processor/process can talk to which other processor/process.
- With this constraint, one can then think of the algorithm in simpler terms:
  - There are fewer options for communications between processors.
  - So there are fewer choices for implementing an algorithm.
(Slides courtesy of Henri Casanova)
Virtual Topologies?
- MPI provides an abstraction over physical computers:
  - Each host has an IP address.
  - MPI hides this address behind a convenient number (the rank).
  - There could be multiple such numbers mapped to the same IP address.
  - All "numbers" can talk to each other.
- A virtual topology provides an abstraction over MPI:
  - Each process has a number, which may be different from the MPI number.
  - There are rules about which "numbers" a "number" can talk to.
- A virtual topology is defined by specifying the neighbors of each process.
Implementing a Virtual Topology
Example: a binary tree built over the MPI ranks 0 1 2 3 4 5 6 7.

[Figure: tree with nodes labeled (0,0); (1,0), (1,1); (2,0), (2,1), (2,2), (2,3); (3,0).]

(i,j) = (floor(log2(rank+1)), rank - 2^max(i,0) + 1)
rank  = j - 1 + 2^max(i,0)

my_parent(i,j)      = (i-1, floor(j/2))
my_left_child(i,j)  = (i+1, j*2),   if any
my_right_child(i,j) = (i+1, j*2+1), if any

MPI_Send(..., rank(my_parent(i,j)), ...)
MPI_Recv(..., rank(my_left_child(i,j)), ...)
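A minimal C sketch of the rank/(i,j) translations given above; the function names (rank_to_ij, ij_to_rank, parent, ...) are ours, not part of MPI, and the process count p = 8 is the assumption of the example.

    #include <math.h>
    #include <stdio.h>

    /* Sketch of the binary-tree index translations from the slide. */
    static void rank_to_ij(int rank, int *i, int *j) {
        *i = (int)floor(log2((double)(rank + 1)));
        *j = rank - (1 << *i) + 1;
    }
    static int ij_to_rank(int i, int j)  { return j - 1 + (1 << i); }
    static int parent(int i, int j)      { return ij_to_rank(i - 1, j / 2); }
    static int left_child(int i, int j)  { return ij_to_rank(i + 1, 2 * j); }     /* if it exists */
    static int right_child(int i, int j) { return ij_to_rank(i + 1, 2 * j + 1); } /* if it exists */

    int main(void) {
        int p = 8;                                   /* number of processes (assumed) */
        for (int rank = 0; rank < p; rank++) {
            int i, j;
            rank_to_ij(rank, &i, &j);
            printf("rank %d -> (%d,%d)", rank, i, j);
            if (rank > 0)              printf(", parent %d", parent(i, j));
            if (left_child(i, j)  < p) printf(", left child %d", left_child(i, j));
            if (right_child(i, j) < p) printf(", right child %d", right_child(i, j));
            printf("\n");
        }
        return 0;
    }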
Typical Topologies
Common topologies (see Section 3.1.2):
- Linear array
- Ring
- 2-D grid
- 2-D torus
- One-level tree
- Fully connected graph
- Arbitrary graph
Two options for all topologies:
- Monodirectional links: more constrained but simpler
- Bidirectional links: less constrained but potentially more complicated
By "complicated" we typically mean more bug-prone.
We'll look at Ring and Grid in detail.
Main Assumption and Big Question
- The main assumption is that once we've defined the virtual topology we forget it's virtual and write parallel algorithms assuming it's physical:
  - We assume communications on different (virtual) links do not interfere with each other.
  - We assume that computations on different (virtual) processors do not interfere with each other.
- The big question: how well do these assumptions hold? (The question is mostly about the network.)
- Two possible "bad" cases:
  - Case #1: the assumptions do not hold and there are interferences.
    We'll most likely achieve bad performance; our performance models will be broken and reasoning about performance improvements will be difficult.
  - Case #2: the assumptions do hold but we leave a lot of the network resources unutilized.
    We could perhaps do better with another virtual topology.
Which Virtual Topology to Pick
- We will see that some topologies are really well suited to certain algorithms.
- The question is whether they are well suited to the underlying architecture.
- The goal is to strike a good compromise:
  - Not too bad given the algorithm
  - Not too bad given the platform
- Fortunately, many platforms these days use switches, which naturally support many virtual topologies because they support concurrent communications between disjoint pairs of processors.
- As part of a programming assignment, you will explore whether some virtual topology makes sense on our cluster.
Topologies and Data Distribution
- One of the common steps when writing a parallel algorithm is to distribute some data (array, data structure, etc.) among the processors in the topology.
- Typically, one does the data distribution in a way that matches the topology. E.g., if the data is 3-D, then it's nice to have a 3-D virtual topology.
- One question that arises then is: how is the data distributed across the topology?
- In the next set of slides we look at our first topology: a ring.
Part II: Communications on a Ring
Outline
5. Assumptions
6. Broadcast
7. Scatter
8. All-to-All
9. Broadcast: Going Faster
Ring Topology (Section 3.3)
- Each processor is identified by a rank: MY_NUM()
- There is a way to find the total number of processors: NUM_PROCS()
- Each processor can send a message to its successor: SEND(addr, L)
- And receive a message from its predecessor: RECV(addr, L)

[Figure: ring of processors P0, P1, P2, P3, ..., Pp-1.]
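One plausible way to realize these ring primitives on top of MPI is sketched below; the wrapper names mirror the pseudo-code of the slides, and the MPI_BYTE datatype and tag 0 are arbitrary choices of ours.

    #include <mpi.h>

    /* Possible MPI realization of the ring primitives (a sketch). */
    static int MY_NUM(void)    { int q; MPI_Comm_rank(MPI_COMM_WORLD, &q); return q; }
    static int NUM_PROCS(void) { int p; MPI_Comm_size(MPI_COMM_WORLD, &p); return p; }

    static void SEND(void *addr, int len) {              /* send to my successor       */
        MPI_Send(addr, len, MPI_BYTE, (MY_NUM() + 1) % NUM_PROCS(), 0, MPI_COMM_WORLD);
    }
    static void RECV(void *addr, int len) {              /* receive from my predecessor */
        MPI_Recv(addr, len, MPI_BYTE, (MY_NUM() - 1 + NUM_PROCS()) % NUM_PROCS(),
                 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }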
Virtual vs. Physical Topology
- It is actually difficult to precisely model the cost of communication. E.g., MPI implementations do various optimizations given the message sizes.
- We will be using a simple model: Time = L + m/B
  - L: start-up cost or latency
  - B: bandwidth (b = 1/B)
  - m: message size
Several Options
- Both Send() and Recv() are blocking:
  - Called "rendez-vous"
  - Very old-fashioned systems
- Recv() is blocking, but Send() is not:
  - Pretty standard
  - MPI supports it
- Both Recv() and Send() are non-blocking:
  - Pretty standard as well
  - MPI supports it
Assumptions about Concurrency
- One question that's important: can the processor do multiple things at the same time?
- Typically we will assume that the processor can send, receive, and compute at the same time:
    Call MPI_IRecv() || Call MPI_ISend() || Compute something
- This of course implies that the three operations are independent:
  - E.g., you don't want to send the result of the computation
  - E.g., you don't want to send what you're receiving (forwarding)
- When writing parallel algorithms (in pseudo-code), we'll simply indicate concurrent activities with a || sign.
Collective Communications
- Broadcasts, scatters, all-to-all exchanges, etc. are collective operations.
- MPI provides those, and its implementations likely:
  - do not use the ring logical topology,
  - utilize the physical resources well.
- Let's still go through the exercise of writing some collective communication algorithms.
- We will see that for some algorithms we really want to do these communications "by hand" on our virtual topology rather than using the MPI collectives.
Broadcast (Section 3.3.1)
- We want to write a program that has Pk send the same message of length m to all other processors: Broadcast(k, addr, m).
- On the ring, we just send to the next processor, and so on, with no parallel communications whatsoever.
- This is of course not the way one should implement a broadcast in practice if the physical topology is not merely a ring; MPI uses some type of tree topology.
Broadcast (Section 3.3.1)

Broadcast(k, addr, m):
  q = MY_NUM()
  p = NUM_PROCS()
  if (q == k)
    SEND(addr, m)
  else if (q == (k - 1) mod p)
    RECV(addr, m)              /* last processor on the ring: receive only */
  else
    RECV(addr, m)
    SEND(addr, m)              /* forward to the successor */
  endif

- Assumes a blocking receive; sending may be non-blocking.
- The broadcast time is (p - 1)(L + m b).
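For concreteness, here is a hedged MPI sketch of this ring broadcast (message treated as raw bytes, tag 0 chosen arbitrarily); it mirrors the pseudo-code above rather than MPI_Bcast's internal algorithm.

    #include <mpi.h>

    /* Sketch of the ring broadcast above in plain MPI (hop-by-hop forwarding). */
    void ring_broadcast(int k, void *addr, int m, MPI_Comm comm) {
        int q, p;
        MPI_Comm_rank(comm, &q);
        MPI_Comm_size(comm, &p);
        int succ = (q + 1) % p;
        int pred = (q - 1 + p) % p;

        if (q == k) {                                   /* the root only sends        */
            MPI_Send(addr, m, MPI_BYTE, succ, 0, comm);
        } else if (q == (k - 1 + p) % p) {              /* last on the ring: receive  */
            MPI_Recv(addr, m, MPI_BYTE, pred, 0, comm, MPI_STATUS_IGNORE);
        } else {                                        /* receive then forward       */
            MPI_Recv(addr, m, MPI_BYTE, pred, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(addr, m, MPI_BYTE, succ, 0, comm);
        }
    }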
Scatter (Section 3.3.2)
- Processor k sends a different message to all other processors (and to itself).
- Pk stores the message destined to Pq at address addr[q], including a message at addr[k].
- At the end of the execution, each processor holds the message it has received in msg.
- The principle is just to pipeline communication by starting to send the message destined to Pk-1, the most distant processor.
Scatter (Section 3.3.2)

Scatter(k, msg, addr, m):
  q = MY_NUM()
  p = NUM_PROCS()
  if (q == k)
    for i = 0 to p-2
      SEND(addr[(k + p - 1 - i) mod p], m)
    msg ← addr[k]
  else
    RECV(tempR, m)
    for i = 1 to (k - 1 - q) mod p
      tempS ↔ tempR              /* swapping of send buffer and receive buffer (pointer) */
      SEND(tempS, m) || RECV(tempR, m)
    msg ← tempR
  endif

Same execution time as the broadcast: (p - 1)(L + m b).

Example with k = 2, p = 4 (ring 0 → 1 → 2 → 3 → 0):
- Proc q=2: send addr[(2+4-1-0) mod 4 = 1], send addr[(2+4-1-1) mod 4 = 0], send addr[(2+4-1-2) mod 4 = 3]; msg = addr[2]
- Proc q=3: recv (addr[1]); loop (2-1-3) mod 4 = 2 times: send(addr[1]) || recv(addr[0]), send(addr[0]) || recv(addr[3]); msg = addr[3]
- Proc q=0: recv (addr[1]); loop (2-1-0) mod 4 = 1 time: send(addr[1]) || recv(addr[0]); msg = addr[0]
- Proc q=1: recv (addr[1]); loop (2-1-1) mod 4 = 0 times; msg = addr[1]
All-to-all (Section 3.3.3)
[Figure: all-to-all exchange on a ring of three processors 0, 1, 2.]
A faster broadcast?
- One can cut the message into many small pieces, say r pieces (where m is divisible by r).
- The root processor just sends r messages.
- The performance is as follows:
  - Consider the last processor to get the last piece of the message.
  - There need to be p-1 steps for the first piece to arrive, which takes (p-1)(L + m b / r).
  - Then the remaining r-1 pieces arrive one after another, which takes (r-1)(L + m b / r).
  - For a total of: (p - 2 + r)(L + m b / r).
A faster broadcast?
The question is: what is the value of r that minimizes (p - 2 + r)(L + m b / r)?
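The surviving slides do not show the answer, but a quick calculus sketch gives it: treating r as continuous and differentiating the total time,

\[
T(r) = (p-2+r)\Bigl(L + \tfrac{mb}{r}\Bigr)
     = (p-2)L + mb + rL + \tfrac{(p-2)mb}{r},
\qquad
\frac{dT}{dr} = L - \frac{(p-2)mb}{r^{2}} = 0
\;\Longrightarrow\;
r_{\text{opt}} = \sqrt{\frac{(p-2)mb}{L}},
\qquad
T(r_{\text{opt}}) = \Bigl(\sqrt{(p-2)L} + \sqrt{mb}\Bigr)^{2}.
\]

In practice r must be an integer that divides m, so one would round r_opt to a nearby admissible value.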
Well-known Network Principle
We have seen that if we cut a (large) message into many (small) messages, then we can send the message over multiple hops (in our case p-1).
Part III

Outline
10. Speedup
11. Amdahl's Law
Speedup
- We need a metric to quantify the impact of your performance enhancement.
- Speedup: ratio of "old" time to "new" time.
  - old time = 2h, new time = 1h
  - speedup = 2h / 1h = 2
- Sometimes one talks about a "slowdown" in case the "enhancement" is not beneficial.
  - Happens more often than one thinks.
Parallel Performance
Speedup
[Figure: speedup as a function of the number of processors, showing super-linear, linear, and sub-linear speedup curves.]
Bad News: Amdahl's Law
If f is the fraction of the execution that cannot be parallelized (the sequential fraction), then the speedup on p processors is bounded:

    S_p = 1 / (f + (1 - f)/p)  ≤  1/f

[Figure: S_p as a function of the number of processors (10 to 60); the curve flattens well below p.]
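A tiny C sketch to get a feel for how harsh this bound is; the sample values of f are arbitrary examples, not taken from the slides.

    #include <stdio.h>

    /* Amdahl bound as reconstructed above (f = sequential fraction). */
    static double amdahl(double f, int p) { return 1.0 / (f + (1.0 - f) / p); }

    int main(void) {
        const double f[] = { 0.01, 0.05, 0.25 };    /* illustrative fractions */
        for (int k = 0; k < 3; k++) {
            printf("f = %.2f (upper bound 1/f = %.0f):", f[k], 1.0 / f[k]);
            for (int p = 2; p <= 64; p *= 2)
                printf("  S_%d = %.2f", p, amdahl(f[k], p));
            printf("\n");
        }
        return 0;
    }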
Lessons from Amdahl's Law
- It's a law of diminishing returns.
- If a significant fraction of the code (in terms of time spent in it) is not parallelizable, then parallelization is not going to be good.
- It sounds obvious, but people new to high performance computing often forget how bad Amdahl's law can be.
- Luckily, many applications can be almost entirely parallelized and f is small.
Parallel Efficiency
Efficiency is the speedup divided by the number of processors: Eff(p) = S(p) / p.
Scalability
- Measure of the "effort" needed to maintain efficiency while adding processors.
- Efficiency also depends on the problem size: Eff(n, p).
- Isoefficiency: at which rate does the problem size need to be increased to maintain efficiency?
  - n_c(p) such that Eff(n_c(p), p) = c
- By making a problem ridiculously large, one can typically achieve good efficiency.
- Problem: is that how the machine/code will be used?
Part IV: Algorithms on a Ring
Outline
12. Matrix Vector Product
    - OpenMP Version
    - First MPI Version
    - Distributing Matrices
    - Second MPI Version
    - Third MPI Version
    - Mixed Parallelism Version
13. Matrix Multiplication
14. Stencil Application
    - Principle
    - Greedy Version
    - Reducing the Granularity
15. LU Factorization
    - Gaussian Elimination
    - LU
Parallel Matrix-Vector Product
We want to compute y = A x. Let n be the size of the matrix (Section 4.1 in the book). The sequential code is:

int a[n][n];
int x[n];
for i = 0 to n-1 {
  y[i] = 0;
  for j = 0 to n-1
    y[i] = y[i] + a[i,j] * x[j];
}

How do we do this in parallel?

[Figure: y[N] = a[N][N] * x[N].]
Parallel Matrix-Vector Product
- Computations of the elements of vector y are independent.
- Each of these computations requires one row of matrix a and vector x.
- In shared memory, for example:

#pragma omp parallel for private(i,j)
for i = 0 to n-1 {
  y[i] = 0;
  for j = 0 to n-1
    y[i] = y[i] + a[i,j] * x[j];
}
Parallel Matrix-Vector Product
- In distributed memory, one possibility is that each process has a full copy of matrix a and of vector x.
- Each processor declares a vector y of size n/p (we assume that p divides n).
- Therefore, the code can just be:

load(a); load(x);
p = NUM_PROCS(); r = MY_RANK();
for (i = r*n/p; i < (r+1)*n/p; i++) {
  y[i - r*n/p] = 0;
  for (j = 0; j < n; j++)
    y[i - r*n/p] = y[i - r*n/p] + a[i][j] * x[j];
}

- It's embarrassingly parallel.
- What about the result?
What about the result?
- After the processes complete the computation, each process has a piece of the result.
- One probably wants to, say, write the result to a file. This requires synchronization so that the I/O is done correctly. For example:

. . .
if (r != 0)
  recv(&token, 1);           /* wait for the previous process to finish writing */
open(file, "append");
for (j = 0; j < n/p; j++)
  write(file, y[j]);
close(file);
send(&token, 1);             /* pass the token to the next process */
barrier();                   /* optional */

- One could also use a "gather" so that the entire vector is returned to processor 0, provided vector y fits in the memory of a single node (see the sketch below).
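As an aside, the "gather" variant mentioned above could look like the following MPI sketch (assuming p divides n and the full vector fits on the root; the function name is ours).

    #include <mpi.h>

    /* Sketch of the "gather" alternative: collect the n/p-element pieces of y
     * on process 0, which can then write the whole vector by itself.          */
    void gather_result(float *y_piece, float *y_full, int n, int p, MPI_Comm comm) {
        MPI_Gather(y_piece, n / p, MPI_FLOAT,
                   y_full,  n / p, MPI_FLOAT,
                   0, comm);
    }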
What if matrix a is too big?
- Matrix a may not fit in memory, which is a motivation to use distributed-memory implementations.
- In this case, each processor can store only a piece of matrix a.
- For the matrix-vector multiply, each processor can just store n/p rows of the matrix.
  - Conceptually: A[n][n]
  - But the program declares a[n/p][n]
- This raises the (annoying) issue of global indices versus local indices.
Global vs. Local indices
- When an array is split among processes:
  - a global index (I,J) references an element of the matrix;
  - a local index (i,j) references an element of the local array that stores a piece of the matrix.
- Translation between global and local indices: think of the algorithm in terms of global indices, implement it in terms of local indices.

For the block-row distribution over P0, P1, P2 (each owning n/p rows):
  a[i][j] on the process of rank `rank` corresponds to A[(n/p)*rank + i][j]
  e.g., Global: A[5][3]  <->  Local: a[1][3] on process P1
Global Index Computation
- Real-world parallel code often implements actual translation functions: GlobalToLocal(), LocalToGlobal().
- This may be a good idea in your code, although for the ring topology the computation is pretty easy, and writing functions may be overkill (a sketch follows below).
- We'll see more complex topologies with more complex associated data distributions, and then it's probably better to implement such functions.
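For the block-row distribution used here, such translation functions are one-liners; a C sketch (the struct and function names are ours) could be:

    /* Sketch of index translation for the block-row distribution
     * a[i][j] <-> A[(n/p)*rank + i][j]; only the row index changes. */
    typedef struct { int rank; int i; } local_row;   /* owning process + local row */

    static local_row GlobalToLocal(int I, int n, int p) {
        local_row loc = { I / (n / p), I % (n / p) };
        return loc;
    }
    static int LocalToGlobal(int rank, int i, int n, int p) {
        return rank * (n / p) + i;
    }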
Distributions of arrays
At this point we have:
- the 2-D array a distributed,
- the 1-D array y distributed,
- the 1-D array x replicated,
- global/local index translations.
It may require synchronization to load/save the array elements to file.
All vectors distributed?
- So far we have array x replicated.
- It is usual to try to have all arrays involved in the same computation be distributed in the same way.
  - This makes it easier to read the code without constantly keeping track of what's distributed and what's not.
  - E.g., "local indices for array y are different from the global ones, but local indices for array x are the same as the global ones" will lead to bugs.
- What one would like is for each process to have:
  - n/p rows of matrix A in an array a[n/p][n]
  - n/p components of vector x in an array x[n/p]
  - n/p components of vector y in an array y[n/p]
- It turns out there is an elegant solution to do this.
Principle of the Algorithm
Initial data distribution for n = 8, p = 4, n/p = 2:
  P0: rows A[0..1][*] and components x0, x1
  P1: rows A[2..3][*] and components x2, x3
  P2: rows A[4..5][*] and components x4, x5
  P3: rows A[6..7][*] and components x6, x7
Principle of the Algorithm (Steps 0-3)
At each step, every processor multiplies the n/p x n/p block of its rows of A that matches the piece of x it currently holds, then the pieces of x shift one position along the ring:
- Step 0: P0 uses columns 0-1 with (x0,x1); P1 uses columns 2-3 with (x2,x3); P2 uses columns 4-5 with (x4,x5); P3 uses columns 6-7 with (x6,x7).
- Step 1: P0 uses columns 6-7 with (x6,x7); P1 uses columns 0-1 with (x0,x1); P2 uses columns 2-3 with (x2,x3); P3 uses columns 4-5 with (x4,x5).
- Step 2: P0 uses columns 4-5 with (x4,x5); P1 uses columns 6-7 with (x6,x7); P2 uses columns 0-1 with (x0,x1); P3 uses columns 2-3 with (x2,x3).
- Step 3: P0 uses columns 2-3 with (x2,x3); P1 uses columns 4-5 with (x4,x5); P2 uses columns 6-7 with (x6,x7); P3 uses columns 0-1 with (x0,x1).
Principle of the Algorithm (Final State)
After p steps, each processor has computed its piece of y. The final exchange of vector x (which brings each piece back to its original owner) is not strictly necessary, but one may want to have x distributed at the end of the computation like it was distributed at the beginning.
Algorithm
Uses two buffers: tempS for sending and tempR for receiving.

float a[n/p][n], x[n/p], y[n/p];
r ← n/p
tempS ← x                              /* my piece of the vector (n/p elements) */
for (step = 0; step < p; step++) {     /* p steps */
  SEND(tempS, r)
  RECV(tempR, r)
  for (i = 0; i < n/p; i++)
    for (j = 0; j < n/p; j++)
      y[i] ← y[i] + a[i, ((rank - step) mod p) * n/p + j] * tempS[j]
  tempS ↔ tempR
}

In our example, the process of rank 2 at step 3 would work with the 2x2 matrix block starting at column ((2 - 3) mod 4) * 8/4 = 3 * 8/4 = 6.
A Few General Principles
Performance
- There are p identical steps.
- During each step each processor performs three activities: computation, receiving, and sending.
  - Computation: r² w   (w: time to perform one += * operation)
  - Receiving: L + r b
  - Sending: L + r b
- Hence, without overlap, T(p) = p (r² w + 2(L + r b)).
Asymptotic Performance
Since r = n/p, T(p) = n² w / p + 2pL + 2nb: for large n the computation term dominates, so the speedup with respect to the sequential time n² w tends to p.
Conclusion: the algorithm is asymptotically optimal.
Performance (2)
- Note that an algorithm that initially broadcasts the entire vector to all processors and then has every processor compute independently would run in time (p-1)(L + n b) + p r² w.
- One could use the pipelined broadcast for that. This broadcast-based approach:
  - has the same asymptotic performance,
  - is a simpler algorithm,
  - wastes only a tiny little bit of memory,
  - is arguably much less elegant.
- It is important to think of simple solutions and see what works best given the expected matrix size, etc.
Back to the Algorithm

float a[n/p][n], x[n/p], y[n/p];
r ← n/p
tempS ← x                              /* my piece of the vector (n/p elements) */
for (step = 0; step < p; step++) {     /* p steps */
  SEND(tempS, r)
  RECV(tempR, r)
  for (i = 0; i < n/p; i++)
    for (j = 0; j < n/p; j++)
      y[i] ← y[i] + a[i, ((rank - step) mod p) * n/p + j] * tempS[j]
  tempS ↔ tempR
}

- In the above code, at each iteration, the SEND, the RECV, and the computation can all be done in parallel.
- Therefore, one can overlap communication and computation by using non-blocking SEND and RECV if available.
- MPI provides MPI_Isend() and MPI_Irecv() for this purpose (sketch below).
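A hedged sketch of one iteration of that loop with MPI non-blocking calls; variable names follow the pseudo-code, while the row-major layout of a, the tag, and the function name are our assumptions.

    #include <mpi.h>

    /* One iteration of the overlapped loop: tempS is sent to the successor
     * while tempR is received from the predecessor and the local block
     * product is computed; a is the n/p x n local block row, row-major.    */
    void ring_step(const float *tempS, float *tempR, const float *a, float *y,
                   int r, int n, int p, int rank, int step, MPI_Comm comm)
    {
        MPI_Request req[2];
        int succ = (rank + 1) % p;
        int pred = (rank - 1 + p) % p;

        MPI_Isend((void *)tempS, r, MPI_FLOAT, succ, 0, comm, &req[0]);
        MPI_Irecv(tempR,         r, MPI_FLOAT, pred, 0, comm, &req[1]);

        int base = (((rank - step) % p + p) % p) * r;   /* first column of the block */
        for (int i = 0; i < r; i++)
            for (int j = 0; j < r; j++)
                y[i] += a[i * n + base + j] * tempS[j];

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);       /* caller then swaps tempS/tempR */
    }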
More Concurrent Algorithm
Notation for concurrent activities: ||

float a[n/p][n], x[n/p], y[n/p];
r ← n/p
tempS ← x                              /* my piece of the vector (n/p elements) */
for (step = 0; step < p; step++) {     /* p steps */
  SEND(tempS, r)
  || RECV(tempR, r)
  || for (i = 0; i < n/p; i++)
       for (j = 0; j < n/p; j++)
         y[i] ← y[i] + a[i, ((rank - step) mod p) * n/p + j] * tempS[j]
  tempS ↔ tempR
}
Better Performance
- There are p identical steps.
- During each step each processor performs three concurrent activities: computation, receiving, and sending.
  - Computation: r² w
  - Receiving: L + r b
  - Sending: L + r b
- Hence: T(p) = p max(r² w, L + r b)
- Same asymptotic performance as above, but better performance for smaller values of n.
Hybrid parallelism
- We have said many times that multi-core architectures are about to become the standard.
- When building a cluster, the nodes you will buy will be multi-core.
- Question: how to exploit the multiple cores? Or, in our case, how to exploit the multiple processors in each node?
- Option #1: run multiple processes per node.
  - Causes more overhead and more communication.
  - In fact it will cause network communication among processes within a node! MPI will not know that processes are co-located.
OpenMP MPI Program
- Option #2: run a single multi-threaded process per node.
  - Much lower overhead, fast communication within a node.
  - Done by combining MPI with OpenMP!
- Just write your MPI program, then add OpenMP pragmas around loops.
- Let's look back at our matrix-vector multiplication example.
Hybrid Parallelism

float a[n/p][n], x[n/p], y[n/p];
r ← n/p
tempS ← x                              /* my piece of the vector (n/p elements) */
for (step = 0; step < p; step++) {     /* p steps */
  SEND(tempS, r)
  || RECV(tempR, r)
  || #pragma omp parallel for private(i,j)
     for (i = 0; i < n/p; i++)
       for (j = 0; j < n/p; j++)
         y[i] ← y[i] + a[i, ((rank - step) mod p) * n/p + j] * tempS[j]
  tempS ↔ tempR
}

This is called hybrid parallelism:
- communication via the network among nodes,
- communication via the shared memory within nodes.
Matrix Multiplication on the Ring
- See Section 4.2. It turns out one can do matrix multiplication in a way very similar to matrix-vector multiplication.
- A matrix multiplication is just the computation of n² scalar products, not just n.
- We have three matrices, A, B, and C, and we want to compute C = A*B.
- We distribute the matrices so that each processor "owns" a block row of each matrix.
- This is easy to do if row-major order is used, because then all matrix elements owned by a processor are contiguous in memory.
Data Distribution
[Figure: matrices A, B, and C of size n x n, each split into p block rows of r = n/p rows; each processor owns one block row of each matrix.]
First Step
[Figure: p = 4, focusing on processor P1. P1 owns block row A1,* (blocks A1,0 A1,1 A1,2 A1,3) and initially holds block row B1,* (blocks B1,0 B1,1 B1,2 B1,3). It updates its block row of C: C1,j += A1,1 x B1,j for j = 0..3.]
Shifting of block rows of B
[Figure: p = 4, processor Pq keeps its block row Aq,* in place while the block rows of B shift along the ring: each processor sends its current block row of B to its successor and receives a new one from its predecessor.]
Second step
[Figure: p = 4, processor P1 now holds block row B0,* (blocks B0,0 B0,1 B0,2 B0,3) and updates C1,j += A1,0 x B0,j for j = 0..3.]
Algorithm
- In the end, every Ci,j block has the correct value: Ai,0 B0,j + Ai,1 B1,j + ...
- Basically, this is the same algorithm as for matrix-vector multiplication, replacing the partial scalar products by submatrix products (it gets tricky with loops and indices).

float A[N/p][N], B[N/p][N], C[N/p][N];
r ← N/p
tempS ← B                              /* my block row of B */
q ← MY_RANK()
for (step = 0; step < p; step++) {     /* p steps */
  SEND(tempS, r*N)
  || RECV(tempR, r*N)
  || for (l = 0; l < p; l++)
       for (i = 0; i < N/p; i++)
         for (j = 0; j < N/p; j++)
           for (k = 0; k < N/p; k++)
             C[i, l*r+j] ← C[i, l*r+j] + A[i, r*((q - step) mod p) + k] * tempS[k, l*r+j]
  tempS ↔ tempR
}
Performance
- Performance analysis is straightforward: there are p steps, and each step takes time max(n r² w, L + n r b).
  - At each step a processor computes p products of r x r blocks, i.e., p r³ = n r² operations.
- Hence, the running time is: T(p) = p max(n r² w, L + n r b).
- Note that a naive algorithm computing n matrix-vector products in sequence using our previous algorithm would take time T(p) = p max(n r² w, nL + n r b).
- We just saved network latencies!
Parallel
Algorithms
A. Legrand
Stencil Application (Section 4.3)
We’ve talked about stencil applications in the context of shared-memory programs
[Figure: n×n domain; each cell is updated from its West and North neighbours, new = update(old, W, N); the numbers show the wavefront step at which each anti-diagonal of cells can be computed]
We found that we had to cut the matrix in “small” blocks
On a ring the same basic idea applies, but let’s do it step-by-step
Courtesy of Henri Casanova
109 / 235
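Before distributing the computation, it may help to see the update pattern above as plain sequential code. The sketch below is illustrative only: the body of update() is a made-up placeholder, and only the West/North dependence (with nil at the borders) comes from the slide.

/* Illustrative sequential sweep of the stencil above: each cell is
 * updated from its West and North neighbours, with nil (here 0.0)
 * at the borders.  The update() rule itself is a made-up placeholder. */
#define N 8
static double update(double old, double west, double north) {
    return (old + west + north) / 3.0;   /* placeholder rule */
}
void stencil_sweep(double A[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double west  = (j > 0) ? A[i][j - 1] : 0.0;   /* nil at border */
            double north = (i > 0) ? A[i - 1][j] : 0.0;   /* nil at border */
            A[i][j] = update(A[i][j], west, north);
        }
}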
Parallel
Algorithms
A. Legrand
Stencil Application
Matrix Vector
Product
Open MP
Version
Let us, for now, consider that the domain is of size n×n and that we have p=n processors
  (classic way to first approach a problem)
Each processor is responsible for computing one row of the domain (at each iteration)
Each processor holds one row of the domain and has the following declaration:
  var A: array[0..n-1] of real
One first simple idea is to have each processor send each cell value to its neighbor as soon as that cell value is computed
  Basic principle: do communication as early as possible to get your “neighbors” started as early as possible
  Remember that one of the goals of a parallel program is to reduce idle time on the processors
We call this algorithm the Greedy algorithm, and seek an evaluation of its performance
Courtesy of Henri Casanova
110 / 235
Parallel
Algorithms
A. Legrand
The Greedy Algorithm
Matrix Vector
Product
q = MY_NUM()
p = NUM_PROCS()

// first element of the row
if (q == 0) then
  A[0] = Update(A[0],nil,nil)
  Send(A[0],1)
else
  Recv(v,1)
  A[0] = Update(A[0],nil,v)
endif
// other elements
for j = 1 to n-1
  if (q == 0) then
    A[j] = Update(A[j], A[j-1], nil)
    Send(A[j],1)
  elsif (q == p-1) then
    Recv(v,1)
    A[j] = Update(A[j], A[j-1], v)
  else
    Send(A[j-1], 1) || Recv(v,1)
    A[j] = Update(A[j], A[j-1], v)
  endif
endfor
// note the use of “nil” for borders and corners
Courtesy of Henri Casanova
111 / 235
Parallel
Algorithms
A. Legrand
Greedy Algorithm
Matrix Vector
Product
Open MP
Version
This is all well and good, but typically we have n > p
First MPI
Version
Assuming that p divides n, each processor will hold n/p
Distributing rows
Matrices
Second MPI
Good for load balancing
Version The goal of a greedy algorithm is always to allow
Third MPI
Version processors to start computing as early as possible
Mixed
Parallelism
This suggests a cyclic allocation of rows among processors
Version
[Figure: cyclic allocation of the rows of the domain: P0, P1, P2, P0, P1, P2, P0, P1, P2, ...]
P1 can start computing after P0 has computed its first cell
Courtesy of Henri Casanova
112 / 235
Parallel
Algorithms
A. Legrand
Greedy Algorithm
Matrix Vector
Product
Open MP
Version
Each processor holds n/p rows of the domain
First MPI
Version Thus it declares:
Distributing
Matrices
Second MPI
var A[0..n/p-1,n] of real
Version
Third MPI
This is a contiguous array of rows, but these rows are not contiguous in the domain
Therefore we have a non-trivial mapping between global indices and local indices, but we’ll see that they don’t appear in the code
Multiplication
Stencil
Application Let us rewrite the algorithm
Principle
Greedy Version
Reducing the
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
113 / 235
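As a hedged illustration of that global/local mapping (this is not code from the slides): with a cyclic allocation, global row g lives on processor g mod p as its (g div p)-th local row. A small C helper might look like the following; the names placement_t, cyclic_placement and cyclic_global are invented for this sketch.

/* Illustrative helpers for the cyclic allocation of rows: global row g
 * is held by processor g mod p, at local index g / p. */
typedef struct { int owner; int local; } placement_t;

placement_t cyclic_placement(int g, int p) {
    placement_t pl = { g % p, g / p };
    return pl;
}
int cyclic_global(int owner, int local, int p) {   /* inverse mapping */
    return local * p + owner;
}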
Parallel
Algorithms
A. Legrand
The Greedy Algorithm
Matrix Vector
q = MY_NUM()
p = NUM_PROCS()
for i = 0 to n/p - 1
  if (q == 0) and (i == 0) then
    A[0,0] = Update(A[0,0],nil,nil)
    Send(A[0,0],1)
  else
    Recv(v,1)
    A[i,0] = Update(A[i,0],nil,v)
  endif
  for j = 1 to n-1
    if (q == 0) and (i == 0) then
      A[i,j] = Update(A[i,j], A[i,j-1], nil)
      Send(A[i,j],1)
    elsif (q == p-1) and (i == n/p-1) then
      Recv(v,1)
      A[i,j] = Update(A[i,j], A[i,j-1], v)
    else
      Send(A[i,j-1], 1) || Recv(v,1)
      A[i,j] = Update(A[i,j], A[i,j-1], v)
    endif
  endfor
endfor
Courtesy of Henri Casanova
114 / 235
Parallel
Algorithms
A. Legrand
Performance Analysis
Matrix Vector
Product
Open MP
Version
Let T(n,p) denote the computation time of the algorithm for
First MPI a nxn domain and with p processors
Version
Distributing
At each step, a processor does at most three things
Matrices
Second MPI
Receive a cell
Version Send a cell
Third MPI
Version
Update a cell
Mixed
Parallelism
The algorithm is “clever” because at each step k, the
Version sending of messages from step k is overlapped with the
Matrix receiving of messages at step k+1
Multiplication
Therefore, the time needed to compute one algorithm step
Stencil
Application
is the sum of
Principle Time to send/receive a cell: L+b
Greedy Version Time to perform a cell update: w
Reducing the
Granularity So, if we can count the number of steps, we can simply
LU Factorization multiply and get the overall execution time
Gaussian
Elimination
LU
Courtesy of Henri Casanova
115 / 235
Parallel
Algorithms
A. Legrand
Performance Analysis
Matrix Vector
Product
Open MP
It takes p-1 steps before processor Pp-1 can start computing
Version
First MPI
its first cell
Version
Distributing
Thereafter, this processor can compute one cell at every step
Matrices
Second MPI
The processor holds n*n/p cells
Version
Third MPI
Therefore, the whole program takes: p-1+n*n/p steps
Version
Mixed
And the overall execution time:
  T(n,p) = (p - 1 + n²/p) (w + L + b)
The sequential time is: n² w
The speedup: S(n,p) = n² w / T(n,p)
When n gets large, T(n,p) ~ (n²/p) (w + L + b)
Therefore, Eff(n,p) ~ w / (w + L + b)
This could be WAY below one
In practice, and often, L + b >> w
Therefore, this greedy algorithm is probably not a good idea at all!
Courtesy of Henri Casanova
116 / 235
Parallel
Algorithms
A. Legrand
Granularity
Matrix Vector
Product
Open MP
Version
How do we improve on performance?
What really kills performance is that we have to
First MPI
Version
Distributing
Matrices do so much communication
Many bytes of data
Second MPI
Version
Third MPI
Version Many individual messages
Mixed
Parallelism
Version
So what we want is to augment the granularity of the algorithm
Our “tasks” are not going to be “update one cell” but rather “update a group of cells”
A. Legrand
Reducing the Granularity
Matrix Vector
Product
Open MP
Version
A simple approach: have a processor compute k
First MPI
Version
cells in sequence before sending them
Distributing
Matrices
This is in conflict with the “get processors to
Second MPI
Version compute as early as possible” principle we based
Third MPI
Version
our initial greedy algorithm on
Mixed So we will reduce communication cost, but will
Parallelism
Version
increase idle time
Matrix
Let us assume that k divides n
Each row now consists of n/k segments
If k does not divide n, we have left-over cells; this complicates the program and the performance analysis but, as usual, does not change the asymptotic performance
Elimination
LU
Courtesy of Henri Casanova
118 / 235
Parallel
Algorithms
A. Legrand
Reducing the Granularity
[Figure: each row is split into segments of k cells; the number in each segment is the step at which its owner computes it: P0: 0 1 2 3, P1: 1 2 3 4, P2: 2 3 4 5, P3: 3 4 5 6, P0: 4 5 6 ...]
The algorithm computes segment after segment
The time before P1 can start computing is the time for P0 to compute a whole segment
Therefore, it will take longer until Pp-1 can start computing
Courtesy of Henri Casanova
119 / 235
Parallel
Algorithms Reducing the Granularity
A. Legrand
More
Matrix Vector
Product
Open MP
Version
So far, we’ve allocated non-contiguous rows of
First MPI
Version
the domain to each processor
Distributing
Matrices
But we can reduce communication by allocating
Second MPI
Version processors groups of contiguous rows
Third MPI If two contiguous rows are on the same
Version
Mixed
Parallelism processors, there is no communication
Version
involved to update the cells of the second row
Matrix
Let us say that we allocate blocks of rows of size r to each processor
Application
Principle
We assume that r*p divides n
Greedy Version
Reducing the
Processor Pi holds rows j such that
i = floor(j/r) mod p
Granularity
LU Factorization
Gaussian
Elimination
This is really a “block cyclic” allocation
LU
Courtesy of Henri Casanova
120 / 235
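The “block cyclic” rule above fits in one line of C; the helper below is only an illustration of i = floor(j/r) mod p, not code from the slides.

/* Illustrative block-cyclic ownership rule: with blocks of r rows over
 * p processors, row j belongs to processor floor(j / r) mod p. */
int block_cyclic_owner(int j, int r, int p) {
    return (j / r) % p;
}
/* Example: r = 2, p = 3 -> rows 0,1 on P0, rows 2,3 on P1, rows 4,5 on P2,
 * rows 6,7 on P0 again, and so on. */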
Parallel
Algorithms
A. Legrand
Reducing the Granularity
Matrix Vector
Product
Open MP
Version
[Figure: block-cyclic allocation with blocks of r rows and segments of k cells; the step at which each processor computes each of its segments is P0: 0 1 2 3, P1: 1 2 3 4, P2: 2 3 4 5, P3: 3 4 5 6, P0: 4 5 6 7]
Application
Principle
Greedy Version
Reducing the
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
121 / 235
Parallel
Algorithms
A. Legrand
Idle Time?
Matrix Vector
Product
Open MP
Version One question is: does any processor stay idle?
First MPI
Version
Distributing
Processor P0 computes all values in its first block
Matrices
Second MPI
of rows in n/k algorithm steps
Version
Third MPI After that, processor P0 must wait for cell values
Version
Mixed
Parallelism
from processor Pp-1
Version
Matrix
But Pp-1 cannot start computing before p steps
Multiplication
Stencil
Therefore:
If p >= n/k, P0 is idle
If p < n/k, P0 is not idle
If p < n/k, then processors had better be able to buffer received cells while they are still computing
Possible increase in memory consumption
Courtesy of Henri Casanova
122 / 235
Parallel
Algorithms
A. Legrand
Performance Analysis
Matrix Vector
Product
Open MP
Version It is actually very simple
First MPI
Version
Distributing
At each step a processor is involved at most in
Matrices Receiving k cells from its predecessor
Second MPI
Version Sending k cells to its successor
Third MPI
Version
Mixed
Updating k*r cells
Parallelism
Version Since sending and receiving are overlapped, the
Matrix
Multiplication
time to perform a step is L + k b + k r w
Stencil
Question: How many steps?
Answer: It takes p-1 steps before Pp-1 can start doing anything. Pp-1 holds n²/(pkr) blocks
Execution time:
  T(n,p,r,k) = (p-1 + n²/(pkr)) (L + kb + krw)
LU
Courtesy of Henri Casanova
123 / 235
Parallel
Algorithms
A. Legrand
Performance Analysis
Matrix Vector
Product
Open MP
Version
Our naïve greedy algorithm had asymptotic efficiency equal to w / (w + L + b)
This algorithm does better: asymptotic efficiency = w / (w + L/(rk) + b/r)
  (divide n² w by p T(n,p,r,k) and make n large)
Mixed
Parallelism
In the formula for the efficiency we clearly see the effect of
Version the granularity increase
Matrix Asymptotic efficiency is higher
Multiplication
But not equal to 1
Stencil
Application Therefore, this is a “difficult” application to parallelize
Principle We can try to do the best we can by increasing r and k, but it’s
Greedy Version
Reducing the
never going to be perfect
Granularity One can compute the optimal values of r and k using
LU Factorization numerical solving
Gaussian
Elimination
See the book for details
LU
Courtesy of Henri Casanova
124 / 235
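To see the effect of the granularity parameters numerically, here is a small stand-alone C program that evaluates the two asymptotic efficiency formulas above; the w, L, b, r and k values are arbitrary illustration values, not measurements from the course.

/* Compare the asymptotic efficiencies of the greedy and the blocked
 * stencil algorithms for made-up platform parameters. */
#include <stdio.h>

int main(void) {
    double w = 1.0;    /* time per cell update        */
    double L = 100.0;  /* network latency             */
    double b = 10.0;   /* time per transferred cell   */
    double r = 8.0;    /* rows per block              */
    double k = 16.0;   /* cells per segment           */

    double eff_greedy  = w / (w + L + b);
    double eff_blocked = w / (w + L / (r * k) + b / r);

    printf("greedy  efficiency ~ %.3f\n", eff_greedy);
    printf("blocked efficiency ~ %.3f\n", eff_blocked);
    return 0;
}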
Outline
Parallel
Algorithms
A. Legrand
12 Matrix Vector Product
   Open MP Version
   First MPI Version
   Distributing Matrices
   Second MPI Version
   Third MPI Version
   Mixed Parallelism Version
13 Matrix Multiplication
14 Stencil Application
   Principle
   Greedy Version
   Reducing the Granularity
15 LU Factorization
   Gaussian Elimination
   LU
125 / 235
Parallel
Algorithms
A. Legrand
Solving Linear Systems of Eq.
Matrix Vector
Product
Open MP
Version
Method for solving Linear Systems
First MPI The need to solve linear systems arises in an estimated 75% of all scientific
Version
Distributing
computing problems [Dahlquist 1974]
Matrices Gaussian Elimination is perhaps the most well-known
Second MPI
Version method
Third MPI
Version
based on the fact that the solution of a linear system is
Mixed invariant under scaling and under row additions
Parallelism
Version
One can multiply a row of the matrix by a constant as long as one
multiplies the corresponding element of the right-hand side by the
Matrix same constant
Multiplication One can add a row of the matrix to another one as long as one
Stencil adds the corresponding elements of the right-hand side
Application Idea: scale and add equations so as to transform matrix A in
Principle
Greedy Version
an upper triangular matrix:
?
Reducing the ?
Granularity x ?
? =
LU Factorization ?
?
Gaussian
Elimination
LU equation n-i has i unknowns, with
Courtesy of Henri Casanova
126 / 235
Parallel
Algorithms
A. Legrand
Gaussian Elimination
    1  1  1         0
    1 -2  2   x =   4
    1  2 -1         2
Subtract row 1 from rows 2 and 3:
    1  1  1         0
    0 -3  1   x =   4
    0  1 -2         2
Multiply row 3 by 3 and add row 2:
    1  1  1         0
    0 -3  1   x =   4
    0  0 -5        10
Solving equations in reverse order (backsolving):
    -5x3 = 10           =>  x3 = -2
    -3x2 + x3 = 4       =>  x2 = -2
    x1 + x2 + x3 = 0    =>  x1 = 4
Courtesy of Henri Casanova
127 / 235
Parallel
Algorithms
A. Legrand
Gaussian Elimination
Matrix Vector
Product
Open MP
Version The algorithm goes through the matrix from the
First MPI
Version top-left corner to the bottom-right corner
Distributing
Matrices
the ith step eliminates non-zero sub-diagonal elements in column i, subtracting the ith row scaled by aji/aii from row j, for j = i+1, ..., n.
[Figure: at step i, the top-left part of the matrix holds values already computed; column i is zeroed below the diagonal; the bottom-right part holds values yet to be updated]
Courtesy of Henri Casanova
128 / 235
Parallel
Algorithms Sequential Gaussian
A. Legrand
Elimination
Matrix Vector
Product
Open MP
Version
Simple sequential algorithm
First MPI
Version
// for each column i
// zero it out below the diagonal by adding
// multiples of row i to later rows
for i = 1 to n-1
  // for each row j below row i
  for j = i+1 to n
    // add a multiple of row i to row j
    for k = i to n
      A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

Several “tricks” that do not change the spirit of the algorithm but make implementation easier and/or more efficient:
  Right-hand side is typically kept in column n+1 of the matrix and one speaks of an augmented matrix
  Compute the A(j,i)/A(i,i) term outside of the innermost loop
Gaussian
Elimination
LU
Courtesy of Henri Casanova
129 / 235
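For concreteness, here is a minimal C rendering of the pseudo-code above, using an augmented matrix (the right-hand side stored as column n) and 0-based indices. It is a sketch of the same algorithm, with the scaling term hoisted out of the innermost loop as the slide suggests, not the course's reference implementation.

/* Sequential Gaussian elimination on an augmented matrix a[n][n+1]
 * (column n holds the right-hand side), no pivoting, 0-based indices. */
void gaussian_elimination(int n, double a[n][n + 1]) {
    for (int i = 0; i < n - 1; i++) {           /* for each column i        */
        for (int j = i + 1; j < n; j++) {       /* for each row below row i */
            double scale = a[j][i] / a[i][i];   /* computed once per row    */
            for (int k = i; k <= n; k++)        /* includes the RHS column  */
                a[j][k] -= scale * a[i][k];
        }
    }
}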
Parallel
Algorithms
A. Legrand
Pivoting: Motivation
Matrix Vector
Product
A few pathological cases:
    0 1
    1 1
Division by small numbers → round-off error in computer arithmetic
Consider the following system:
    0.0001 x1 + x2 = 1.000
         x1 + x2 = 2.000
exact solution: x1 = 1.00010 and x2 = 0.99990
say we round off after 3 digits after the decimal point
Multiply the first equation by 10^4 and subtract it from the second equation:
    (1 - 1) x1 + (1 - 10^4) x2 = 2 - 10^4
But, in finite precision with only 3 digits:
    1 - 10^4 = -0.9999 E+4 ~ -0.999 E+4
    2 - 10^4 = -0.9998 E+4 ~ -0.999 E+4
Therefore, x2 = 1 and x1 = 0 (from the first equation)
Courtesy of Henri Casanova
130 / 235
Parallel
Algorithms
A. Legrand
Partial Pivoting
Matrix Vector
Product
Open MP
Version
One can just swap rows
First MPI x1 + x2 = 2.000
Version
Distributing
0.0001x1 + x2 = 1.000
Multiplying the first equation by 0.0001 and subtracting it from the second equation gives:
Third MPI (1 - 0.0001)x2 = 1 - 0.0001
Version 0.9999 x2 = 0.9999 => x2 = 1
Mixed
Parallelism and then x1 = 1
Version
Final solution is closer to the real solution. (Magical?)
Matrix
Multiplication Partial Pivoting
For numerical stability, one doesn’t go in order, but picks, among rows i to n, the one that has the largest element in column i
Application This row is swapped with row i (along with elements of the right hand side)
Principle before the subtractions
Greedy Version
the swap is not done in memory but rather one keeps an indirection array
Reducing the
Granularity
Total Pivoting
Look for the greatest element ANYWHERE in the matrix
LU Factorization Swap columns
Gaussian
Elimination
Swap rows
LU
Courtesy of Henri Casanova
Numerical stability is really a difficult field 131 / 235
Parallel
Algorithms Parallel Gaussian
A. Legrand
Elimination?
Matrix Vector
Product
Open MP
Version
Assume that we have one processor per matrix element
First MPI
Version
Distributing
Matrices
Second MPI
[Figure: one step of parallel Gaussian elimination with one processor per matrix element]
  to find the max aji: Reduction
  max aji needed to compute the scaling factor: Broadcast
  independent computation of the scaling factors: Compute
  every update needs the scaling factor and the element from the pivot row: Broadcasts
  independent computations: Compute
Courtesy of Henri Casanova
132 / 235
Parallel
Algorithms
A. Legrand
LU Factorization (Section 4.4)
Matrix Vector
Product
Open MP Gaussian Elimination is simple but
Version What if we have to solve many Ax = b systems for different values of b?
First MPI
Version
This happens a LOT in real applications
Distributing Another method is the “LU Factorization”
Matrices
Second MPI
Ax = b
Say we could rewrite A = L U, where L is a lower triangular matrix, and U is an upper triangular matrix    O(n³)
Then Ax = b is written L U x = b
  Solve L y = b    O(n²)
  Solve U x = y    O(n²)
Triangular system solves are easy
[Figure: lower triangular system L y = b (equation i has i unknowns) and upper triangular system U x = y (equation n-i has i unknowns)]
Courtesy of Henri Casanova
133 / 235
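Why the triangular solves are only O(n²) is easy to see from a sketch of forward substitution; the code below assumes a unit lower triangular L (ones on the diagonal), as produced by the LU factorization described next, and is illustrative rather than the course's reference code.

/* Forward substitution for L y = b with L unit lower triangular:
 * two nested loops, hence O(n^2) operations. */
void forward_substitution(int n, double L[n][n], double b[n], double y[n]) {
    for (int i = 0; i < n; i++) {
        y[i] = b[i];
        for (int j = 0; j < i; j++)      /* subtract already-known terms */
            y[i] -= L[i][j] * y[j];
        /* no division needed: L[i][i] == 1 by convention */
    }
}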
Parallel
Algorithms
A. Legrand
LU Factorization: Principle
Matrix Vector
Product
Open MP
Version
It works just like the Gaussian Elimination, but instead of zeroing
First MPI out elements, one “saves” scaling coefficients.
Version
Distributing
    1  2 -1   gaussian elimination,    1  2 -1   gaussian elimination,    1  2 -1
    4  3  1   save the scaling     →   4 -5  5   save the scaling     →   4 -5  5
    2  2  3   factors 4 and 2          2 -2  5   factor 2/5               2 2/5  3

         1   0   0              1  2 -1
    L =  4   1   0         U =  0 -5  5
         2  2/5  1              0  0  3

Magically, A = L x U !
Greedy Version
Reducing the
Granularity
LU Factorization
Should be done with pivoting as well
Gaussian
Elimination
LU
Courtesy of Henri Casanova
134 / 235
Parallel
Algorithms
A. Legrand
LU Factorization
Matrix Vector
Product
Open MP
Version
We’re going to look at the simplest possible version
First MPI No pivoting:just creates a bunch of indirections that are easy but make
Version
the code look complicated without changing the overall principle
Distributing
Matrices
Second MPI
LUsequential(A,n) {
  for k = 0 to n-2 {
    // preparing column k (stores the scaling factors)
    for i = k+1 to n-1
      aik ← aik / akk
    for j = k+1 to n-1
      // Task Tkj: update of column j
      for i = k+1 to n-1
        aij ← aij - aik * akj
  }
}
[Figure: column k below the diagonal holds the scaling factors]
Courtesy of Henri Casanova
135 / 235
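A minimal in-place C version of the sequential pseudo-code above, using the same convention: after the call, the strictly lower part of a holds the scaling factors (L without its unit diagonal) and the upper part holds U. This is a sketch, not the course's reference implementation.

/* In-place LU factorization without pivoting, matching the pseudo-code. */
void lu_sequential(int n, double a[n][n]) {
    for (int k = 0; k < n - 1; k++) {
        for (int i = k + 1; i < n; i++)      /* prepare column k          */
            a[i][k] /= a[k][k];              /* store the scaling factor  */
        for (int j = k + 1; j < n; j++)      /* task Tkj: update column j */
            for (int i = k + 1; i < n; i++)
                a[i][j] -= a[i][k] * a[k][j];
    }
}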
Parallel
Algorithms
A. Legrand
LU Factorization
Matrix Vector
Product
Open MP
Version
We’re going to look at the simplest possible version
First MPI
Version
No pivoting:just creates a bunch of indirections that are easy
Distributing but make the code look complicated without changing the
Matrices
overall principle
Second MPI
LUsequential(A,n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      aik ← aik / akk
    for j = k+1 to n-1
      // Task Tkj: update of column j
      for i = k+1 to n-1
        aij ← aij - aik * akj
  }
}
[Figure: at step k, column k below the diagonal holds the scaling factors; the trailing submatrix (rows i > k, columns j > k) is updated]
Courtesy of Henri Casanova
136 / 235
Parallel
Algorithms
A. Legrand
Parallel LU on a ring
Matrix Vector
Product
Open MP
Version
Since the algorithm operates by columns from left to right,
First MPI
Version
we should distribute columns to processors
Distributing
Matrices
Principle of the algorithm
Second MPI At each step, the processor that owns column k does the
Version
Third MPI
“prepare” task and then broadcasts the bottom part of column
Version k to all others
Mixed
Parallelism
Annoying if the matrix is stored in row-major fashion
Version
Remember that one is free to store the matrix in anyway one
Matrix wants, as long as it’s coherent and that the right output is
Multiplication
generated
Stencil
Application
After the broadcast, the other processors can then update
Principle their data.
Greedy Version
Reducing the
Assume there is a function alloc(k) that returns the rank of
Granularity the processor that owns column k
LU Factorization Basically so that we don’t clutter our program with too many
Gaussian
Elimination global-to-local index translations
In fact, we will first write everything in terms of global indices, so as to avoid all the annoying index arithmetic
Courtesy of Henri Casanova
137 / 235
Parallel
Algorithms
A. Legrand
LU-broadcast algorithm
Matrix Vector
Product
LUbroadcast(A,n) {
  q ← MY_NUM()
  p ← NUM_PROCS()
  for k = 0 to n-2 {
    if (alloc(k) == q)
      // preparing column k
      for i = k+1 to n-1
        buffer[i-k-1] ← aik ← aik / akk
    broadcast(alloc(k),buffer,n-k-1)
    for j = k+1 to n-1
      if (alloc(j) == q)
        // update of column j
        for i = k+1 to n-1
          aij ← aij - buffer[i-k-1] * akj
  }
}
Gaussian
Elimination
LU
Courtesy of Henri Casanova
138 / 235
Parallel
Algorithms
A. Legrand
Dealing with local indices
Matrix Vector
Product
Open MP
Version Assume that p divides n
First MPI
Version
Distributing
Each processor needs to store r=n/p columns and
Matrices
Second MPI
its local indices go from 0 to r-1
Version
Third MPI
After step k, only columns with indices greater
Version
Mixed than k will be used
Parallelism
Version Simple idea: use a local index, l, that everyone
Matrix
Multiplication initializes to 0
Stencil At step k, processor alloc(k) increases its local
Application
Principle index so that next time it will point to its next
Greedy Version
Reducing the
local column
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
139 / 235
Parallel
Algorithms
A. Legrand
LU-broadcast algorithm
Matrix Vector
Product
...
double a[0..n-1][0..r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (alloc(k) == q)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← a[i,l] / a[k,l]
    l ← l+1
  broadcast(alloc(k),buffer,n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] - buffer[i-k-1] * a[k,j]
}
Elimination
LU
Courtesy of Henri Casanova
140 / 235
Parallel
Algorithms What about the Alloc
A. Legrand
function?
Matrix Vector
Product
Open MP
Version
One thing we have left completely unspecified is
First MPI
Version
how to write the alloc function: how are columns
Distributing
Matrices
distributed among processors
Second MPI
Version
There are two complications:
Third MPI
Version
The amount of data to process varies throughout the
Mixed algorithm’s execution
Parallelism
Version
At step k, columns k+1 to n-1 are updated
Matrix
Fewer and fewer columns to update
Multiplication The amount of computation varies among columns
Stencil
e.g., column n-1 is updated more often than column 2
Application
Principle
Holding columns on the right of the matrix leads to much
Greedy Version more work
Reducing the
Granularity
There is a strong need for load balancing
LU Factorization All processes should do the same amount of work
Gaussian
Elimination
LU
Courtesy of Henri Casanova
141 / 235
Parallel
Algorithms
A. Legrand
Bad load balancing
Matrix Vector
Product
[Figure: columns distributed in contiguous blocks over P1, P2, P3, P4; at this point of the execution the left-most blocks are already done, and only the processor holding the current block is working on it — the others are idle]
Courtesy of Henri Casanova
142 / 235
Parallel
Algorithms
A. Legrand
Good Load Balancing?
Matrix Vector
Product
Open MP
Version
[Figure: with a cyclic distribution of the columns, the already-done columns are spread over all processors, so every processor is still working on some column]
Cyclic distribution
Courtesy of Henri Casanova
143 / 235
Parallel
Algorithms Proof that load balancing is
A. Legrand
good
Matrix Vector
Product
Open MP
Version
The computation consists of two types of operations
First MPI column preparations
Version
Distributing
matrix element updates
Matrices
Second MPI
There are many more updates than preparations, so we really
Version care about good balancing of the preparations
Third MPI
Version Consider column j
Mixed
Parallelism
Let’s count the number of updates performed by the processor
Version holding column j
Matrix
Multiplication
Column j is updated at steps k=0, ..., j-1
At step k, elements i=k+1, ..., n-1 are updated
  (indices start at 0)
Therefore, at step k, the update of column j entails n-k-1 updates
The total number of updates for column j in the execution is:
  sum_{k=0}^{j-1} (n-k-1) = j(n-1) - j(j-1)/2
Courtesy of Henri Casanova
144 / 235
Parallel
Algorithms Proof that load balancing is
A. Legrand
good
Matrix Vector
Product
Open MP
Version
Consider processor Pi, which holds columns lp+i for l=0, ..., n/p - 1
Processor Pi needs to perform this many updates:
  sum_{l=0}^{n/p-1} [ (lp+i)(n-1) - (lp+i)(lp+i-1)/2 ]
Turns out this can be computed:
  separate the terms
  use formulas for sums of integers and sums of squares
What it all boils down to is:
  n³/(3p) + O(n²)
This does not depend on i !!
Therefore it is (asymptotically) the same for all processors Pi
Therefore we have (asymptotically) perfect load balancing! Courtesy of Henri Casanova
145 / 235
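The load-balancing claim is also easy to check numerically: the short C program below counts, for an n×n matrix and a cyclic column distribution, how many element updates each processor performs. The values of n and PROCS are arbitrary; the counting rule (n-k-1 updates of column j at step k, column j owned by j mod p) is the one derived on the previous slide.

/* Count element updates per processor under a cyclic column distribution. */
#include <stdio.h>

enum { PROCS = 8 };

int main(void) {
    int n = 1024;
    long long updates[PROCS] = {0};
    for (int k = 0; k < n - 1; k++)          /* elimination steps          */
        for (int j = k + 1; j < n; j++)      /* columns updated at step k  */
            updates[j % PROCS] += n - k - 1; /* owner of column j is j%p   */
    for (int q = 0; q < PROCS; q++)
        printf("P%d: %lld updates\n", q, updates[q]);
    return 0;
}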
Parallel
Algorithms
A. Legrand
Load-balanced program
Matrix Vector
Product
...
double a[0..n-1][0..r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← a[i,l] / a[k,l]
    l ← l+1
  broadcast(alloc(k),buffer,n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] - buffer[i-k-1] * a[k,j]
}
Elimination
LU
Courtesy of Henri Casanova
146 / 235
Parallel
Algorithms
A. Legrand
Performance Analysis
Matrix Vector
Product
Open MP
Version
How long does this code take to run?
First MPI
Version This is not an easy question because there are
Distributing
Matrices many tasks and many communications
Second MPI
Version A little bit of analysis shows that the execution
Third MPI
Version time is the sum of three terms
Mixed
Parallelism
  n-1 communications: n L + (n²/2) b + O(1)
  n-1 column preparations: (n²/2) w’ + O(1)
  column updates: (n³/3p) w + O(n²)
Therefore, the execution time is ~ (n³/3p) w
Note that the sequential time is: (n³/3) w
Greedy Version
Reducing the
Therefore, we have perfect asymptotic efficiency!
Granularity
LU Factorization
This is good, but isn’t always the best in practice
Gaussian
Elimination
How can we improve this algorithm?
LU
Courtesy of Henri Casanova
147 / 235
Parallel
Algorithms
A. Legrand
Pipelining on the Ring
Matrix Vector
Product
Open MP
Version
So far, the algorithm has used a simple broadcast
Distributing
Matrices
Nothing was specific to being on a ring of
Second MPI
Version
processors and it’s portable
Third MPI
Version
in fact you could just write raw MPI that just looks like
Mixed our pseudo-code and have a very limited, inefficient for
Parallelism
Version small n, LU factorization that works only for some
Matrix
number of processors
Multiplication But it’s not efficient
Stencil
Application
The n-1 communication steps are not overlapped with
Principle computations
Greedy Version Therefore Amdahl’s law, etc.
Reducing the
Granularity Turns out that on a ring, with a cyclic distribution
LU Factorization
Gaussian
of the columns, one can interleave pieces of the
Elimination broadcast with the computation
LU
It almost looks like inserting the source code from the
Courtesy of Henri Casanova
148 / 235
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete)
Heterogeneous
Part V
Network
(General Case)
149 / 235
The Context: Distributed Heterogeneous Platforms
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network
Heterogeneous
Network How to embed a ring in a complex network [LRRV04].
(Complete)
Heterogeneous
Network
Sources of problems
(General Case)
I Heterogeneity of processors (computational power, memory, etc.)
I Heterogeneity of communications links.
I Irregularity of interconnection network.
150 / 235
Targeted Applications: Iterative Algorithms
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network
Heterogeneous
I A set of data (typically, a matrix)
Network
(Complete)
Heterogeneous
I Structure of the algorithms:
Network
(General Case)
1 Each processor performs a computation on its chunk of data
2 Each processor exchanges the “border” of its chunk of data with its neighbor processors
3 We go back to Step 1
Question: how can we efficiently execute such an algorithm on such
a platform?
151 / 235
The Questions
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete)
Heterogeneous
Network
I Which processors should be used ?
(General Case)
I What amount of data should we give them ?
I How do we cut the set of data ?
152 / 235
First of All, a Simplification: Slicing the Data
Parallel
Algorithms
A. Legrand
The Problem
I Data: a 2-D array
Fully
Homogeneous
Network
P1 P2
Heterogeneous
Network
(Complete)
Heterogeneous
Network
(General Case) P1 P2 P4 P3
P3 P4
I Unidimensional cutting into vertical slices
I Consequences:
1 Borders and neighbors are easily defined
2 Constant volume of data exchanged between neighbors: Dc
3 Processors are virtually organized into a ring
153 / 235
Notations
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network I Processors: P1 , ..., Pp
Heterogeneous
Network
(Complete)
I Processor Pi executes a unit task in a time wi
Heterogeneous
Network
(General Case)
I Overall amount of work Dw ;
  Share of Pi : αi .Dw , processed in a time αi .Dw .wi
  (αi > 0, Σj αj = 1)
154 / 235
Communications: 1-Port Model (Full-Duplex)
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete) A processor can:
Heterogeneous
Network
(General Case) I send at most one message at any time;
I receive at most one message at any time;
I send and receive a message simultaneously.
155 / 235
Objective
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network
1 Select q processors among p
Heterogeneous
Network
(Complete)
2 Order them into a ring
Heterogeneous
Network
3 Distribute the data among them
(General Case)
So as to minimize Tstep , the time needed to perform one step of the iterative algorithm
156 / 235
Special Hypotheses
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
1 There exists a communication link between any two processors
(Complete)
Heterogeneous
2 All links have the same capacity
Network
(General Case) (∃c, ∀i, j ci,j = c)
157 / 235
Consequences
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network
I Either the most powerful processor performs all the work, or all
Heterogeneous
Network
the processors participate
(Complete)
I If all processors participate, all end their share of work simultaneously
  (∃τ, αi .Dw .wi = τ , so 1 = Σi τ /(Dw .wi ))
I Time of the optimal solution:
  Tstep = min( Dw .wmin , Dw . (1 / Σi (1/wi )) + 2.Dc .c )
158 / 235
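A small numeric illustration of the formula above (the values of wi, Dw, Dc and c are invented): the sketch compares "fastest processor alone" against "everybody participates" and, for the latter case, prints the shares αi ∝ 1/wi that make all processors finish simultaneously.

/* Evaluate Tstep = min( Dw.wmin , Dw / (sum_i 1/wi) + 2.Dc.c ) for made-up values. */
#include <stdio.h>

int main(void) {
    double w[] = {1.0, 2.0, 4.0};                 /* cycle times of P1..P3 */
    int p = 3;
    double Dw = 1000.0, Dc = 10.0, c = 0.5;

    double inv_sum = 0.0, wmin = w[0];
    for (int i = 0; i < p; i++) {
        inv_sum += 1.0 / w[i];
        if (w[i] < wmin) wmin = w[i];
    }
    double t_alone = Dw * wmin;                   /* fastest processor alone  */
    double t_all   = Dw / inv_sum + 2.0 * Dc * c; /* all processors take part */
    double Tstep   = (t_alone < t_all) ? t_alone : t_all;

    printf("Tstep = %g\n", Tstep);
    for (int i = 0; i < p; i++)                   /* shares if all participate */
        printf("alpha_%d = %g\n", i + 1, (1.0 / w[i]) / inv_sum);
    return 0;
}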
Special hypothesis
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete)
Heterogeneous
1 There exists a communication link between any two processors
Network
(General Case)
159 / 235
All the Processors Participate: Study (1)
Parallel
Algorithms
A. Legrand
160 / 235
All the Processors Participate: Study (2)
Parallel
Algorithms
A. Legrand
The Problem
I All processors end simultaneously:
  Tstep = αi .Dw .wi + Dc .(ci,succ(i) + ci,pred(i) )
I Σ_{i=1..p} αi = 1  ⇒  Σ_{i=1..p} (Tstep − Dc .(ci,succ(i) + ci,pred(i) )) / (Dw .wi ) = 1. Thus
  Tstep / (Dw .wcumul ) = 1 + (Dc /Dw ) Σ_{i=1..p} (ci,succ(i) + ci,pred(i) ) / wi
  where wcumul = 1 / Σi (1/wi )
161 / 235
All the Processors Participate: Interpretation
Parallel
Algorithms
A. Legrand
The Problem
  Tstep / (Dw .wcumul ) = 1 + (Dc /Dw ) Σ_{i=1..p} (ci,succ(i) + ci,pred(i) ) / wi

Tstep is minimal when Σ_{i=1..p} (ci,succ(i) + ci,pred(i) ) / wi is minimal
NP-complete problem
162 / 235
All the Processors Participate: Linear Program
Parallel
Algorithms
A. Legrand
The Problem
Minimize Σ_{i=1..p} Σ_{j=1..p} di,j .xi,j ,
satisfying the (in)equations
(1) Σ_{j=1..p} xi,j = 1                 1 ≤ i ≤ p
(2) Σ_{i=1..p} xi,j = 1                 1 ≤ j ≤ p
(3) xi,j ∈ {0, 1}                       1 ≤ i, j ≤ p
(4) ui − uj + p.xi,j ≤ p − 1            2 ≤ i, j ≤ p, i ≠ j
(5) ui integer, ui ≥ 0                  2 ≤ i ≤ p
163 / 235
General Case: Linear program
Parallel
Algorithms
A. Legrand
Best ring made of q processors
Minimize T satisfying the (in)equations
(1) xi,j ∈ {0, 1}                                                  1 ≤ i, j ≤ p
(2) Σ_{i=1..p} xi,j ≤ 1                                            1 ≤ j ≤ p
(3) Σ_{i=1..p} Σ_{j=1..p} xi,j = q
(4) Σ_{i=1..p} xi,j = Σ_{i=1..p} xj,i                              1 ≤ j ≤ p
(5) Σ_{i=1..p} αi = 1
(6) αi ≤ Σ_{j=1..p} xi,j                                           1 ≤ i ≤ p
(7) αi .wi + (Dc /Dw ) Σ_{j=1..p} (xi,j ci,j + xj,i cj,i ) ≤ T     1 ≤ i ≤ p
(8) Σ_{i=1..p} yi = 1
(9) − p.yi − p.yj + ui − uj + q.xi,j ≤ q − 1                       1 ≤ i, j ≤ p, i ≠ j
(10) yi ∈ {0, 1}                                                   1 ≤ i ≤ p
(11) ui integer, ui ≥ 0                                            1 ≤ i ≤ p
164 / 235
Linear Programming
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete) I Problems with rational variables: can be solved in polynomial time
Heterogeneous
Network
(General Case)
(in the size of the problem).
I Problems with integer variables: solved in exponential time in the
worst case.
I No relaxation in rationals seems possible here. . .
165 / 235
And, in Practice ?
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous All processors participate. One can use a heuristic to solve the
Network
Heterogeneous
Network
traveling salesman problem (as Lin-Kernighan’s one)
(Complete)
Heterogeneous
No guarantee, but excellent results in practice.
Network
(General Case)
General case.
1 Exhaustive search: feasible up to about a dozen processors. . .
2 Greedy heuristic: initially we take the best pair of processors; for
a given ring we try to insert any unused processor in between any
pair of neighbor processors in the ring. . .
166 / 235
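Here is one possible C rendering of the greedy heuristic sketched above, for the complete-network case where the quantity to minimize is Σi (ci,succ(i) + ci,pred(i))/wi: start from the best pair, then repeatedly insert the still-unused processor at the position that keeps this sum smallest. The data (c, w) and all function names are illustrative assumptions, not the authors' implementation.

/* Greedy ring construction for a complete network: minimize
 * sum_i (c[i][succ(i)] + c[i][pred(i)]) / w[i].  Illustrative sketch. */
#include <stdio.h>
#include <float.h>

#define P 5

static double ring_cost(int ring[], int q, double c[P][P], double w[P]) {
    double cost = 0.0;
    for (int i = 0; i < q; i++) {
        int succ = ring[(i + 1) % q], pred = ring[(i - 1 + q) % q];
        cost += (c[ring[i]][succ] + c[ring[i]][pred]) / w[ring[i]];
    }
    return cost;
}

static void greedy_ring(double c[P][P], double w[P], int ring[P]) {
    int used[P] = {0};
    double best = DBL_MAX;
    for (int i = 0; i < P; i++)                 /* best starting pair */
        for (int j = i + 1; j < P; j++) {
            int pair[2] = {i, j};
            double cost = ring_cost(pair, 2, c, w);
            if (cost < best) { best = cost; ring[0] = i; ring[1] = j; }
        }
    used[ring[0]] = used[ring[1]] = 1;
    for (int q = 2; q < P; q++) {               /* insert one processor at a time */
        int best_k = -1, best_pos = 0;
        best = DBL_MAX;
        for (int k = 0; k < P; k++) {
            if (used[k]) continue;
            for (int pos = 0; pos < q; pos++) { /* try every gap in the ring */
                int trial[P];
                for (int t = 0; t < pos; t++) trial[t] = ring[t];
                trial[pos] = k;
                for (int t = pos; t < q; t++) trial[t + 1] = ring[t];
                double cost = ring_cost(trial, q + 1, c, w);
                if (cost < best) { best = cost; best_k = k; best_pos = pos; }
            }
        }
        for (int t = q; t > best_pos; t--) ring[t] = ring[t - 1];
        ring[best_pos] = best_k;
        used[best_k] = 1;
    }
}

int main(void) {
    /* made-up symmetric communication costs and cycle times */
    double c[P][P] = {{0,1,4,2,3},{1,0,1,5,2},{4,1,0,1,3},{2,5,1,0,1},{3,2,3,1,0}};
    double w[P] = {1.0, 1.0, 2.0, 1.5, 3.0};
    int ring[P];
    greedy_ring(c, w, ring);
    printf("ring:");
    for (int i = 0; i < P; i++) printf(" P%d", ring[i] + 1);
    printf("  cost = %g\n", ring_cost(ring, P, c, w));
    return 0;
}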
New Difficulty: Communication Links Sharing
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network P1
Heterogeneous
Network
P1 P2
(Complete) P2
Heterogeneous
Network
(General Case) P4
P3 P3 P4
Heterogeneous platform Virtual ring
167 / 235
New Notations
Parallel
Algorithms
A. Legrand
The Problem
Fully
I A set of communications links: e1 , ..., en
Homogeneous
Network
Heterogeneous
I Bandwidth of link em : bem
Network
(Complete) I There is a path Si from Pi to Psucc(i) in the network
Heterogeneous
Network
(General Case) I Si uses a fraction si,m of the bandwidth bem of link em
I Pi needs a time Dc . (1 / min_{em ∈ Si } si,m ) to send a message of size Dc to its successor
I Constraints on the bandwidth of em : Σ_{1≤i≤p} si,m ≤ bem
168 / 235
Toy Example: Choosing the Ring
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
P1 Q
Network
Heterogeneous
a R c d
Network P2 P3
(Complete)
Heterogeneous
h e
Network
(General Case)
g b f
P5 P4
169 / 235
Toy Example: Choosing the Paths
Parallel
Algorithms
A. Legrand
The Problem
P1 Q
Fully
Homogeneous
a R c d
Network P2 P3
Heterogeneous
Network h e
(Complete) g b f
Heterogeneous
Network
(General Case)
P5 P4
170 / 235
Toy Example: Bandwidth Sharing
Parallel
Algorithms
A. Legrand
From P1 to P2 we use links a and b: c1,2 = 1 / min(s1,a , s1,b ).
From P1 to P5 we use the link h: c1,5 = 1 / p1,h .
Set of all sharing constraints:
Link a: s1,a ≤ ba
Link b: s1,b + s4,b + p2,b + p5,b ≤ bb
Link c: s2,c ≤ bc
Link d: s2,d + s3,d + p3,d + p4,d ≤ bd
Link e: s3,e + p3,e + p4,e ≤ be
Link f : s4,f + p3,f + p5,f ≤ bf
Link g : s4,g + p2,g + p5,g ≤ bg
Link h: s5,h + p1,h + p2,h ≤ bh
171 / 235
Toy Example: Final Quadratic System
Parallel
Algorithms
A. Legrand
The Problem
Minimize max_{1≤i≤5} (αi .Dw .wi + Dc .(ci,i−1 + ci,i+1 )) under the constraints

Σ_{i=1..5} αi = 1
s1,a ≤ ba    s1,b + s4,b + p2,b + p5,b ≤ bb    s2,c ≤ bc
s2,d + s3,d + p3,d + p4,d ≤ bd    s3,e + p3,e + p4,e ≤ be    s4,f + p3,f + p5,f ≤ bf
s4,g + p2,g + p5,g ≤ bg    s5,h + p1,h + p2,h ≤ bh
s1,a .c1,2 ≥ 1    s1,b .c1,2 ≥ 1    p1,h .c1,5 ≥ 1
s2,c .c2,3 ≥ 1    s2,d .c2,3 ≥ 1    p2,b .c2,1 ≥ 1
p2,g .c2,1 ≥ 1    p2,h .c2,1 ≥ 1    s3,d .c3,4 ≥ 1
s3,e .c3,4 ≥ 1    p3,d .c3,2 ≥ 1    p3,e .c3,2 ≥ 1
p3,f .c3,2 ≥ 1    s4,f .c4,5 ≥ 1    s4,b .c4,5 ≥ 1
s4,g .c4,5 ≥ 1    p4,e .c4,3 ≥ 1    p4,d .c4,3 ≥ 1
s5,h .c5,1 ≥ 1    p5,g .c5,4 ≥ 1    p5,b .c5,4 ≥ 1
p5,f .c5,4 ≥ 1
172 / 235
Toy Example: Conclusion
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
The problem sums up to a quadratic system if
Network
Heterogeneous
1 The processors are selected;
Network
(Complete)
Heterogeneous
2 The processors are ordered into a ring;
Network
(General Case) 3 The communication paths between the processors are known.
In other words: a quadratic system if the ring is known.
173 / 235
And, in Practice ?
Parallel
Algorithms
A. Legrand
The Problem
Fully We adapt our greedy heuristic:
Homogeneous
Network
Heterogeneous
1 Initially: best pair of processors
Network
(Complete) 2 For each processor Pk (not already included in the ring)
Heterogeneous
Network
(General Case)
I For each pair (Pi , Pj ) of neighbors in the ring
1 We build the graph of the unused bandwidths
(Without considering the paths between Pi and Pj )
2 We compute the shortest paths (in terms of bandwidth) between
Pk and Pi and Pj
3 We evaluate the solution
3 We keep the best solution found at step 2 and we start again
+ refinements (max-min fairness, quadratic solving).
174 / 235
Is This Meaningful ?
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete)
I No guarantee, neither theoretical, nor practical
Heterogeneous
Network
(General Case)
I Simple solution:
1 we build the complete graph whose edges are labeled with the
bandwidths of the best communication paths
2 we apply the heuristic for complete graphs
3 we allocate the bandwidths
175 / 235
Example: an Actual Platform (Lyon)
Parallel
Algorithms
A. Legrand
The Problem
Fully
[Figure: Lyon platform — hosts moby, canaria and routlhpc behind a router and the backbone; popc0, myri0–myri2 and sci0–sci6 clustered behind hubs and a switch]
Topology
P0 P1 P2 P3 P4 P5 P6 P7 P8
0.0206 0.0206 0.0206 0.0206 0.0291 0.0206 0.0087 0.0206 0.0206
176 / 235
Results
Parallel
First heuristic (H1): building the ring without taking link sharing into account
Second heuristic (H2): taking link sharing into account (and with quadratic programming)

Lyon:
  Ratio Dc /Dw    H1              H2              Gain
  0.64            0.008738 (1)    0.008738 (1)    0%
  0.064           0.018837 (13)   0.006639 (14)   64.75%
  0.0064          0.003819 (13)   0.001975 (14)   48.28%
Strasbourg:
  Ratio Dc /Dw    H1              H2              Gain
  0.64            0.005825 (1)    0.005825 (1)    0%
  0.064           0.027919 (8)    0.004865 (6)    82.57%
  0.0064          0.007218 (13)   0.001608 (8)    77.72%
Table: Tstep /Dw for each heuristic on Lyon’s and Strasbourg’s platforms (the
numbers in parentheses show the size of the rings built).
177 / 235
Conclusion
Parallel
Algorithms
A. Legrand
The Problem
Fully
Homogeneous Even though this is a very basic application, it illustrates many diffi-
Network
Heterogeneous culties encountered when:
Network
(Complete)
Heterogeneous
I Processors have different characteristics
Network
(General Case) I Communications links have different characteristics
I There is an irregular interconnection network with complex band-
width sharing issues.
We need to use a realistic model of networks... Even though a more
realistic model leads to a much more complicated problem, this is worth
the effort as derived solutions are more efficient in practice.
178 / 235
Parallel
Algorithms
A. Legrand
Communications
Matrix
Multiplication
Outer Product
Grid Rocks!
Cannon
Part VI
Fox
Snyder
Data
Distribution Algorithms on a Grid
179 / 235
Outline
Parallel
Algorithms
A. Legrand
16 Communications
17 Matrix Multiplication
   Outer Product
   Grid Rocks!
   Cannon
   Fox
   Snyder
   Data Distribution
180 / 235
Parallel
Algorithms
A. Legrand
2-D Grid (Chapter 5)
Communications
Matrix
Multiplication
P0,0 P0,1 P0,2
Outer Product
Grid Rocks!
Cannon P1,0 P1,1 P1,2
Fox
Snyder
Data
Distribution P2,0 P2,1 P2,2
J: processor column
A. Legrand
2-D Torus
Communications
Matrix
Multiplication
Outer Product
P0,0 P0,1 P0,2
Grid Rocks!
Cannon
Fox P1,0 P1,1 P1,2
Snyder
Data
Distribution
P2,0 P2,1 P2,2
Matrix
When developing performance models we will
Multiplication
Outer Product
assume that a processor can do all three activities in
Grid Rocks! parallel
Cannon
Compute
Fox
Snyder
Send
Data
Distribution
Receive
Matrix
Multiplication Now that we have 4 (logical) links at each
processor, we need to decide how many
Outer Product
Grid Rocks!
Cannon
Fox
concurrent communications can happen at the
Snyder
Data
same time
Distribution There could be 4 sends and 4 receives in the
A. Legrand
So what?
Communications
Matrix
Multiplication We have many options:
Outer Product
Grid Rocks!
Grid or torus
Cannon
Fox
Mono- or bi-directional
Snyder
Data
Single-or multi-port
Distribution Half- or full-duplex
We’ll mostly use the torus, bi-directional, full-
duplex assumption
We’ll discuss the multi-port and the single-port
assumptions
As usual, it’s straightforward to modify the
performance analyses to match with whichever
assumption makes sense for the physical
platform
Courtesy of Henri Casanova
185 / 235
Parallel
Algorithms How realistic is a grid
A. Legrand
topology?
Communications
Matrix
Multiplication Some parallel computers are built as
physical grids (2-D or 3-D)
Outer Product
Grid Rocks!
Cannon
Fox
Example: IBM’s Blue Gene/L
Snyder
Data
Distribution
If the platform uses a switch with all-to-all
communication links, then a grid is
actually not a bad assumption
Although the full-duplex or multi-port
assumptions may not hold
We will see that even if the physical
platform is a shared single medium (e.g.,
a non-switched Ethernet), it’s sometimes
preferable to think of it as a grid when
developing algorithms! Courtesy of Henri Casanova
186 / 235
Parallel
Algorithms
A. Legrand
Communication on a Grid
Communications
Matrix
Multiplication As usual we won’t write MPI here, but
some pseudo code
Outer Product
Grid Rocks!
Cannon
A processor can call two functions to know where it is in the grid:
Snyder
Data
Distribution
My_Proc_Row()
My_Proc_Col()
A processor can find out how many
processors there are in total by:
Num_Procs()
Recall that here we assume we have a square
grid
In programming assignment we may need to
use a rectangular grid Courtesy of Henri Casanova
187 / 235
Parallel
Algorithms
A. Legrand
Communication on the Grid
Communications
Matrix
Multiplication We have two point-to-point functions
Outer Product
Grid Rocks! Send(dest, addr, L)
Cannon
Fox Recv(src, addr, L)
Snyder
Data
Distribution
We will see that it’s convenient to have
broadcast algorithms within processor
rows or processor columns
BroadcastRow(i, j, srcaddr, dstaddr, L)
BroadcastCol(i, j, srcaddr, dstaddr, L)
We assume that a call to these functions by a processor not on the relevant processor row or column simply returns immediately
How do we implement these broadcasts?
Courtesy of Henri Casanova
188 / 235
Parallel
Algorithms
A. Legrand
Row and Col Broadcasts?
Communications
Matrix
Multiplication If we have a torus
Outer Product
Grid Rocks!
If we have mono-directional links, then we can reuse the
Cannon broadcast that we developed on a ring of processors
Fox Either pipelined or not
Snyder
If we have bi-directional links AND a multi-port model, we can improve performance by going both ways simultaneously on the ring
We’ll see that the asymptotic performance is not changed
If we have a grid
If links are bi-directional then messages can be sent
both ways from the source processor
Either concurrently or not depending on whether we have a
one-port or multi-port model
If links are mono-directional, then we can’t implement
the broadcasts at all
Parallel
Algorithms
A. Legrand
Communications
Matrix 16 Communications
Multiplication
Outer Product
Grid Rocks!
Cannon
Fox 17 Matrix Multiplication
Snyder
Data Outer Product
Distribution
Grid Rocks!
Cannon
Fox
Snyder
Data Distribution
190 / 235
Parallel
Algorithms Matrix Multiplication on a
A. Legrand
Grid
Communications
Matrix
Multiplication Matrix multiplication on a Grid has been studied a
Outer Product
Grid Rocks!
lot because
Cannon Multiplying huge matrices fast is always
Fox
Snyder important in many, many fields
Data
Distribution Each year there is at least a new paper on
the topic
It’s a really good way to look at and learn
A. Legrand
2-D Matrix Distribution
Communications
Matrix
Multiplication
Outer Product
P0,0 P0,1 P0,2 P0,2
Grid Rocks!
Cannon
We denote by ai,j an
Fox P1,0 P1,1 P1,2 P1,2 element of the matrix
Snyder
Data
Distribution
We denote by Ai,j (or Aij)
P2,0 P2,1 P2,2 P2,2 the block of the matrix
allocated to Pi,j
P2,0 P2,1 P2,2 P2,2
C00 C01 C02 C03 A00 A01 A02 A03 B00 B01 B02 B03
C10 C11 C12 C13 A10 A11 A12 A13 B10 B11 B12 B13
C20 C21 C22 C23 A20 A21 A22 A23 B20 B21 B22 B23
C30 C31 C32 C33 A30 A31 A32 A33 B30 B31 B32 B33
Courtesy of Henri Casanova
192 / 235
Parallel
Algorithms
How do Matrices Get Distributed? (Sec. 4.7)
Data distribution can be completely ad-hoc
But what about when developing a library that will be used by others?
There are two main options:
Centralized
when calling a function (e.g., matrix multiplication)
the input data is available on a single "master" machine (perhaps in a file)
the input data must then be distributed among workers
the output data must be undistributed and returned to the "master" machine (perhaps in a file)
More natural/easy for the user
Allows the library to make data distribution decisions transparently to the user
Prohibitively expensive if one does sequences of operations
and one almost always does so
Distributed
when calling a function (e.g., matrix multiplication)
Assume that the input is already distributed
Leave the output distributed
May lead to having to "redistribute" data in between calls so that distributions match, which is harder for the user and may be costly as well
For instance one may want to change the block size between calls, or go from a non-cyclic to a cyclic distribution
Most current software adopts the distributed approach
more work for the user
more flexibility and control
We'll always assume that the data is magically already distributed by the user
Courtesy of Henri Casanova
193 / 235
Four Matrix Multiplication Algorithms
We'll look at four algorithms
Outer-Product
Cannon
Fox
Snyder
The Outer-Product Algorithm
Consider the "natural" sequential matrix multiplication algorithm
for i=0 to n-1
  for j=0 to n-1
    for k=0 to n-1
      ci,j += ai,k * bk,j
This algorithm is a sequence of inner-products (also called scalar products)
We have seen that we can switch loops around
Let’s consider this version
for k=0 to n-1
  for i=0 to n-1
    for j=0 to n-1
      ci,j += ai,k * bk,j
This algorithm is a sequence of outer-products!
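As a quick sanity check of this reformulation (my own sketch, using NumPy rather than the course's pseudo code), the k-outermost loop accumulates C as a sum of n rank-1 updates:

import numpy as np

n = 4
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))

# k-outermost loop: C is built as a sum of n outer products,
# column k of A times row k of B.
C = np.zeros((n, n))
for k in range(n):
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)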
The Outer-Product Algorithm
for k=0 to n-1
  for i=0 to n-1
    for j=0 to n-1
      ci,j += ai,k * bk,j
[Figure: at step k, C is updated with the outer product of the k-th column of A and the k-th row of B (shown for k=0 and k=1)]
The outer-product algorithm
Why do we care about switching the loops around to view the matrix multiplication as a sequence of outer products?
Because it makes it possible to design a very simple parallel algorithm on a grid of processors!
First step: view the algorithm in terms of the blocks assigned to the processors
for k=0 to q-1
  for i=0 to q-1
    for j=0 to q-1
      Ci,j += Ai,k * Bk,j
[Figure: the q x q block decompositions of C, A, and B (shown for q = 4)]
Courtesy of Henri Casanova
197 / 235
The Outer-Product Algorithm
for k=0 to q-1
  for i=0 to q-1
    for j=0 to q-1
      Ci,j += Ai,k * Bk,j
[Figure: the q x q block decompositions of C, A, and B laid out on the processor grid (shown for q = 4)]
Consider processor Pi,j at step k:
If k = j, then the processor already has the needed block of A
Otherwise, it needs to get it from Pi,k
If k = i, then the processor already has the needed block of B
Otherwise, it needs to get it from Pk,j
The Outer-Product Algorithm
Based on the previous statements, we can now see how the algorithm works
At step k
Processor Pi,k broadcasts its block of matrix A to all processors in processor row i
True for all i
Processor Pk,j broadcasts its block of matrix B to all processors in processor column j
True for all j
The Outer Product Algorithm
[Figure: at step k = 1, each processor Pi,1 broadcasts its block of A along processor row i, and each processor P1,j broadcasts its block of B along processor column j]
The Outer-Product Algorithm
// m = n/q
var A, B, C: array[0..m-1, 0..m-1] of real
var bufferA, bufferB: array[0..m-1, 0..m-1] of real
var myrow, mycol
myrow = My_Proc_Row()
mycol = My_Proc_Col()
for k = 0 to q-1
  // Broadcast A along rows
  for i = 0 to q-1
    BroadcastRow(i,k,A,bufferA,m*m)
  // Broadcast B along columns
  for j = 0 to q-1
    BroadcastCol(k,j,B,bufferB,m*m)
  // Multiply matrix blocks (assuming a convenient MatrixMultiplyAdd() function)
  if (myrow == k) and (mycol == k)
    MatrixMultiplyAdd(C,A,B,m)
  else if (myrow == k)
    MatrixMultiplyAdd(C,bufferA,B,m)
  else if (mycol == k)
    MatrixMultiplyAdd(C, A, bufferB, m)
  else
    MatrixMultiplyAdd(C, bufferA, bufferB, m)
Courtesy of Henri Casanova
201 / 235
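To check that the block version really computes A x B, here is a small sequential simulation (my own sketch; the row/column broadcasts are replaced by direct reads of the broadcast block, and NumPy stands in for MatrixMultiplyAdd):

import numpy as np

def outer_product_mm(A, B, q):
    """Simulate the outer-product algorithm on a q x q grid of blocks:
    at step k, the blocks A[i][k] (row broadcasts) and B[k][j] (column
    broadcasts) are combined by a local multiply-add on each processor."""
    n = A.shape[0]
    m = n // q
    Ablk = [[A[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
    Bblk = [[B[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
    Cblk = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]
    for k in range(q):
        for i in range(q):
            bufferA = Ablk[i][k]          # what BroadcastRow(i, k, ...) delivers
            for j in range(q):
                bufferB = Bblk[k][j]      # what BroadcastCol(k, j, ...) delivers
                Cblk[i][j] += bufferA @ bufferB
    return np.block(Cblk)

rng = np.random.default_rng(0)
A = rng.random((8, 8))
B = rng.random((8, 8))
assert np.allclose(outer_product_mm(A, B, 4), A @ B)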
Performance Analysis
The performance analysis is straightforward
With a one-port model:
The matrix multiplication at step k can occur in parallel with the broadcasts at step k+1
Both broadcasts happen in sequence
Therefore, the execution time is equal to:
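The formula itself did not survive in these notes; as a hedged sketch that merely restates the two observations above (the broadcasts of step k+1 overlap with the block multiplication of step k, and the two broadcasts are done in sequence), one could write it as

T(m,q) \approx 2\,T_{\mathrm{bcast}} + (q-1)\,\max\!\left(2\,T_{\mathrm{bcast}},\; m^3 w\right) + m^3 w

where T_bcast denotes the time of one row (or column) broadcast of an m x m block over q processors with the chosen one-port broadcast algorithm.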
So what?
On a ring platform we had already given an asymptotically optimal matrix multiplication algorithm in an earlier set of slides
So what's the big deal about another asymptotically optimal algorithm?
Once again, when n is huge, indeed we don't care
But communication costs are often non-negligible and do matter
When n is "moderate"
When w/b is low
It turns out that the grid topology is advantageous for reducing communication costs!
Courtesy of Henri Casanova
203 / 235
Ring vs. Grid
When we discussed the ring, we found that the communication cost of the matrix multiplication algorithm was: n²b
At each step, the algorithm sends n²/p matrix elements among neighboring processors
There are p steps
For the algorithm on a grid:
Each step involves 2 broadcasts of n²/p matrix elements
Assuming a one-port model, not to give an "unfair" advantage to the grid topology
Using a pipelined broadcast, this can be done in approximately the same time as sending n²/p matrix elements between neighboring processors on each ring (unless n is really small)
Therefore, at each step, the algorithm on a grid spends twice as much time communicating as the algorithm on a ring
But it does sqrt(p) times fewer steps!
Conclusion: the algorithm on a grid spends on the order of sqrt(p) times less time in communication than the algorithm on a ring
Courtesy of Henri Casanova
204 / 235
Grid vs. Ring
Why was the algorithm on a Grid much better?
Reason: More communication links can be used in parallel
Point-to-point communication replaced by broadcasts
Horizontal and vertical communications may be concurrent
More network links used at each step
Of course, this advantage isn't really an advantage if the underlying physical platform does not really look like a grid
But, it turns out that the 2-D distribution is inherently superior to the 1-D distribution, no matter what the underlying platform is!
Courtesy of Henri Casanova
205 / 235
Grid vs. Ring
On a ring
The algorithm communicates p matrix block rows that each contain n²/p elements, p times
Total number of elements communicated: pn²
On a grid
Each step, 2sqrt(p) blocks of n²/p elements are sent, each to sqrt(p)-1 processors, sqrt(p) times
Total number of elements communicated: 2sqrt(p)n²
Conclusion: the algorithm with a grid in mind inherently sends less data around than the algorithm on a ring
Using a 2-D data distribution would be better than using a 1-D data distribution even if the underlying platform were a non-switched Ethernet for instance!
Which is really 1 network link, and one may argue is closer to a ring (p comm links) than a grid (about 2p comm links)
Courtesy of Henri Casanova
206 / 235
Conclusion
Writing algorithms on a grid topology is a little bit more complicated than on a ring topology
But there is often a payoff in practice and grid topologies are very popular
2-D Matrix Distribution
We denote by ai,j an element of the matrix
We denote by Ai,j (or Aij) the block of the matrix allocated to Pi,j
[Figure: the q x q grid of processors Pi,j and the block decompositions of C, A, and B (shown for q = 4)]
Courtesy of Henri Casanova
208 / 235
The Cannon Algorithm
This is a very old algorithm
From the time of systolic arrays
Adapted to a 2-D grid
The algorithm starts with a redistribution of matrices A and B
Called "preskewing"
Then the matrices are multiplied
Then the matrices are redistributed again to match the initial distribution
Called "postskewing"
Courtesy of Henri Casanova
209 / 235
Cannon’s Preskewing
Matrix A: each block row of matrix A is shifted so that each processor in the first processor column holds a diagonal block of the matrix
Cannon’s Preskewing
Matrix B: each block column of matrix B is shifted so that each processor in the first processor row holds a diagonal block of the matrix
Cannon’s Computation
The algorithm proceeds in q steps
At each step each processor performs the multiplication of its block of A and B and adds the result to its block of C
Then blocks of A are shifted to the left and blocks of B are shifted upward
Blocks of C never move
Let's see it on a picture
Courtesy of Henri Casanova
212 / 235
Cannon’s Steps
[Figure: three snapshots of Cannon's algorithm on a 4 x 4 grid of blocks: the preskewed state followed by a local computation on proc (0,0); the shifts (A one step left, B one step up); another local computation on proc (0,0)]
Courtesy of Henri Casanova
213 / 235
The Algorithm
Participate in preskewing of A
Participate in preskewing of B
For k = 1 to q
  Local C = C + A*B
  Vertical shift of B
  Horizontal shift of A
Participate in postskewing of A
Participate in postskewing of B
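To make the data movement concrete, here is a small sequential simulation of Cannon's algorithm (my own sketch; shifts are modelled by re-indexing the block arrays rather than by actual messages, and postskewing is omitted since A and B are not reused afterwards):

import numpy as np

def cannon_multiply(A, B, q):
    """Simulate Cannon's algorithm on a q x q grid of blocks and
    return the resulting product matrix."""
    n = A.shape[0]
    m = n // q
    Ablk = [[A[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
    Bblk = [[B[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
    Cblk = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]

    # Preskewing: block row i of A shifted left by i,
    # block column j of B shifted up by j.
    Ablk = [[Ablk[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bblk = [[Bblk[(i + j) % q][j] for j in range(q)] for i in range(q)]

    for _ in range(q):
        # Local multiply-add on every "processor"; blocks of C never move.
        for i in range(q):
            for j in range(q):
                Cblk[i][j] += Ablk[i][j] @ Bblk[i][j]
        # Horizontal shift of A (left) and vertical shift of B (up).
        Ablk = [[Ablk[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    return np.block(Cblk)

rng = np.random.default_rng(0)
A = rng.random((8, 8))
B = rng.random((8, 8))
assert np.allclose(cannon_multiply(A, B, 4), A @ B)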
Performance Analysis
Let's do a simple performance analysis with a 4-port model
The 1-port model is typically more complicated
Symbols
n: size of the matrix
q x q: size of the processor grid
m = n/q
L: communication start-up cost
w: time to do a basic computation (+= . * .)
b: time to communicate a matrix element
T(m,q) = Tpreskew + Tcompute + Tpostskew
Courtesy of Henri Casanova
215 / 235
Pre/Post-skewing times
Let's consider the horizontal shift
Each row must be shifted so that the diagonal block ends up on the first column
On a mono-directional ring:
The last row needs to be shifted (q-1) times
All rows can be shifted in parallel
Total time needed: (q-1)(L + m²b)
On a bi-directional ring, a row can be shifted left or right, depending on which way is shortest!
A row is shifted at most floor(q/2) times
All rows can be shifted in parallel
Total time needed: floor(q/2)(L + m²b)
Because of the 4-port assumption, preskewing of A and B can occur in parallel (horizontal and vertical shifts do not interfere)
Therefore: Tpreskew = Tpostskew = floor(q/2)(L + m²b)
Courtesy of Henri Casanova
216 / 235
Time for each step
At each step, each processor computes an m x m matrix multiplication
Compute time: m³w
At each step, each processor also sends and receives an m x m block (the shifts of A and B), which under the 4-port assumption takes time L + m²b
Cannon Performance Model
T(m,q) = 2·floor(q/2)(L + m²b) + q·max(m³w, L + m²b)
This performance model is easily adapted
If one assumes mono-directional links, then the "floor(q/2)" above becomes "(q-1)"
If one assumes 1-port, there is a factor 2 added in front of communication terms
If one assumes no overlap of communication and computation at a given step, the max(m³w, L + m²b) term becomes a sum
Courtesy of Henri Casanova
218 / 235
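Since these adaptations are purely mechanical, a tiny helper (my own sketch, directly transcribing the formula and the remarks above; the "no overlap" variant is my reading of the truncated last bullet) can evaluate the model under each variant:

from math import floor

def cannon_time(m, q, L, b, w, bidirectional=True, one_port=False, overlap=True):
    """Evaluate the Cannon performance model T(m,q) under the variants
    discussed above: mono- vs bi-directional links, 1-port vs 4-port,
    and with or without communication/computation overlap."""
    shift_count = floor(q / 2) if bidirectional else (q - 1)
    comm = L + m * m * b
    if one_port:
        comm *= 2                      # factor 2 on communication terms
    compute = m ** 3 * w
    step = max(compute, comm) if overlap else compute + comm
    return 2 * shift_count * comm + q * step

# Example: a 1000 x 1000 matrix on an 8 x 8 grid (hypothetical parameter values).
print(cannon_time(m=125, q=8, L=1e-4, b=1e-8, w=1e-9))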
The Fox Algorithm
This algorithm was originally developed to run on a hypercube topology
But in fact it uses a grid, embedded in the hypercube
This algorithm requires no pre- or post-skewing
It relies on horizontal broadcasts of the diagonals of matrix A and on vertical shifts of matrix B
Sometimes called the "multiply-broadcast-roll" algorithm
Let's see it on a picture
Although it's a bit awkward to draw because of …
Courtesy of Henri Casanova
219 / 235
Execution Steps...
[Figure: initial state on the 4 x 4 grid of blocks; broadcast of A's 1st diagonal along the rows (stored in a separate buffer); local computation]
Courtesy of Henri Casanova
220 / 235
Execution Steps...
[Figure: shift of B (one step up); broadcast of A's 2nd diagonal along the rows (stored in a separate buffer); local computation]
Courtesy of Henri Casanova
221 / 235
Fox’s Algorithm
// No initial data movement
for k = 1 to q in parallel
  Broadcast A's kth diagonal
  Local C = C + A*B
  Vertical shift of B
// No final data movement
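As with Cannon, a short sequential simulation (my own sketch; the row broadcast of the k-th diagonal is modelled by every processor in row i reading the same block) confirms that the multiply-broadcast-roll steps compute the right product:

import numpy as np

def fox_multiply(A, B, q):
    """Simulate Fox's algorithm on a q x q grid of blocks:
    no pre/post-skewing, horizontal broadcasts of A's diagonals,
    vertical shifts (rolls) of B."""
    n = A.shape[0]
    m = n // q
    Ablk = [[A[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
    Bblk = [[B[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
    Cblk = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]

    for k in range(q):
        for i in range(q):
            # Broadcast of A's k-th diagonal block along processor row i.
            bufferA = Ablk[i][(i + k) % q]
            for j in range(q):
                Cblk[i][j] += bufferA @ Bblk[i][j]
        # Vertical shift (roll) of B: every block moves up one processor row.
        Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    return np.block(Cblk)

rng = np.random.default_rng(1)
A = rng.random((8, 8))
B = rng.random((8, 8))
assert np.allclose(fox_multiply(A, B, 4), A @ B)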
Performance Analysis
You'll have to do it in a homework assignment
Write pseudo-code of the algorithm in more detail
Write the performance analysis
Snyder’s Algorithm (1992)
More complex than Cannon's or Fox's
First transposes matrix B
Uses reduction operations (sums) on the rows of matrix C
Shifts matrix B
Execution Steps...
[Figure: initial state on the 4 x 4 grid of blocks; transpose of B (processor (i,j) now holds Bj,i); local computation]
Courtesy of Henri Casanova
225 / 235
Execution Steps...
[Figure: shift of B (one step up); global sum on the rows of C; local computation]
Courtesy of Henri Casanova
226 / 235
Execution Steps...
[Figure: another shift of B (one step up); global sum on the rows of C; local computation]
Courtesy of Henri Casanova
227 / 235
The Algorithm
var A, B, C: array[0..m-1][0..m-1] of real
var bufferC: array[0..m-1][0..m-1] of real

Transpose B
MatrixMultiplyAdd(bufferC, A, B, m)
Vertical shift of B
For k = 1 to q-1
  Global sum of bufferC on proc rows into Ci,(i+k-1)%q
  MatrixMultiplyAdd(bufferC, A, B, m)
  Vertical shift of B
Global sum of bufferC on proc rows into Ci,(i+q-1)%q
Transpose B
Courtesy of Henri Casanova
228 / 235
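Here too, a compressed sequential simulation (my own sketch; it merges each multiply with the row reduction that consumes it, and models the transpose and shifts by re-indexing blocks) shows which block of C each row reduction produces:

import numpy as np

def snyder_multiply(A, B, q):
    """Simulate Snyder's algorithm on a q x q grid of blocks:
    transpose B, then alternate local multiplications, row-wise
    global sums into C, and vertical shifts of B."""
    n = A.shape[0]
    m = n // q
    Ablk = [[A[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
    # After the transpose, "processor" (i, j) holds block B[j][i].
    Bblk = [[B[j*m:(j+1)*m, i*m:(i+1)*m] for j in range(q)] for i in range(q)]
    Cblk = [[None] * q for _ in range(q)]

    for k in range(1, q + 1):
        for i in range(q):
            # Local multiplications followed by a global sum over processor
            # row i, landing in block C[i][(i+k-1) % q].
            Cblk[i][(i + k - 1) % q] = sum(Ablk[i][j] @ Bblk[i][j] for j in range(q))
        # Vertical shift of B: every block moves up one processor row.
        Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    return np.block(Cblk)

rng = np.random.default_rng(2)
A = rng.random((8, 8))
B = rng.random((8, 8))
assert np.allclose(snyder_multiply(A, B, 4), A @ B)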
Performance Analysis
The performance analysis isn't fundamentally different from what we've done so far
But it's a bit cumbersome
See the textbook
in particular the description of the matrix transposition (see also Exercise 5.1)
Which Data Distribution?
So far we've seen:
Block Distributions
1-D Distributions
2-D Distributions
Cyclic Distributions
One may wonder what a good choice of data distribution is
Many people argue that a good "Swiss Army knife" is the "2-D block cyclic distribution"
Courtesy of Henri Casanova
230 / 235
The 2-D block cyclic distribution
Goal: try to have all the advantages of both the horizontal and the vertical 1-D block cyclic distribution
Works whichever way the computation "progresses"
left-to-right, top-to-bottom, wavefront, etc.
Consider a number of processors p = r*c
arranged in a r x c matrix
Consider a 2-D matrix of size N x N
Consider a block size b (which divides N)
Courtesy of Henri Casanova
231 / 235
The 2-D block cyclic distribution
[Figure: the processors arranged as an r x c = 2 x 3 grid (P0 … P5), and the block size b marked on the matrix]
Courtesy of Henri Casanova
232 / 235
The 2-D block cyclic distribution
[Figure: the first b x b blocks of the N x N matrix being dealt out to the processors of the 2 x 3 grid in a cyclic fashion]
Courtesy of Henri Casanova
233 / 235
The 2-D block cyclic distribution
[Figure: the complete 2-D block cyclic distribution of the N x N matrix over the 2 x 3 processor grid, with block size b]
Slight load imbalance
Becomes negligible with many blocks
Index computations had better be implemented in separate functions (a small sketch follows below)
Also: functions that tell a process who its neighbors are
Overall, requires a whole infrastructure, but many think you can't go wrong with this distribution
Courtesy of Henri Casanova
234 / 235
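For illustration, here is one way such index functions could look (my own sketch, assuming the parameters of the previous slides: an r x c processor grid, block size b, and element indices starting at 0):

def bc_owner(i, j, b, r, c):
    """Processor grid coordinates of matrix element (i, j) under a 2-D
    block cyclic distribution with block size b on an r x c grid."""
    return (i // b) % r, (j // b) % c

def bc_local(i, j, b, r, c):
    """Local coordinates of element (i, j) on its owner, given as
    ((local block row, offset in block), (local block column, offset))."""
    return ((i // b) // r, i % b), ((j // b) // c, j % b)

# With b = 2 on a 2 x 3 grid, element (5, 9) sits in block row 2 and block
# column 4, hence on processor (2 % 2, 4 % 3) = (0, 1).
assert bc_owner(5, 9, 2, 2, 3) == (0, 1)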
Conclusion
All the algorithms we have seen in the semester can be implemented on a 2-D block cyclic distribution
The code ends up much more complicated
Journal of Parallel and Distributed Computing, 44(1):71–79, 1997.
D. Culler, R. Karp, D. Patterson, A. Sahay, E. Santos, K. Schauser, R. Subramonian, and T. von Eicken. LogP: a practical model of parallel computation. Communications of the ACM, 39(11):78–85, 1996.
R. W. Hockney. The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing, 20:389–398, 1994.
B. Hong and V. K. Prasanna. Distributed adaptive task allocation in heterogeneous computing environments to maximize throughput. In International Parallel and Distributed Processing Symposium IPDPS'2004. IEEE Computer Society Press, 2004.
T. Kielmann, H. E. Bal, and K. Verstoep. Fast measurement of LogP parameters for message passing platforms. In Proceedings of the 15th IPDPS, Workshops on Parallel and Distributed Processing, 2000.
Steven H. Low. A duality model of TCP and queue management algorithms. IEEE/ACM Transactions on Networking, 2003.
Dong Lu, Yi Qiao, Peter A. Dinda, and Fabián E. Bustamante. Characterizing and predicting TCP throughput on the wide area network. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS'05), 2005.
Arnaud Legrand, Hélène Renard, Yves Robert, and Frédéric Vivien. Mapping and load-balancing iterative computations on heterogeneous clusters with shared links. IEEE Trans. Parallel Distributed Systems, 15(6):546–558, 2004.
Maxime Martinasso. Analyse et modélisation des communications concurrentes dans les réseaux haute performance. PhD thesis, Université Joseph Fourier de Grenoble, 2007.
Laurent Massoulié and James Roberts. Bandwidth sharing: Objectives and algorithms. In INFOCOM (3), pages 1395–1403, 1999.
Loris Marchal, Yang Yang, Henri Casanova, and Yves Robert. Steady-state scheduling of multiple divisible load applications on wide-area distributed computing platforms. Int. Journal of High Performance Computing Applications, (3), 2006.
Frédéric Wagner. Redistribution de données à travers un réseau haut débit. PhD thesis, Université Henri Poincaré Nancy 1, 2005.
235 / 235