Lecture 2
Parallel algorithms
Performance measure and load balancing
1 Performance tools
Definition
Let
• $t_s$: sequential execution time;
• $t_p(n)$: execution time on $n$ computing units.
Speed-up is defined as:
$$S(n) = \frac{t_s}{t_p(n)} \qquad (1)$$
Remark
The sequential algorithm is often different from the parallel algorithm, and in this case the speed-up measurement is not straightforward. In particular, the following questions must be asked, among others:
• Is the sequential algorithm optimal in complexity?
• Is the sequential algorithm well optimized?
• Does the sequential algorithm make the best use of the cache memory?
This law (Gustafson's law) is useful to find a reasonable number of computing cores to use for an application. With $t_s$ the sequential fraction and $t_p$ the parallelizable fraction of the execution time, normalized so that $t_s + t_p = 1$:
$$S(n) = \frac{t_s + n\,t_p}{t_s + t_p} = n + \frac{(1 - n)\,t_s}{t_s + t_p} = n + (1 - n)\,t_s$$
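For instance, assuming a sequential fraction $t_s = 0.1$ (hence $t_p = 0.9$) on $n = 16$ computing units:
$$S(16) = 16 + (1 - 16) \times 0.1 = 16 - 1.5 = 14.5$$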
Definition
For a parallel program, scalability is the behaviour of the speed-up as the number of processes and/or the amount of input data increases.
Ratio between computing intensity and quantity of data exchanged between processes
Definition
Load balancing: all processes execute a given computation section of the code in the same duration.
• The speed-up is badly degraded if some parts of the code are far from being load balanced;
• Example 1: A function takes $t$ seconds for half of the processes, and $t/2$ seconds for the other half. The maximal speed-up for this function will be:
$$S(n) = \frac{\frac{n}{2}\,t + \frac{n}{2}\,\frac{t}{2}}{t} = \frac{3}{4}\,n$$
• Example 2: A function takes $t/2$ seconds for $n-1$ processes, and $t$ seconds for one process. The maximal speed-up for this function will be:
$$S(n) = \frac{(n-1)\,\frac{t}{2} + t}{t} = \frac{n-1}{2} + 1 = \frac{n+1}{2}$$
Remark: The longer a badly load-balanced function takes to execute, the greater the penalty. Don't worry about load balancing for functions taking a very small time to execute.
2 Embarrassingly parallel algorithm
Property
• In a distributed parallel context, no data needs to be exchanged between processes to compute the results;
• In a shared-memory environment, parallelization is straightforward, but beware of memory-bound computations;
• In a distributed environment, the memory-bound limitation is not an issue;
• If the data are contiguous and the algorithm is vectorizable, a GPGPU can be ideal for performance.
Vector addition
$$w = u + v, \qquad u,\, v,\, w \in \mathbb{R}^n$$
Ideas
• For load balancing, scatter the vectors in equal parts among the threads or processes
• Each process/thread computes a part of the vector addition
• In distributed memory, the result is scattered among processes !
Some properties
• Memory accesses and computing operations have the same complexity: on shared memory, the memory bandwidth limits the performance;
• On distributed memory, each process uses its own physical memory and no data needs to be exchanged: the speed-up may be linear in the number of processes (if the computing intensity is sufficient), as sketched below.
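As an illustration, here is a minimal MPI sketch of the distributed vector addition (not the course's reference code; the dimension, the values, and the divisibility of n by the number of processes are arbitrary assumptions):

```cpp
// Minimal sketch: distributed vector addition w = u + v with MPI.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nbp;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nbp);

    const int n = 1000000;      // global dimension, assumed divisible by nbp
    const int n_loc = n / nbp;  // size of the slice owned by this process

    // Each process owns contiguous slices of u and v (arbitrary values here).
    std::vector<double> u_loc(n_loc, 1.0), v_loc(n_loc, 2.0), w_loc(n_loc);

    // Purely local loop: no data is exchanged, the result stays scattered.
    for (int i = 0; i < n_loc; ++i)
        w_loc[i] = u_loc[i] + v_loc[i];

    MPI_Finalize();
    return 0;
}
```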
Block-diagonal matrix product: let
$$A = \begin{pmatrix} A_{11} & 0 & \cdots & 0 \\ 0 & A_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & A_{nn} \end{pmatrix}, \qquad B = \begin{pmatrix} B_{11} & 0 & \cdots & 0 \\ 0 & B_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & B_{nn} \end{pmatrix}.$$
Problematic
The product of two block-diagonal matrices is close to the vector addition case, but:
• The dimensions $d_i$ of the diagonal blocks are inhomogeneous, and for each diagonal block the computation complexity is $d_i^3$;
• How should the diagonal blocks be distributed among the processes to obtain a nearly optimal load balancing?
Remark: All processes compute the distribution of the diagonal blocks. It is better to have all processes do the same computation than to have process 0 compute the distribution and send it to the other processes (a greedy sketch is given below).
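A minimal sketch of such a distribution, assuming a greedy heuristic (largest block first, assigned to the least-loaded process) and the $d_i^3$ cost model from above; distribute_blocks is a hypothetical helper, not part of the course material:

```cpp
// Greedy distribution of the diagonal blocks among nbp processes.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// d[i] is the dimension of diagonal block i; returns owner[i] = owning rank.
std::vector<int> distribute_blocks(const std::vector<int>& d, int nbp) {
    auto cost = [](std::int64_t di) { return di * di * di; };  // d_i^3 model
    std::vector<std::size_t> order(d.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // Consider the blocks by decreasing cost.
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return cost(d[a]) > cost(d[b]); });
    std::vector<std::int64_t> load(nbp, 0);
    std::vector<int> owner(d.size());
    for (std::size_t i : order) {
        // Give the block to the currently least-loaded process.
        int p = static_cast<int>(std::min_element(load.begin(), load.end()) - load.begin());
        owner[i] = p;
        load[p] += cost(d[i]);
    }
    return owner;  // identical on every process: deterministic computation
}
```

Because the computation is deterministic, every process obtains the same `owner` array without any communication, which is exactly the point of the remark above.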
Some definitions
The Syracuse series of an integer $u_0$ is defined by $u_{n+1} = u_n/2$ if $u_n$ is even, and $u_{n+1} = 3u_n + 1$ if $u_n$ is odd.
• Length of the flight: number of iterations for the series to reach the value 1;
• Height of the flight: maximal value reached by the series.
The goal of the program: compute the length and the height of the flight for a lot of (odd) values of $u_0$.
Problematic
• Each process computes the length and the height for a subset of initial values $u_0$;
• The computation intensity depends on the length of each Syracuse series;
• It is impossible to know the computation complexity of a series prior to computing it;
• The problem is not naturally well balanced;
⇒ Use a dynamic algorithm on a "root" process (the "master" process) to distribute the series among the other processes (the "slaves"), as sketched below.
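A minimal master/slave sketch with MPI, assuming a simple protocol where a slave sends its last result to request the next value of $u_0$ (the tags, message layout, and bounds are arbitrary choices, not a fixed protocol):

```cpp
// Dynamic master/slave distribution of Syracuse series with MPI.
#include <mpi.h>
#include <cstdint>

// Length and height of the flight of one (odd) initial value u0.
static void flight(std::int64_t u0, std::int64_t res[2]) {
    std::int64_t u = u0, length = 0, height = u0;
    while (u != 1) {
        u = (u % 2 == 0) ? u / 2 : 3 * u + 1;
        if (u > height) height = u;
        ++length;
    }
    res[0] = length; res[1] = height;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nbp;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nbp);
    const std::int64_t first = 3, last = 100001;  // odd values in [3, last)

    if (rank == 0) {   // master: hands out one u0 at a time, on demand
        std::int64_t next = first, res[2];
        int busy = nbp - 1;
        MPI_Status st;
        while (busy > 0) {
            MPI_Recv(res, 2, MPI_INT64_T, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            std::int64_t work = (next < last) ? next : -1;  // -1 = stop signal
            MPI_Send(&work, 1, MPI_INT64_T, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
            if (work >= 0) next += 2; else --busy;
        }
    } else {           // slave: reports its result and asks for more work
        std::int64_t u0, res[2] = {0, 0};
        while (true) {
            MPI_Send(res, 2, MPI_INT64_T, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&u0, 1, MPI_INT64_T, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (u0 < 0) break;
            flight(u0, res);
        }
    }
    MPI_Finalize();
    return 0;
}
```

The fast slaves naturally come back for work more often than the slow ones, which is what rebalances the load at run time.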
3 Nearly embarrassingly parallel algorithm
Definition
Independent computation for each process with a final communication to finalize the computation.
Examples
• Dot product of two vectors in $\mathbb{R}^n$ (see the sketch below);
• Compute an integral ;
• Matrix-vector product ;
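For instance, a minimal sketch of the distributed dot product, assuming each process already holds matching local slices of the two vectors:

```cpp
// Local partial sums plus one final MPI_Allreduce: the single collective
// call is the "final communication" of the definition above.
#include <mpi.h>
#include <vector>

double dot(const std::vector<double>& x_loc, const std::vector<double>& y_loc) {
    double s_loc = 0.0;                 // local partial sum
    for (std::size_t i = 0; i < x_loc.size(); ++i)
        s_loc += x_loc[i] * y_loc[i];
    double s = 0.0;
    MPI_Allreduce(&s_loc, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return s;                           // full dot product, on every process
}
```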
Integral computation
• Integral computation based on Gauss quadrature formulae:
$$\int_a^b f(x)\,dx \approx \sum_{i=1}^{N_g} \omega_i\, f(g_i) \qquad (2)$$
• The interval is split into $N$ sub-intervals:
$$I = \int_a^b f(x)\,dx = \sum_{i=1}^{N} \int_{a_i}^{b_i} f(x)\,dx = \sum_{i=1}^{N} I_i \quad \text{where } a_1 = a,\ a_{N+1} = b_N = b \text{ and } a_i < b_i = a_{i+1}$$
Main ideas
• Scatter the sub-intervals among the processes; each process computes a partial sum over its subset $P$ of sub-intervals:
$$S_p = \sum_{[a_i, b_i] \in P} \int_{a_i}^{b_i} f(x)\,dx$$
• A final sum reduction yields the result:
$$S = \sum_{p=1}^{\mathrm{nbp}} S_p$$
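A minimal sketch of this scheme, assuming a cyclic distribution of the sub-intervals and a midpoint rule standing in for the Gauss quadrature (2); the integrand and bounds are arbitrary choices:

```cpp
// Each process integrates its own sub-intervals, then a sum reduction
// assembles S on process 0.
#include <mpi.h>
#include <cmath>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nbp;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nbp);

    const double a = 0.0, b = std::acos(-1.0);  // integrate sin over [0, pi]
    const int N = 1000;                          // number of sub-intervals
    const double h = (b - a) / N;

    double S_p = 0.0;  // partial sum over the sub-intervals owned by this rank
    for (int i = rank; i < N; i += nbp)
        S_p += h * std::sin(a + (i + 0.5) * h);

    double S = 0.0;    // final sum reduction on process 0 (expected: 2)
    MPI_Reduce(&S_p, &S, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```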
$$v = A\,u \in \mathbb{R}^n \qquad \text{where} \qquad v_i = \sum_{j=1}^{m} A_{ij}\, u_j$$
Row-wise decomposition: let
$$A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_I \\ \vdots \\ A_N \end{pmatrix} \qquad \text{where } \forall I \in \{1, 2, \ldots, N\},\ A_I \in \mathbb{R}^{\frac{n}{N} \times m}.$$
Algorithm
• Each process has some rows of A and all of u;
• Each process computes a part of v: process I computes $V_I = A_I\,u \in \mathbb{R}^{\frac{n}{N}}$;
• To compute another matrix-vector product with the new vector, we need to gather the vector on all processes (only necessary for the distributed parallel algorithm), as sketched below.
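A minimal MPI sketch of this row-wise algorithm, assuming n and m are divisible by the number of processes (sizes and values are arbitrary):

```cpp
// Matrix-vector product with A distributed by row blocks.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nbp;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nbp);

    const int n = 512, m = 512, n_loc = n / nbp;
    std::vector<double> A_loc(n_loc * m, 1.0);  // local row block A_I
    std::vector<double> u(m, 1.0);              // every process holds all of u
    std::vector<double> v_loc(n_loc, 0.0), v(n);

    // Local product: V_I = A_I . u
    for (int i = 0; i < n_loc; ++i)
        for (int j = 0; j < m; ++j)
            v_loc[i] += A_loc[i * m + j] * u[j];

    // Gather the pieces of v on all processes (needed to iterate the product).
    MPI_Allgather(v_loc.data(), n_loc, MPI_DOUBLE,
                  v.data(), n_loc, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```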
Column-wise decomposition: let
$$A = (A_1 \,|\, A_2 \,|\, \cdots \,|\, A_I \,|\, \cdots \,|\, A_N) \quad \text{and} \quad u = \begin{pmatrix} U_1 \\ U_2 \\ \vdots \\ U_I \\ \vdots \\ U_N \end{pmatrix} \quad \text{where } \forall I \in \{1, 2, \ldots, N\},\ A_I \in \mathbb{R}^{n \times \frac{m}{N}} \text{ and } U_I \in \mathbb{R}^{\frac{m}{N}}.$$
Algorithm
• Each process has some columns of A and some rows of u;
• Each process computes a partial sum for v: process I computes $V_I = A_I\,U_I \in \mathbb{R}^n$;
• Finally, a sum reduction is done to get the final result: $v = \sum_{I=1}^{N} V_I$ (see the sketch below).
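A minimal MPI sketch of this column-wise variant, under the same divisibility assumptions; the final MPI_Allreduce implements the sum reduction:

```cpp
// Matrix-vector product with A distributed by column blocks: each process
// computes a full-length partial vector V_I = A_I . U_I.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nbp;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nbp);

    const int n = 512, m = 512, m_loc = m / nbp;
    std::vector<double> A_loc(n * m_loc, 1.0);  // local column block A_I
    std::vector<double> u_loc(m_loc, 1.0);      // local rows U_I of u
    std::vector<double> v_loc(n, 0.0), v(n);

    // Partial sum over the columns owned by this process.
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m_loc; ++j)
            v_loc[i] += A_loc[i * m_loc + j] * u_loc[j];

    // v = sum over I of V_I, available on every process.
    MPI_Allreduce(v_loc.data(), v.data(), n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```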
$$\begin{cases} z_0 = 0, \\ z_{n+1} = z_n^2 + c, \end{cases} \qquad \text{where } c \in \mathbb{C} \text{ is chosen.}$$
Figure – Mandelbrot (left) and Buddha (right) sets
Algorithm
• Draw N random values of c in the disk D for which the corresponding series diverges;
• Compute the orbit of this series until divergence and increment the intensity of the pixel representing each value of the orbit (a sketch is given below).
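A sequential sketch of the accumulation loop (the image size, sample count, escape radius, and bounds are arbitrary choices); since each value of c is independent, the samples can be scattered among processes like the other embarrassingly parallel examples:

```cpp
// Accumulate orbit intensities for randomly drawn values of c.
#include <complex>
#include <random>
#include <vector>

int main() {
    const int W = 800, H = 800, N = 100000, max_iter = 1000;
    std::vector<unsigned> intensity(W * H, 0);
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dis(-2.0, 2.0);

    for (int s = 0; s < N; ++s) {
        const std::complex<double> c(dis(gen), dis(gen));
        // First pass: keep c only if the series diverges within max_iter.
        std::complex<double> z(0.0, 0.0);
        int it = 0;
        while (it < max_iter && std::norm(z) <= 4.0) { z = z * z + c; ++it; }
        if (it == max_iter) continue;   // did not diverge: not drawn

        // Second pass: increment the pixel of each value of the orbit.
        z = std::complex<double>(0.0, 0.0);
        while (std::norm(z) <= 4.0) {
            z = z * z + c;
            const int px = static_cast<int>((z.real() + 2.0) / 4.0 * W);
            const int py = static_cast<int>((z.imag() + 2.0) / 4.0 * H);
            if (px >= 0 && px < W && py >= 0 && py < H)
                ++intensity[py * W + px];
        }
    }
    return 0;
}
```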