
Parallel performance measure and embarrassingly parallel algorithms

Performance measure and load balancing

Xavier JUVIGNY, SN2A, DAAA, ONERA


[email protected]
Course: Parallel Programming
ONERA, DAAA
- September 27th 2023 -
This document and the information contained herein are proprietary information of ONERA and shall not be disclosed or reproduced without the prior written authorization of ONERA.
Table of contents

1 Performance tools

2 Embarrassingly parallel algorithms

3 Nearly embarrassingly parallel algorithm



Overview

1 Performance tools

2 Embarrassingly parallel algorithms

3 Nearly embarrassingly parallel algorithm



Speed-up

Definition
Let
• t_s : sequential execution time ;
• t_p(n) : execution time on n computing units.
Speed-up is defined as:

S(n) = \frac{t_s}{t_p(n)} \qquad (1)

Remark
The sequential algorithm is often different from the parallel algorithm. In this case, measuring the speed-up is not straightforward. In particular, the following questions must be asked, among others:
• Is the sequential algorithm optimal in complexity?
• Is the sequential algorithm well optimized?
• Does the sequential algorithm make the best use of the cache memory?



Amdahl’s law

Gives an upper bound on the speed-up


• Let t_s be the time necessary to run the code sequentially ;
• Let f be the fraction of t_s corresponding to the part of the code which cannot be parallelized.
So the best expected speed-up is:

S(n) = \frac{t_s}{f\,t_s + \frac{(1-f)\,t_s}{n}} = \frac{n}{1 + (n-1)f} \xrightarrow[n \to \infty]{} \frac{1}{f}

This law is useful to find a reasonable number of computing cores to use for an application.
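To make the formula concrete, here is a minimal Python sketch (not from the slides; the 5% sequential fraction and the 80% efficiency threshold are assumed values) that evaluates Amdahl's speed-up and suggests a reasonable core count:

def amdahl_speedup(f: float, n: int) -> float:
    """Amdahl's law: speed-up on n cores with sequential fraction f."""
    return 1.0 / (f + (1.0 - f) / n)

def reasonable_core_count(f: float, min_efficiency: float = 0.8) -> int:
    """Largest n for which the parallel efficiency S(n)/n stays above the threshold."""
    n = 1
    while amdahl_speedup(f, n + 1) / (n + 1) >= min_efficiency:
        n += 1
    return n

if __name__ == "__main__":
    f = 0.05  # assumed: 5% of the run time cannot be parallelized
    for n in (1, 2, 4, 8, 16, 64):
        print(f"S({n}) = {amdahl_speedup(f, n):.2f}")
    print("Asymptotic limit 1/f =", 1.0 / f)
    print("Reasonable number of cores at 80% efficiency:", reasonable_core_count(f))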

Limitation of the law


f may change with the volume of input data; larger input data may therefore improve the speed-up.



Gustafson’s law

Speed-up behaviour with a constant volume of input data per process


• Hypotheses:
  • t_s ≥ 0, the time to execute the sequential part of the code, is independent of the volume of input data ;
  • t_p > 0, the time to execute the parallel part of the code, is linear in the volume of input data.
• Let us consider t_s + t_p = 1 (one unit of time).
• Let t_s be the time taken by the execution of the sequential part of the code ;
• Let t_p be the time taken by the execution of the parallel part of the code for a fixed amount of data.

S(n) = \frac{t_s + n\,t_p}{t_s + t_p} = n + \frac{(1-n)\,t_s}{t_s + t_p} = n + (1-n)\,t_s
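A quick numerical illustration (not from the slides; the sequential fraction is an assumed value):

def gustafson_speedup(t_s: float, n: int) -> float:
    """Gustafson's law with t_s + t_p = 1: S(n) = n + (1 - n) * t_s."""
    return n + (1 - n) * t_s

if __name__ == "__main__":
    t_s = 0.05  # assumed sequential fraction of one unit of time
    for n in (1, 2, 4, 8, 16, 64):
        print(f"S({n}) = {gustafson_speedup(t_s, n):.2f}")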



Scalability

Definition
For a parallel program, scalability is the behaviour of the speed-up when we increase the
number of processes and/or the amount of input data.

How to evaluate the scalability?


• Evaluate the worst speed-up: for a globally fixed amount of data, draw the speed-up
curve as a function of the number of processes ;
• Evaluate the best speed-up: for a fixed amount of data per process, draw the
speed-up curve as a function of the number of processes ;
• In concrete use of the program, the speed-up usually lies between the worst and the best
scenario.



Granularity

Ratio between computing intensity and the quantity of data exchanged between processes

• Sending and receiving data is costly:

  • Initial cost of a message: each message has a fixed start-up cost (set up the connection, agree on the protocol, etc.). This cost is constant.
  • Cost of the data transfer: then, the cost of the data flow is linear in the amount of data to exchange.
  • These costs are greater than the cost of memory operations in RAM.
• It is better to copy scattered data into a buffer and send the buffer, rather than send the scattered data with multiple sends and receives (see the sketch below).
• Try to minimize the number of data exchanges between processes.
• The greater the ratio between the number of computation instructions and the number of messages to exchange, the better your speed-up will be!
• A low speed-up can be improved with non-blocking data exchanges.
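As an illustration of the buffering advice above, a minimal sketch (not from the slides) using mpi4py and NumPy; the array sizes and indices are assumed values and at least two processes are required:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.arange(1000, dtype=np.float64)   # local array (illustrative)
indices = np.array([3, 17, 256, 999])      # scattered entries to exchange (assumed)

if rank == 0:
    # Pack the scattered values into one contiguous buffer and send a single message.
    buffer = np.ascontiguousarray(data[indices])
    comm.Send(buffer, dest=1, tag=0)
elif rank == 1:
    buffer = np.empty(len(indices), dtype=np.float64)
    comm.Recv(buffer, source=0, tag=0)
    data[indices] = buffer                 # unpack on the receiver side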



Load balancing

Definition
All processes execute a given computation section of the code in the same amount of time.

• The speed-up is badly impacted if some parts of the code are far from being load balanced ;
• Example 1 : A function takes t seconds for half of the processes, and t/2 seconds for the other processes. The maximal speed-up for this function will be:

S(n) = \frac{\frac{n}{2}\,t + \frac{n}{2}\,\frac{t}{2}}{t} = \frac{3n}{4}

• Example 2 : A function takes t/2 seconds for n − 1 processes, and t seconds for one process. The maximal speed-up for this function will be:

S(n) = \frac{(n-1)\,\frac{t}{2} + t}{t} = \frac{n-1}{2} + 1 = \frac{n+1}{2}
Remark : the longer a badly balanced function takes to execute, the greater the penalty. Don't worry about load
balancing for functions that take a very short time to execute.



Overview

1 Performance tools

2 Embarrassingly parallel algorithms

3 Nearly embarrassingly parallel algorithm



Definition

Embarrassingly parallel algorithm


• All data used and computed are independent ;
• No data races in a multithreaded context ;
• No communication between processes in a distributed environment.

Property
• In a distributed parallel context, no data has to be exchanged between processes to compute the results ;
• In a shared-memory parallel environment, parallelization is straightforward, but beware of memory-bound computations ;
• In a distributed environment, the memory-bound limitation is not an issue ;
• If the data are contiguous and the algorithm is vectorizable, it can be ideal for performance on GPGPU.



First example : Vector addition

Add two real vectors of dimension N

w = u + v, \qquad u, v, w \in \mathbb{R}^N

Ideas
• For load balancing, scatter the vectors in equal parts among the threads or processes ;
• Each process/thread computes a part of the vector addition ;
• In distributed memory, the result is scattered among the processes!

Some properties
• Memory accesses and computing operations have the same complexity: on shared memory, the memory bandwidth limits the
performance ;
• On distributed memory, each process uses its own physical memory and no data has to be exchanged: the speed-up may be
linear in the number of processes (if the amount of data per process is large enough).
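A minimal distributed-memory sketch of this scheme (not from the slides), using mpi4py and NumPy; the dimension N is an assumed value and is supposed to be divisible by the number of processes:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nbp = comm.Get_rank(), comm.Get_size()

N = 1_000_000                      # global dimension (assumed divisible by nbp)
u_loc = np.empty(N // nbp)         # local slices of u and v
v_loc = np.empty(N // nbp)

if rank == 0:
    u, v = np.random.rand(N), np.random.rand(N)
else:
    u = v = None

# Scatter equal-sized blocks of u and v for load balancing.
comm.Scatter(u, u_loc, root=0)
comm.Scatter(v, v_loc, root=0)

# Embarrassingly parallel part: each process adds its own block.
w_loc = u_loc + v_loc              # the result w stays scattered among the processes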



Example : Block diagonal matrix multiplication C = A·B (1)

Matrix-matrix multiplication C = A·B where

A = \begin{pmatrix} A_{11} & 0 & \cdots & 0 \\ 0 & A_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & A_{nn} \end{pmatrix}, \qquad
B = \begin{pmatrix} B_{11} & 0 & \cdots & 0 \\ 0 & B_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & B_{nn} \end{pmatrix}

where d_i = \dim(A_{ii}) = \dim(B_{ii}) (n independent matrix-matrix multiplications)

Problem
Close to the vector addition case, but:
• The dimensions d_i of the diagonal blocks are inhomogeneous, and for each diagonal block the computation complexity is d_i^3.
• How to distribute the diagonal blocks among processes to obtain a nearly optimal load balancing?



Example : Block diagonal matrix multiplication C = A·B (2)

Algorithm to distribute the diagonal blocks among processes

Example of algorithm to distribute the diagonal blocks:

• Sort the diagonal blocks by decreasing dimension ;
• Set the "weight" of each process to zero ;
• Distribute the biggest block triplets (A_ii, B_ii, C_ii) among the processes and add each d_i to the weight of the corresponding process ;
• While some diagonal blocks are not yet distributed:
  • Assign the biggest undistributed block to the process having the smallest weight ;
  • Add the corresponding d_i to the weight of that process.

Remark : all processes compute the distribution of the diagonal blocks. It is better to have every process do the same computation
than to have process 0 compute the distribution and send it to the other processes. A minimal sketch of this greedy distribution is given below.
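A minimal Python sketch of this greedy distribution (not from the slides); the block dimensions in the example are assumed values, and a heap keyed by the cumulative weight of each process is used:

import heapq

def distribute_blocks(dims, nbp):
    """Greedy distribution of diagonal blocks among nbp processes.

    dims[i] is the dimension d_i of block i; its weight is taken as d_i
    (d_i**3 could also be used to reflect the computation cost).
    Returns owner[i] = rank of the process owning block i.
    """
    order = sorted(range(len(dims)), key=lambda i: dims[i], reverse=True)
    heap = [(0, rank) for rank in range(nbp)]   # (weight, rank), smallest weight first
    heapq.heapify(heap)
    owner = [None] * len(dims)
    for i in order:
        weight, rank = heapq.heappop(heap)      # process with the smallest weight
        owner[i] = rank
        heapq.heappush(heap, (weight + dims[i], rank))
    return owner

# Every process runs the same computation, so no communication is needed.
print(distribute_blocks([50, 200, 120, 80, 300, 40], nbp=3))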



Third example : Syracuse series (1)

Definition of the Syracuse series

u_0 \text{ chosen}, \qquad u_{n+1} = \begin{cases} u_n / 2 & \text{if } u_n \text{ is even} \\ 3\,u_n + 1 & \text{if } u_n \text{ is odd} \end{cases}

Properties of the Syracuse series

• One cycle exists : 1 → 4 → 2 → 1 → · · ·
• A conjecture : ∀u_0 ∈ N, the series reaches the cycle above in a finite number of iterations.

Some definitions
• Length of the flight : number of iterations for the series to reach the value 1 ;
• Height of the flight : maximal value reached by the series.

The goal of the program : compute the length and the height of the flight for a large number of (odd) values of u_0.



Third example : Syracuse series (2)

Problem
• Each process computes the length and the height for a subset of initial values u_0 ;
• The computation intensity depends on the length of each Syracuse series ;
• It is impossible to know the computation complexity of a series prior to computing it ;
• The problem is therefore not naturally well balanced ;
⇒ Use a dynamic algorithm on the "root" process (the "master") to distribute the series among the other processes (the "slaves").

Master's algorithm
• Send a small pack of series to each slave process ;
• While some packs of series remain to send:
  • Wait for a slave asking for series and send it the next pack ;
• Send a termination order to all slave processes.

Slave's algorithm
• While a pack of series to compute is received:
  • Compute each series of the pack.

Remark : a task is composed of a pack of series in order to obtain a good granularity. A minimal master/slave sketch is given below.
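A minimal mpi4py master/slave sketch of this scheme (not from the slides); PACK_SIZE and N_MAX are assumed values, and at least two processes are required:

from mpi4py import MPI

def flight(u0):
    """Length and height of the Syracuse flight starting from u0."""
    u, length, height = u0, 0, u0
    while u != 1:
        u = u // 2 if u % 2 == 0 else 3 * u + 1
        length += 1
        height = max(height, u)
    return length, height

comm = MPI.COMM_WORLD
rank, nbp = comm.Get_rank(), comm.Get_size()
PACK_SIZE, N_MAX = 1000, 1_000_000   # assumed granularity and range of u0

if rank == 0:                        # master
    packs = [range(start, min(start + 2 * PACK_SIZE, N_MAX), 2)
             for start in range(1, N_MAX, 2 * PACK_SIZE)]   # packs of odd values
    status = MPI.Status()
    for dest in range(1, nbp):       # one initial pack per slave
        comm.send(packs.pop(), dest=dest, tag=1)
    results, pending = [], nbp - 1
    while pending > 0:
        results += comm.recv(source=MPI.ANY_SOURCE, status=status)
        if packs:                    # send the next pack to the slave that just answered
            comm.send(packs.pop(), dest=status.Get_source(), tag=1)
        else:                        # termination order
            comm.send(None, dest=status.Get_source(), tag=0)
            pending -= 1
else:                                # slave
    while True:
        pack = comm.recv(source=0)
        if pack is None:
            break
        comm.send([(u0, *flight(u0)) for u0 in pack], dest=0)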


Overview

1 Performance tools

2 Embarrassingly parallel algorithms

3 Nearly embarrassingly parallel algorithm



Nearly embarrassingly parallel algorithm

Definition
Each process performs an independent computation, with a final communication to finalize the result.

Examples
• Dot product of two vectors of R^n ;
• Computation of an integral ;
• Matrix-vector product.

Examples of algorithms that are not embarrassingly parallel

• Parallel sorting algorithms ;
• Matrix-matrix product ;
• Algorithms based on domain decomposition methods.



Integral computation

Integral computation
• Integral computation based on a Gauss quadrature formula:

\int_a^b f(x)\,dx \approx \sum_{i=1}^{N_g} \omega_i\, f(g_i) \qquad (2)

where \omega_i \in \mathbb{R} are the weights and g_i \in \mathbb{R} the integration points.


• In fact, Gauss quadrature rules are given on the interval [−1, 1]: a change of variable is needed in the integral!
• \{g_1 = 0,\ \omega_1 = 2\} : order 1 Gauss-Legendre quadrature ;
• \{g_1 = -\sqrt{3/5},\ \omega_1 = \tfrac{5}{9};\ g_2 = 0,\ \omega_2 = \tfrac{8}{9};\ g_3 = +\sqrt{3/5},\ \omega_3 = \tfrac{5}{9}\} : order 3 Gauss-Legendre quadrature ;
• Remark : order n means that the quadrature computes the exact value of the integral for polynomials of degree less than or
equal to n.
• To compute a better approximation of the integral, the interval is subdivided into several smaller intervals.
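A minimal sequential Python sketch of the composite 3-point Gauss-Legendre rule (not from the slides); the change of variable maps [−1, 1] onto each sub-interval, and the number of sub-intervals is an assumed parameter:

import math

# Standard 3-point Gauss-Legendre rule on [-1, 1].
GAUSS_POINTS  = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
GAUSS_WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)

def integrate(f, a, b, n_intervals=100):
    """Composite Gauss-Legendre quadrature of f on [a, b]."""
    h = (b - a) / n_intervals
    total = 0.0
    for i in range(n_intervals):
        ai, bi = a + i * h, a + (i + 1) * h
        # Change of variable from [-1, 1] to [ai, bi]: x = (ai+bi)/2 + (bi-ai)/2 * g.
        mid, half = 0.5 * (ai + bi), 0.5 * (bi - ai)
        total += half * sum(w * f(mid + half * g)
                            for g, w in zip(GAUSS_POINTS, GAUSS_WEIGHTS))
    return total

print(integrate(math.sin, 0.0, math.pi))   # should be close to 2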



Parallel integral computation

I = \int_a^b f(x)\,dx = \sum_{i=1}^{N} \int_{a_i}^{b_i} f(x)\,dx = \sum_{i=1}^{N} I_i \quad \text{where } a_1 = a,\ a_{N+1} = b_N = b \text{ and } a_i < b_i = a_{i+1}

Main ideas
• Scatter the sub-intervals among the processes; each process p computes a partial sum over its set P of sub-intervals:

S_p = \sum_{[a_i, b_i] \in P} \int_{a_i}^{b_i} f(x)\,dx

• Use a reduction to compute the integral value (global sum):

S = \sum_{p=1}^{nb_p} S_p
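A minimal mpi4py sketch of this scheme (not from the slides); for brevity the local quadrature uses a simple midpoint rule instead of the Gauss rule sketched earlier, and the interval and integrand are assumed values:

import math
from mpi4py import MPI

def integrate(f, a, b, n_intervals=100):
    """Sequential quadrature on [a, b] (midpoint rule, to keep the sketch self-contained)."""
    h = (b - a) / n_intervals
    return h * sum(f(a + (i + 0.5) * h) for i in range(n_intervals))

comm = MPI.COMM_WORLD
rank, nbp = comm.Get_rank(), comm.Get_size()

a, b = 0.0, math.pi                     # global integration interval (assumed)
h = (b - a) / nbp
# Each process integrates over its own chunk [a + rank*h, a + (rank+1)*h].
S_p = integrate(math.sin, a + rank * h, a + (rank + 1) * h)

# Global sum of the partial sums; only the root gets the result.
S = comm.reduce(S_p, op=MPI.SUM, root=0)
if rank == 0:
    print("Integral ≈", S)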



Matrix-vector product

Let A ∈ R^{n×m} be a matrix and u ∈ R^m a vector.


The goal of this algorithm is to compute the matrix-vector product:

v = A\,u \in \mathbb{R}^n \quad \text{where} \quad v_i = \sum_{j=1}^{m} A_{ij}\, u_j

Two possibilities to parallelize this algorithm:


• Partition the matrix into blocks of rows ;
• Partition the matrix into blocks of columns and the vector u into blocks of the same size.
The goal is to split the computation among the processes and use a global communication operation to get the final result.



Matrix-vector product by rows splitting

Let

A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_I \\ \vdots \\ A_N \end{pmatrix} \quad \text{where } \forall I \in \{1, 2, \dots, N\},\ A_I \in \mathbb{R}^{\frac{n}{N} \times m}.

Algorithm
• Each process holds some rows of A and all of u ;
• Each process computes a part of v : process I computes V_I = A_I\,u \in \mathbb{R}^{n/N} ;
• To compute another matrix-vector product with the new vector, we need to gather the vector on all processes (only
necessary for the distributed-memory parallel algorithm). A sketch is given below.
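A minimal mpi4py/NumPy sketch of the row-splitting variant (not from the slides); the sizes and the matrix entries are assumed values, and n is supposed to be divisible by the number of processes:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nbp = comm.Get_rank(), comm.Get_size()

n = m = 8 * nbp                          # global sizes (assumed divisible by nbp)
rows = slice(rank * n // nbp, (rank + 1) * n // nbp)

# Each process owns a block of rows of A and the whole vector u.
A_loc = np.array([[1.0 / (1 + i + j) for j in range(m)]
                  for i in range(rows.start, rows.stop)])   # illustrative matrix block
u = np.ones(m)

# Local part of the product: V_I = A_I . u
v_loc = A_loc @ u

# Gather the full vector v on every process (needed to chain products).
v = np.empty(n)
comm.Allgather(v_loc, v)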



Matrix-vector product by columns splitting

Let

A = (A_1 \,|\, A_2 \,|\, \dots \,|\, A_I \,|\, \dots \,|\, A_N) \quad \text{and} \quad u = \begin{pmatrix} U_1 \\ U_2 \\ \vdots \\ U_I \\ \vdots \\ U_N \end{pmatrix} \quad \text{where } \forall I \in \{1, 2, \dots, N\},\ A_I \in \mathbb{R}^{n \times \frac{m}{N}} \text{ and } U_I \in \mathbb{R}^{\frac{m}{N}}

Algorithm
• Each process holds some columns of A and the corresponding rows of u ;
• Each process computes a partial contribution to v. Process I computes

V_I = A_I\, U_I \in \mathbb{R}^n

• Finally, a sum reduction is done to get the final result: v = \sum_{I=1}^{N} V_I. A sketch is given below.
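A matching mpi4py/NumPy sketch for the column-splitting variant (not from the slides, same assumptions); an Allreduce is used so that every process gets the result, while a plain Reduce to the root would also match the slide:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nbp = comm.Get_rank(), comm.Get_size()

n = m = 8 * nbp                          # global sizes (assumed divisible by nbp)
cols = slice(rank * m // nbp, (rank + 1) * m // nbp)

# Each process owns a block of columns of A and the matching block of u.
A_loc = np.array([[1.0 / (1 + i + j) for j in range(cols.start, cols.stop)]
                  for i in range(n)])    # illustrative matrix block
u_loc = np.ones(m // nbp)

# Partial contribution: V_I = A_I . U_I (full length n, partial values).
v_partial = A_loc @ u_loc

# Sum-reduce the partial vectors so every process gets v = sum_I V_I.
v = np.empty(n)
comm.Allreduce(v_partial, v, op=MPI.SUM)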



Buddha set

Let's consider the complex recursive Mandelbrot series:

z_0 = 0, \qquad z_{n+1} = z_n^2 + c \quad \text{where } c \in \mathbb{C} \text{ is chosen.}

Figure – Mandelbrot (left) and Buddha (right) sets

Properties of the Mandelbrot and Buddha sets

• The series is divergent if ∃n > 0, |z_n| > 2 ;
• Region of interest: the disk D of radius 2 ;
• In some regions of the disk, convergence can be proven ;
• But the convergence behaviour is chaotic in some regions of D!
• Mandelbrot set: colour c with the "divergence speed" of the corresponding series ;
• Buddha set: colour the orbit of divergent series.



Buddha’s set algorithm

Algorithm
• Draw N random values of c in the disk D for which the corresponding series diverges ;
• Compute the orbit of each such series until divergence and increment the intensity of the
pixel representing each value of the orbit.

Parallelization of the algorithm


• Master-slave algorithm to ensure load balancing ;
• For granularity, define a task as a pack of random values of c. A sketch of the per-task work is given below.
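A minimal NumPy sketch of the per-task work (not from the slides); the image size, the iteration limit and the pixel mapping are assumptions, and the master/slave distribution of the packs would follow the same pattern as for the Syracuse series:

import numpy as np

WIDTH, HEIGHT, MAX_ITER = 800, 800, 1000   # assumed image size and iteration limit

def accumulate_orbits(c_values, image):
    """Accumulate the orbits of divergent series into the image intensities."""
    for c in c_values:
        z, orbit = 0j, []
        for _ in range(MAX_ITER):
            z = z * z + c
            orbit.append(z)
            if abs(z) > 2.0:               # divergence detected: record the orbit
                for w in orbit:
                    # Map a point of the disk of radius 2 onto a pixel.
                    i = int((w.real + 2.0) / 4.0 * (WIDTH - 1))
                    j = int((w.imag + 2.0) / 4.0 * (HEIGHT - 1))
                    if 0 <= i < WIDTH and 0 <= j < HEIGHT:
                        image[j, i] += 1
                break                      # non-divergent series are simply ignored

image = np.zeros((HEIGHT, WIDTH), dtype=np.int64)
rng = np.random.default_rng()
# One task: a pack of random values of c drawn in the disk D of radius 2.
c_pack = rng.uniform(-2, 2, 1000) + 1j * rng.uniform(-2, 2, 1000)
c_pack = c_pack[np.abs(c_pack) <= 2.0]     # keep only values inside the disk
accumulate_orbits(c_pack, image)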

