Calculating Pi in Parallel Using MPI
Aiichiro Nakano
Collaboratory for Advanced Computing & Simulations
Department of Computer Science
Department of Physics & Astronomy
Department of Chemical Engineering & Materials Science
University of Southern California
Email: [email protected]
Objectives
1. Task decomposition (parallel programming = who does what)
2. Scalability analysis
Integral Representation of π

With the substitution x = tan θ (so dx = dθ/cos²θ):

$$\int_0^1 \frac{4}{1+x^2}\,dx
= \int_0^{\pi/4} \frac{4}{1+\tan^2\theta}\,\frac{d\theta}{\cos^2\theta}
= \int_0^{\pi/4} 4\,d\theta
= \pi$$
Numerical Integration of π

Discretization:
Δ = 1/N (step = 1.0/NBIN in the code)
x_i = (i+0.5)Δ (i = 0, ..., N-1)

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx
\;\approx\; \sum_{i=0}^{N-1} \frac{4}{1+x_i^2}\,\Delta$$
#include <stdio.h>
#define NBIN 10000000
int main() {
  int i; double step,x,sum=0.0,pi;
  step = 1.0/NBIN;
  for (i=0; i<NBIN; i++) {  /* mid-point rule over NBIN bins */
    x = (i+0.5)*step;
    sum += 4.0/(1.0+x*x);
  }
  pi = sum*step;
  printf("PI = %f\n",pi);
  return 0;
}
Parallelization: Who Does What?
...
for (i=myid; i<NBIN; i+=nprocs) {
  x = (i+0.5)*step;
  partial += 4.0/(1.0+x*x);
}
partial *= step;
pi = global_sum(partial);
...

Interleaved assignment of quadrature points (bins) to MPI processes
myid = MPI rank
nprocs = number of MPI processes
Make global_sum() operate on double & use MPI_DOUBLE in it
Use double MPI_Wtime() to measure the running time in seconds
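For concreteness, here is a minimal end-to-end sketch of the parallel program, assuming the global sum is done with MPI_Allreduce and timing with MPI_Wtime(); the course's global_pi.c may instead call a hand-coded global_sum() (e.g., the butterfly shown later), so treat this as an illustration rather than the original source:

#include <stdio.h>
#include <mpi.h>
#define NBIN 10000000
int main(int argc, char *argv[]) {
  int i, myid, nprocs;
  double step, x, partial = 0.0, pi, t0, t1;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  t0 = MPI_Wtime();                       /* start timer */
  step = 1.0/NBIN;
  for (i = myid; i < NBIN; i += nprocs) { /* interleaved bins: myid, myid+nprocs, ... */
    x = (i+0.5)*step;
    partial += 4.0/(1.0+x*x);
  }
  partial *= step;
  /* global sum of per-rank partial sums; note MPI_DOUBLE */
  MPI_Allreduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  t1 = MPI_Wtime();                       /* stop timer */
  if (myid == 0) printf("PI = %f; time = %f s\n", pi, t1-t0);
  MPI_Finalize();
  return 0;
}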
Parallel Running Time
global_pi.c: NBIN = 10^7, on hpc-login2
How Efficient Is the Parallel Program?
#PBS -l nodes=16:ppn=1,arch=x86_64
...
np=$(cat $PBS_NODEFILE | wc -l)
mpirun -np $np -machinefile $PBS_NODEFILE ./global_pi
mpirun -np 8 -machinefile $PBS_NODEFILE ./global_pi
mpirun -np 4 -machinefile $PBS_NODEFILE ./global_pi
mpirun -np 2 -machinefile $PBS_NODEFILE ./global_pi
mpirun -np 1 -machinefile $PBS_NODEFILE ./global_pi
Parallel Efficiency
Execution time: T(W,P)
  W: workload
  P: number of processors
Speed: $S(W,P) = \dfrac{W}{T(W,P)}$
Speedup: $S_P = \dfrac{S(W_P,P)}{S(W_1,1)} = \dfrac{W_P\,T(W_1,1)}{W_1\,T(W_P,P)}$
Efficiency: $E_P = \dfrac{S_P}{P} = \dfrac{W_P\,T(W_1,1)}{P\,W_1\,T(W_P,P)}$
How to scale W_P with P?
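A purely illustrative example (hypothetical timings, not measurements): with a fixed workload $W_P = W_1 = W$, suppose $T(W,1) = 8$ s and $T(W,4) = 2.5$ s. Then

$$S_4 = \frac{T(W,1)}{T(W,4)} = \frac{8}{2.5} = 3.2,
\qquad
E_4 = \frac{S_4}{4} = 0.8 .$$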
Fixed Problem-Size Scaling
W_P = W = constant (strong scaling)
Speedup: $S_P = \dfrac{T(W,1)}{T(W,P)}$
Efficiency: $E_P = \dfrac{T(W,1)}{P\,T(W,P)}$
Amdahl's law: f (= sequential fraction of the workload) limits the asymptotic speedup:

$$T(W,P) = f\,T(W,1) + \frac{(1-f)\,T(W,1)}{P}
\;\;\Rightarrow\;\;
S_P = \frac{T(W,1)}{T(W,P)} = \frac{1}{f + (1-f)/P}
\;\longrightarrow\; \frac{1}{f} \quad (P \to \infty)$$
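For instance, with a hypothetical sequential fraction f = 0.1 (an assumed value, just to illustrate the bound):

$$S_{16} = \frac{1}{0.1 + 0.9/16} = \frac{1}{0.15625} = 6.4,
\qquad
\lim_{P\to\infty} S_P = \frac{1}{f} = 10 .$$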
Isogranular Scaling
W_P = Pw (weak scaling)
  w = constant workload per processor (granularity)
Speedup: $S_P = \dfrac{S(Pw,P)}{S(w,1)} = \dfrac{Pw/T(Pw,P)}{w/T(w,1)} = \dfrac{P\,T(w,1)}{T(Pw,P)}$
Efficiency: $E_P = \dfrac{S_P}{P} = \dfrac{T(w,1)}{T(Pw,P)}$
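Illustration (hypothetical timings): if the per-processor workload w is held fixed and the run time grows only slightly, say $T(w,1) = 10$ s and $T(64w,64) = 12.5$ s, then

$$E_{64} = \frac{T(w,1)}{T(64w,64)} = \frac{10}{12.5} = 0.8,
\qquad
S_{64} = 64\,E_{64} = 51.2 .$$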
Analysis of Global_Pi Program
Workload ∝ number of quadrature points, N (or NBIN in the program)
Parallel execution time on P processors:
> Local computation ∝ N/P
> Butterfly computation/communication in global_sum() ∝ log P

$$T(N,P) = T_{\mathrm{comp}}(N,P) + T_{\mathrm{global}}(P) = \alpha\frac{N}{P} + \beta\log P$$

Local computation (αN/P term):
for (i=myid; i<N; i+=P) {
  x = (i+0.5)*step; partial += 4.0/(1.0+x*x);
}

Butterfly global sum (β log P term, pseudocode):
for (l=0; l<log2(P); ++l) {
  partner = myid XOR 2^l;
  send mydone to partner;
  receive hisdone from partner;
  mydone += hisdone;
}
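A concrete C sketch of this butterfly sum, using MPI_Sendrecv at each of the log2 P stages; it assumes the number of ranks is a power of two, and the function and variable names mirror the pseudocode above rather than the course's actual global_sum() source:

#include <mpi.h>
/* Butterfly all-reduce sum over nprocs = 2^L ranks (power of two assumed) */
double global_sum(double partial) {
  int myid, nprocs, bitvalue, partner;
  double mydone = partial, hisdone;
  MPI_Status status;
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  for (bitvalue = 1; bitvalue < nprocs; bitvalue <<= 1) { /* bitvalue = 2^l */
    partner = myid ^ bitvalue;                            /* myid XOR 2^l */
    /* exchange partial sums with the partner, then accumulate */
    MPI_Sendrecv(&mydone, 1, MPI_DOUBLE, partner, 0,
                 &hisdone, 1, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, &status);
    mydone += hisdone;
  }
  return mydone;  /* every rank ends up with the full sum */
}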
Fixed Problem-Size Scaling
Speedup: $S_P = \dfrac{T(N,1)}{T(N,P)} = \dfrac{\alpha N}{\alpha N/P + \beta\log P} = \dfrac{P}{1 + \dfrac{\beta}{\alpha}\dfrac{P\log P}{N}}$
Efficiency: $E_P = \dfrac{S_P}{P} = \dfrac{1}{1 + \dfrac{\beta}{\alpha}\dfrac{P\log P}{N}}$
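Plugging in illustrative numbers (the ratio β/α = 10³ below is an assumed value, not a measurement on hpc-login2): with N = 10^7 and P = 16,

$$E_{16} = \frac{1}{1 + 10^{3}\cdot\dfrac{16\log_2 16}{10^{7}}}
= \frac{1}{1 + 6.4\times10^{-3}} \approx 0.994,$$

so the communication term matters only once $P\log P$ becomes comparable to $N\alpha/\beta$.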
global_pi.c: N = 10^7, on hpc-login2
[Plots: measured T(N,P) vs. P, speedup $S_P = T(N,1)/T(N,P)$, and efficiency $E_P = T(N,1)/(P\,T(N,P))$]
Fixed Problem-Size Scaling
Speedup model:

$$E_P = \frac{S_P}{P} = \frac{1}{1 + \dfrac{\beta}{\alpha}\dfrac{P\log P}{N}}$$

global_pi.c: N = 10^7, on hpc-login2
Runtime Variance among Ranks
Isogranular Scaling
n = N/P = constant
Efficiency:

$$E_P = \frac{T(n,1)}{T(nP,P)} = \frac{\alpha n}{\alpha n + \beta\log P} = \frac{1}{1 + \dfrac{\beta}{\alpha n}\log P}$$

global_pi_iso.c: N/P = 10^7, on HPC
[Plots: T(Pn,P) vs. P and efficiency $E_P = T(n,1)/T(Pn,P)$]
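To get a feel for the slow (logarithmic) degradation, take an assumed overhead ratio β/(αn) = 10⁻³ (purely illustrative) and base-2 logarithms:

$$E_{1024} = \frac{1}{1 + 10^{-3}\log_2 1024} = \frac{1}{1.01} \approx 0.99,$$

i.e., under isogranular (weak) scaling the efficiency decays only logarithmically with P.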