1. The document describes calculating π in parallel using MPI by decomposing a numerical-integration task into discrete parts that can be computed independently by different processes.
2. Each process is assigned a subset of the quadrature points and computes its contribution to the integral; the partial results are then combined with a global sum reduction.
3. Performance analysis shows that as the number of processes increases, efficiency decreases because of the growing communication overhead of the global sum operation.


Calculating π in Parallel Using MPI
Aiichiro Nakano
Collaboratory for Advanced Computing & Simulations
Department of Computer Science
Department of Physics & Astronomy
Department of Chemical Engineering & Materials Science
University of Southern California
Email: [email protected]
Objectives
1. Task decomposition (parallel programming = who does what)
2. Scalability analysis
Integral Representation of π

With the substitution x = tan θ (so dx = dθ/cos²θ and 1 + tan²θ = 1/cos²θ):

$$
\int_0^1 \frac{4}{1+x^2}\,dx
= \int_0^{\pi/4} \frac{4}{1+\tan^2\theta}\,\frac{d\theta}{\cos^2\theta}
= \int_0^{\pi/4} 4\,d\theta
= \pi
$$
Numerical Integration of π

Discretization (midpoint rule):
Δ = 1/N (step = 1.0/NBIN in the code)
x_i = (i + 0.5)Δ, i = 0, ..., N-1

$$
\int_0^1 \frac{4}{1+x^2}\,dx
\;\approx\; \sum_{i=0}^{N-1} \frac{4}{1+x_i^2}\,\Delta
\;\longrightarrow\; \pi
$$
#include <stdio.h>
#define NBIN 10000000

int main() {
  int i;
  double step, x, sum = 0.0, pi;
  step = 1.0/NBIN;
  for (i = 0; i < NBIN; i++) {
    x = (i + 0.5)*step;        /* midpoint of the i-th bin */
    sum += 4.0/(1.0 + x*x);
  }
  pi = sum*step;
  printf("PI = %f\n", pi);
  return 0;
}
Parallelization: Who Does What?
...
for (i=myid; i<NBIN; i+=nprocs) {
  x = (i + 0.5)*step;
  partial += 4.0/(1.0 + x*x);
}
partial *= step;
pi = global_sum(partial);
...

Interleaved assignment of quadrature points (bins) to MPI processes:
myid = MPI rank
nprocs = number of MPI processes

Make global_sum() operate on double values and use MPI_DOUBLE in it.
Use double MPI_Wtime() to measure the running time in seconds.
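
For reference, here is a minimal, self-contained sketch of the complete parallel program under the assumptions above. It substitutes the library call MPI_Allreduce for the hand-written global_sum() routine (the butterfly version of which is analyzed later in these slides) and uses MPI_Wtime() to time the run; the structure and variable names beyond those shown above are illustrative, not the course's actual global_pi.c.

#include <stdio.h>
#include <mpi.h>
#define NBIN 10000000

int main(int argc, char **argv) {
  int i, myid, nprocs;
  double step, x, partial = 0.0, pi, t0, t1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  t0 = MPI_Wtime();
  step = 1.0/NBIN;
  /* Interleaved assignment: rank myid handles bins myid, myid+nprocs, ... */
  for (i = myid; i < NBIN; i += nprocs) {
    x = (i + 0.5)*step;
    partial += 4.0/(1.0 + x*x);
  }
  partial *= step;

  /* Global sum of the partial results; MPI_Allreduce stands in for global_sum() */
  MPI_Allreduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  t1 = MPI_Wtime();

  if (myid == 0)
    printf("PI = %f, time = %f s on %d processes\n", pi, t1 - t0, nprocs);

  MPI_Finalize();
  return 0;
}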
Parallel Running Time
global_pi.c: NBIN = 10^7, on hpc-login2

How Efficient Is the Parallel Program?
#PBS -l nodes=16:ppn=1,arch=x86_64
...
np=$(cat $PBS_NODEFILE | wc -l)
mpirun -np $np -machinefile $PBS_NODEFILE ./global_pi
mpirun -np 8 -machinefile $PBS_NODEFILE ./global_pi
mpirun -np 4 -machinefile $PBS_NODEFILE ./global_pi
mpirun -np 2 -machinefile $PBS_NODEFILE ./global_pi
mpirun -np 1 -machinefile $PBS_NODEFILE ./global_pi
Parallel Efficiency

Execution time: T(W, P)
W: workload
P: number of processors

Speed:

$$
S(W,P) = \frac{W}{T(W,P)}
$$

Speedup:

$$
S_P = \frac{S(W_P,P)}{S(W_1,1)} = \frac{W_P\,T(W_1,1)}{W_1\,T(W_P,P)}
$$

Efficiency:

$$
E_P = \frac{S_P}{P} = \frac{W_P\,T(W_1,1)}{P\,W_1\,T(W_P,P)}
$$

How to scale W_P with P?
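
As a quick illustration with made-up timings (not measurements from these slides): for a fixed workload, W_P = W_1 = W, the speedup reduces to T(W,1)/T(W,P). If T(W,1) = 8 s and T(W,4) = 2.5 s, then

$$
S_4 = \frac{8\ \mathrm{s}}{2.5\ \mathrm{s}} = 3.2,
\qquad
E_4 = \frac{S_4}{4} = 0.8 .
$$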
Fixed Problem-Size Scaling

W_P = W = constant (strong scaling)

Speedup:

$$
S_P = \frac{T(W,1)}{T(W,P)}
$$

Efficiency:

$$
E_P = \frac{T(W,1)}{P\,T(W,P)}
$$

Amdahl's law: f (the sequential fraction of the workload) limits the asymptotic speedup.

$$
T(W,P) = f\,T(W,1) + \frac{(1-f)\,T(W,1)}{P}
$$

$$
\therefore\ S_P = \frac{T(W,1)}{T(W,P)} = \frac{1}{f + (1-f)/P}
$$

$$
\therefore\ S_P \to \frac{1}{f} \quad (P \to \infty)
$$
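
For example, with an illustrative sequential fraction f = 0.1 (10% of the work cannot be parallelized), no number of processors can push the speedup past 10:

$$
S_P = \frac{1}{0.1 + 0.9/P} \;\longrightarrow\; \frac{1}{0.1} = 10 \quad (P \to \infty).
$$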
Isogranular Scaling

W_P = P·w (weak scaling)
w = constant workload per processor (granularity)

Speedup:

$$
S_P = \frac{S(Pw,P)}{S(w,1)} = \frac{Pw/T(Pw,P)}{w/T(w,1)} = \frac{P\,T(w,1)}{T(Pw,P)}
$$

Efficiency:

$$
E_P = \frac{S_P}{P} = \frac{T(w,1)}{T(Pw,P)}
$$
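
As an illustration with hypothetical timings: if the per-processor workload w is held fixed and the running time grows only slightly with P, say T(w,1) = 10 s and T(64w,64) = 12 s, then

$$
E_{64} = \frac{T(w,1)}{T(64w,64)} = \frac{10}{12} \approx 0.83,
\qquad
S_{64} = 64\,E_{64} \approx 53 .
$$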
Analysis of Global_Pi Program

Workload ∝ number of quadrature points, N (NBIN in the program)

Parallel execution time on P processors:
> Local computation ∝ N/P
> Butterfly computation/communication in global_sum() ∝ log P

$$
T(N,P) = T_{\mathrm{comp}}(N,P) + T_{\mathrm{global}}(P) = \alpha\,\frac{N}{P} + \beta\log P
$$

Local computation:

for (i=myid; i<N; i+=P) {
  x = (i + 0.5)*step;
  partial += 4.0/(1.0 + x*x);
}

Butterfly global sum (pseudocode):

for (l = 0; l < log2(P); ++l) {
  partner = myid XOR 2^l;
  send mydone to partner;
  receive hisdone from partner;
  mydone += hisdone;
}
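
Below is a sketch of how the butterfly (hypercube) global sum in the pseudocode can be written with MPI. It assumes the number of processes is a power of two and uses MPI_Sendrecv to exchange partial sums with each partner, which is one way to avoid deadlock; the course's actual global_sum() may differ in detail.

#include <mpi.h>

/* Hypercube/butterfly all-reduce sum: every rank returns the global sum.
   Assumes nprocs is a power of two. */
double global_sum(double partial) {
  int myid, nprocs, bitvalue, partner;
  double mydone = partial, hisdone;
  MPI_Status status;

  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  for (bitvalue = 1; bitvalue < nprocs; bitvalue <<= 1) {
    partner = myid ^ bitvalue;   /* myid XOR 2^l */
    /* Exchange running partial sums with the partner in a single call */
    MPI_Sendrecv(&mydone, 1, MPI_DOUBLE, partner, 0,
                 &hisdone, 1, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, &status);
    mydone += hisdone;
  }
  return mydone;   /* identical global sum on all ranks after log2(P) steps */
}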
Fixed Problem-Size Scaling

Speedup:

$$
S_P = \frac{T(N,1)}{T(N,P)} = \frac{\alpha N}{\alpha N/P + \beta\log P}
= \frac{P}{1 + \dfrac{\beta}{\alpha}\dfrac{P\log P}{N}}
$$

Efficiency:

$$
E_P = \frac{S_P}{P} = \frac{1}{1 + \dfrac{\beta}{\alpha}\dfrac{P\log P}{N}}
$$
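
To get a feel for the model, plug in purely illustrative values for the machine-dependent ratio (these are not constants measured in the slides), taking log P as log₂P to match the butterfly's log₂P steps: with β/α = 10⁴ and N = 10⁷, at P = 64 (log₂64 = 6),

$$
E_{64} = \frac{1}{1 + \dfrac{10^{4}\cdot 64\cdot 6}{10^{7}}} = \frac{1}{1.384} \approx 0.72,
\qquad
S_{64} = 64\,E_{64} \approx 46 .
$$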
global_pi.c: N = 10^7, on hpc-login2 — plots of running time T(N,P) vs. P, speedup S_P = T(N,1)/T(N,P), and efficiency E_P = T(N,1)/(P T(N,P)).
Fixed Problem-Size Scaling

Speedup model:

$$
E_P = \frac{S_P}{P} = \frac{1}{1 + \dfrac{\beta}{\alpha}\dfrac{P\log P}{N}}
$$

global_pi.c: N = 10^7, on hpc-login2
Runtime Variance among Ranks
Isogranular Scaling

n = N/P = constant

Efficiency:

$$
E_P = \frac{T(n,1)}{T(nP,P)} = \frac{\alpha n}{\alpha n + \beta\log P}
= \frac{1}{1 + \dfrac{\beta}{\alpha n}\log P}
$$

global_pi_iso.c: N/P = 10^7, on HPC — plots of T(Pn,P) vs. P and efficiency E_P = T(n,1)/T(Pn,P).
