Brief Overview of Parallel Computing
A Little Motivation
M. D. Jones, Ph.D.
Spring 2014
Background
Big Picture

Decomposition

[Figure: uniform volume example - decomposition in 1D, 2D, and 3D across
multiple CPUs. Note that load balancing in this case is simpler if the
work load per cell is the same.]
Decomposition (continued)

[Figure: nonuniform example - decomposition in 3D for a 16-way parallel
adaptive finite element calculation (in this case using a terrain
elevation). Load balancing in this case is much more complex.]

More Speed!

Note that, in the modern era, HPC inevitably means parallel (or
concurrent) computation. The driving forces behind this are pretty
simple - the desire is:

Solve my problem faster, i.e. I want the answer now (and who
doesn't want that?)

Solve a bigger problem than I (or anyone else, for that matter)
have ever before been able to tackle, and do so in a reasonable
time (generally reasonable = within a graduate student's time to
graduate!)
A Concrete Example
Example

Using classical gravitation, we have a very simple (but long-ranged)
force/potential. For each of N bodies, the resulting force is computed
from the other N - 1 bodies, thereby requiring N^2 force calculations
per step. A galaxy consists of approximately 10^12 such bodies, and
even the best algorithm for computing the forces requires N log2(N)
calculations, that is ≈ 10^12 ln(10^12)/ln(2) ≈ 4 x 10^13 calculations
per step. If each calculation takes ≈ 1 μsec, that is 40 x 10^6 seconds
per step. That is about 1.3 CPU-years per step. Ouch!
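To make the arithmetic concrete, a minimal C sketch (not from the
original slides) that reproduces the estimate, assuming the ≈ 1 μsec
per force calculation quoted above:

/* Back-of-the-envelope check of the N-body cost estimate above.
   The 10^12 bodies and 1 microsecond per calculation are the slide's
   assumptions, not measured values. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double N      = 1.0e12;              /* bodies in the galaxy            */
    double calcs  = N * log2(N);         /* N log2(N) force calculations    */
    double t_calc = 1.0e-6;              /* ~1 microsecond per calculation  */
    double secs   = calcs * t_calc;      /* wall time for one step          */
    double years  = secs / (365.25 * 24.0 * 3600.0);

    printf("calculations per step: %.3g\n", calcs);
    printf("seconds per step:      %.3g\n", secs);
    printf("CPU-years per step:    %.2f\n", years);
    return 0;
}

(Compile with the math library, e.g. cc nbody_cost.c -lm.)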
A Size Example

Example

In this problem, we want to increase the resolution to capture the
essential underlying behavior of the physical process being modeled.
So we determine that we need a matrix of order, say, 400000 (i.e.,
400000 x 400000 elements). Simply to store this matrix, in 64-bit
representation, requires ≈ 1.28 x 10^12 Bytes of memory, or roughly
1200 GBytes. We could fit this onto a cluster with, say, 10^3 nodes,
each having 4 GBytes of memory, by distributing the matrix across the
individual memories of each cluster node.
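A corresponding sketch (again not from the slides) of the storage
arithmetic - total bytes for an n x n double-precision matrix, and the
share per node when it is distributed evenly over the cluster:

/* Minimal sketch: storage for an n x n double-precision matrix and
   the per-node share when block-distributed over a cluster.  The
   matrix order and node count are the slide's assumed values. */
#include <stdio.h>

int main(void)
{
    long long n     = 400000;                      /* matrix order        */
    long long nodes = 1000;                        /* cluster nodes       */
    double bytes    = (double)n * (double)n * 8.0; /* 8 bytes per element */
    double gbytes   = bytes / (1024.0 * 1024.0 * 1024.0);

    printf("total storage: %.3g Bytes (about %.0f GBytes)\n", bytes, gbytes);
    printf("per node (of %lld): %.2f GBytes\n", nodes, gbytes / nodes);
    return 0;
}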
So ...

Scaling Concepts

Inherent Limitations

Parallel Efficiency
Let t_S denote the execution time on a single processor, and let f be
the fraction of the work that is inherently serial. If the remaining
fraction (1 - f) is spread evenly over p processors, the execution
time on p processors is

  t_p(p) = f t_S + (1 - f) t_S / p,

the parallel speedup is

  S(p) = t_S / t_p(p),

and the parallel efficiency is

  E(p) = S(p) / p = t_S / (p t_p(p)).
Amdahl's Law

Substituting t_p(p) into the definition of the speedup gives Amdahl's Law:

  S(p) = t_S / (f t_S + (1 - f) t_S / p) = p / (1 + f (p - 1)),

so that

  lim_{p → ∞} S(p) = 1 / f.

G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference
Proceedings 30 (AFIPS Press, Atlantic City, NJ), 483-485, 1967.
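As a quick illustration (a minimal sketch, not from the original
slides), the following C program tabulates the Amdahl speedup
S(p) = p / (1 + f (p - 1)) and efficiency E(p) = S(p)/p for the same
serial fractions f plotted in the figure below:

/* Minimal sketch: evaluate Amdahl's Law S(p) = p / (1 + f*(p-1)) and
   the corresponding efficiency E(p) = S(p)/p for several serial
   fractions f.  The f values mirror those in the figure. */
#include <stdio.h>

int main(void)
{
    const double f[] = {0.001, 0.01, 0.02, 0.05, 0.1};
    const int    nf  = sizeof(f) / sizeof(f[0]);

    for (int p = 1; p <= 256; p *= 2) {
        printf("p = %3d :", p);
        for (int j = 0; j < nf; j++) {
            double S = p / (1.0 + f[j] * (p - 1));
            printf("  S=%7.2f (E=%.2f)", S, S / p);
        }
        printf("\n");
    }
    return 0;
}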
[Figure: Amdahl's Law - maximum speedup MAX(S(p)) versus number of
processors p (1 to 256), for serial fractions f = 0.001, 0.01, 0.02,
0.05, and 0.1.]

A Practical Example

Suppose the computational work divides evenly among the processors,

  t_comp(p) = t_comp(1) / p,

but each step also incurs a communication overhead t_comm. The speedup
is then

  S(p) = t_comp(1) / (t_comm + t_comp(p)) = p / (1 + t_comm / t_comp(p)).
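A small numeric check (a sketch under the assumptions above, with
purely illustrative values for t_comm and t_comp(1)) shows how the
communication-to-computation ratio caps the achievable speedup:

/* Minimal sketch: speedup with a fixed per-step communication cost,
   S(p) = p / (1 + t_comm/t_comp(p)), where t_comp(p) = t_comp(1)/p.
   The times below are illustrative, not measured values. */
#include <stdio.h>

int main(void)
{
    double t_comp1 = 100.0; /* total compute time per step (arbitrary units) */
    double t_comm  = 0.5;   /* fixed communication time per step (assumed)   */

    for (int p = 1; p <= 256; p *= 2) {
        double t_comp_p = t_comp1 / p;
        double S = p / (1.0 + t_comm / t_comp_p);
        printf("p = %3d   S(p) = %6.2f\n", p, S);
    }
    return 0;
}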
Gustafson's Law

We assumed that the problem size was fixed, which is (very) often not
the case. Assume instead that the problem size is scaled so that the
parallel run time t_p is held constant.

Now let t_p be held constant as p is increased. If f is the serial
fraction of the (scaled) parallel run, the equivalent serial time is
f t_p + (1 - f) p t_p, and the scaled speedup is

  S_s(p) = t_S / t_p,
         = (f t_p + (1 - f) p t_p) / t_p,
         = f + (1 - f) p,
         = p + (1 - p) f,

which grows linearly in p rather than saturating at 1/f.
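For contrast with Amdahl's fixed-size result, a minimal sketch (not
from the original slides) that evaluates both laws for the same serial
fraction f:

/* Minimal sketch: Gustafson's scaled speedup S_s(p) = f + (1 - f)*p,
   compared with Amdahl's fixed-size speedup for the same f. */
#include <stdio.h>

int main(void)
{
    double f = 0.05;   /* illustrative serial fraction */

    printf("%6s %12s %12s\n", "p", "Amdahl", "Gustafson");
    for (int p = 1; p <= 256; p *= 2) {
        double amdahl    = p / (1.0 + f * (p - 1));
        double gustafson = f + (1.0 - f) * p;
        printf("%6d %12.2f %12.2f\n", p, amdahl, gustafson);
    }
    return 0;
}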
Basic Terminology

Scalability

Definition (Scalable): An algorithm is scalable if there is a minimal
nonzero efficiency as p → ∞ and the problem size is allowed to take on
any value.

There are many parallel programming models, but roughly they break
down into the following categories:

Threaded models: e.g., OpenMP or Posix threads as the
application programming interface (API) - see the OpenMP sketch
following this list

Message passing models: e.g., MPI (the Message Passing Interface)

Partitioned global address space (PGAS) models: e.g., UPC and
Co-Array Fortran
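A minimal OpenMP sketch (illustrative, not from the original slides)
of the threaded model - the compiler splits the loop across threads
when built with OpenMP support (e.g. gfortran/gcc -fopenmp, ifort
-openmp):

/* Minimal OpenMP sketch: each thread accumulates a private partial
   sum over its share of the loop, and OpenMP combines the results. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int N = 1000000;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += 1.0 / (i + 1.0);
    }

    printf("harmonic sum = %f (up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}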
Thread Models

[Figure: threaded execution model - a single a.out process with its
data spawns threads 1 through N, which execute concurrently over time
while sharing the process's data.]
[Figure: message passing execution model - separate copies of a.out
(task 0 through task N-1), each with its own PID and private data,
run concurrently over time and exchange information via message
send/recv operations.]
There are two basic ways to think about decomposing a problem for
parallel execution:

Tasks: How do I reduce this problem into a set of tasks that can
be executed concurrently?

Data: How do I take the key data and represent it in such a way
that large chunks can be operated on independently (and
thus concurrently)? (A block-decomposition sketch follows.)
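A minimal sketch (not from the original slides) of the data
decomposition idea: each of P workers determines the contiguous block
of a length-N array that it owns, using only its rank:

/* Minimal sketch: block data decomposition.  Worker 'rank' (0..P-1)
   owns indices [lo, hi) of an array of length N; the first N % P
   workers get one extra element when N is not evenly divisible. */
#include <stdio.h>

static void block_range(long N, int P, int rank, long *lo, long *hi)
{
    long base = N / P;       /* minimum chunk size */
    long rem  = N % P;       /* leftover elements  */
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

int main(void)
{
    long N = 100;            /* total number of elements */
    int  P = 7;              /* number of workers        */

    for (int rank = 0; rank < P; rank++) {
        long lo, hi;
        block_range(N, P, rank, &lo, &hi);
        printf("rank %d owns [%3ld, %3ld)  (%ld elements)\n",
               rank, lo, hi, hi - lo);
    }
    return 0;
}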
Dependencies
Auto-Parallelization

Auto-Vectorization
OpenMP

OpenMP Availability
Version    Invocation (example)
3.1        ifort -openmp -openmp_report2 ...
3.0        pgf90 -mp ...
2.5        gfortran -fopenmp ...
3.0        gfortran -fopenmp ...
3.1        gfortran -fopenmp ...

MPI
Platform        Version (+MPI-2)
Linux IA64      1.2+(C++,MPI-I/O)
Linux x86_64    1.2+(C++,MPI-I/O), 2.x (various)
I am starting to favor the commercial Intel MPI for its ease of use,
especially in terms of supporting multiple networks/protocols.
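For reference, a minimal MPI sketch (not from the original slides)
showing the basic start-up, rank, and size calls that any of the
implementations above provide; run with, e.g., mpirun -np 4 ./a.out:

/* Minimal MPI example: each process reports its rank and the total
   number of processes.  Compile with an MPI wrapper, e.g. mpicc. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start up the MPI runtime  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down cleanly         */
    return 0;
}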
Simple example of some UPC syntax:
shared int all_hits[THREADS];
...
...
for (i = 0; i < my_trials; i++) my_hits += hit();
all_hits[MYTHREAD] = my_hits;
upc_barrier;
if (MYTHREAD == 0) {
   total_hits = 0;
   for (i = 0; i < THREADS; i++) {
      total_hits += all_hits[i];
   }
   pi = 4.0*((double)total_hits)/((double)trials);
   printf("PI estimated to %10.7f from %d trials on %d threads.\n",
          pi, trials, THREADS);
}
Co-Array Fortran
REAL, DIMENSION(N)[*] :: X, Y
X = Y[PE]             ! get from Y[PE]
Y[PE] = X             ! put into Y[PE]
Y[:] = X              ! broadcast X
Y[LIST] = X           ! broadcast X over subset of PEs in array LIST
Z(:) = Y[:]           ! collect all Y
S = MINVAL(Y[:])      ! min (reduce) all Y
B(1:M)[1:N] = S       ! S scalar, promoted to array of shape (1:M,1:N)
Libraries

BLAS

Vendor BLAS
Vendor    Library
AMD       ACML
Apple     Velocity Engine
Compaq    CXML
Cray      libsci
HP        MLIB
IBM       ESSL
Intel     MKL
NEC       PDLIB/SX
SGI       SCSL
SUN       Sun Performance Library
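To show what calling one of these libraries looks like in practice, a
minimal sketch (not from the original slides), assuming the portable
CBLAS interface (cblas.h) provided by, e.g., Intel MKL or the Netlib
reference implementation:

/* Minimal sketch: dot product of two vectors via the CBLAS interface
   (cblas_ddot).  Link against the BLAS of your choice, e.g. -lblas
   or (for MKL) -lmkl_rt. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {4.0, 3.0, 2.0, 1.0};

    /* N = 4 elements, unit stride in both vectors */
    double dot = cblas_ddot(4, x, 1, y, 1);

    printf("x . y = %f\n", dot);   /* expect 20.0 */
    return 0;
}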
Performance Example

[Figure: DDOT performance on a U2 compute node - MFlop/s versus vector
length (in 8-byte words) for Intel cMKL 8.1.1 and the reference BLAS
(-lblas), with the L1 cache, L2 cache, and main-memory regimes marked.]

LAPACK

Built on top of the BLAS, LAPACK provides routines for (among other
things) least-squares solutions and eigenvalue problems.

www.netlib.org/lapack
ScaLAPACK
Programming Costs
Consider:
Complexity: parallel codes can be orders of magnitude more complex
(especially those using message passing) - you have to
plan for and deal with multiple instruction/data streams
Portability: parallel codes frequently have long lifetimes (proportional
to the amount of effort invested in them) - all of the serial
application porting issues apply, plus the choice of
parallel API (MPI, OpenMP, and POSIX threads are
currently good portable choices, but implementations of
them can differ from platform to platform)
Resources: overhead for parallel computation can be significant for
smaller calculations
Scalability: limitations in hardware (CPU-memory speed and
contention, for one example) and in the parallel algorithms
will limit speedups - all codes eventually reach a point of
diminishing returns
Examples

[Figure: parallel speedup versus number of processors (1 to 256,
ppn=2) for the JAC (Joint Amber-CHARMM) benchmark - DHFR protein, 7182
residues, 23558 atoms, 7023 ... - comparing the ch_p4 and ch_gm
devices, with an inset table of measured MD seconds per step.]

Communication Costs

Communications
Communication Considerations
Communication needs between parallel processes affect parallel
programs in several important ways:
Cost: there is always overhead when communicating:
U2 Cluster Interconnects

[Figure: U2 cluster interconnect comparison - ch_mx, ch_gm, ch_p4, and
DAPL-QDR-IB - versus message length in Bytes; one panel shows
bandwidth in MByte/s.]
[Figure: MPI alltoall time for a 4 KB buffer, t_alltoall(4KB) in μsec,
versus number of MPI processes (10 to 1000), using Intel MPI on
12-core Xeon E5645 nodes with Qlogic QDR IB, for ppn = 1, 2, 4, 6, and
12.]