Lecture HPC 11 Parallelization
• Moore's Law (1965): the transistor density of semiconductor chips doubles roughly every 18 months.
• Inherent limits on serial machines are imposed by the speed of light (30 cm/ns) and the transmission limit of
copper wire (9 cm/ns): it is virtually impossible to build a serial teraflop machine with the current approach.
• Furthermore, the real bottleneck is often memory access (RAM latency has improved by only around 10% a
year).
1
Figure: number of transistors per microprocessor, 1971–2016 (log scale, from 1,000 to 10,000,000,000).
2
Cray-1, 1975
3
IBM Summit, 2018
4
Parallel programming
• Two issues:
1. Algorithms.
2. Coding.
5
Some references
• Introduction to High Performance Computing for Scientists and Engineers by Georg Hager and
Gerhard Wellein.
• Parallel Computing for Data Science: With Examples in R, C++ and CUDA, by Norman Matloff.
• Parallel Programming: Concepts and Practice by Bertil Schmidt, Jorge González-Domínguez, and
Christian Hundt.
• Structured Parallel Programming: Patterns for Efficient Computation by Michael McCool, James
Reinders, and Arch Robison.
6
When do we parallelize? I
• Scalability: does performance keep improving as we add more processors and increase the size of the problem?
• Granularity: how much work does each parallel task perform between communication/synchronization points (coarse vs. fine grain)?
7
Granularity
8
When do we parallelize? II
• Whether or not the problem is easy to parallelize may depend on the way you set it up.
9
Example I: value function iteration
1. Choose a grid of capital {k1 , . . . , kJ }.
2. Start from a guess V^n(k) of the value function
   V(k) = max_{k'} {u(c) + βV(k')}   s.t.   c = k^α + (1 − δ)k − k'.
3. Send the problem
   V^{n+1}(k1) = max_{k'} {u(c) + βV^n(k')}   s.t.   c = k1^α + (1 − δ)k1 − k'
   to processor 1 to get V^{n+1}(k1).
4. We can send a similar problem for each k to each processor.
5. When all processors are done, we gather the V^{n+1}(k) back.
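A minimal C++ sketch of one parallel Bellman update, assuming (purely for illustration) log utility and the parameter values written in the code; none of these numbers come from the slides:

#include <algorithm>
#include <cmath>
#include <future>
#include <vector>

// One Bellman update at grid point i, given the current guess Vn on the grid.
double bellman(int i, const std::vector<double>& kgrid, const std::vector<double>& Vn) {
  const double alpha = 0.33, beta = 0.96, delta = 0.1;   // illustrative parameters
  double best = -1e10;
  for (std::size_t j = 0; j < kgrid.size(); j++) {
    double c = std::pow(kgrid[i], alpha) + (1.0 - delta) * kgrid[i] - kgrid[j];
    if (c > 0.0) best = std::max(best, std::log(c) + beta * Vn[j]);
  }
  return best;
}

int main() {
  std::vector<double> kgrid = {0.5, 1.0, 1.5, 2.0};               // steps 1-2: grid and guess
  std::vector<double> Vn(kgrid.size(), 0.0), Vn1(kgrid.size());

  // Steps 3-4: send the problem for each k_i to its own asynchronous task.
  std::vector<std::future<double>> tasks;
  for (int i = 0; i < (int)kgrid.size(); i++)
    tasks.push_back(std::async(std::launch::async, bellman, i, std::cref(kgrid), std::cref(Vn)));

  // Step 5: gather the V^{n+1}(k_i) back.
  for (std::size_t i = 0; i < kgrid.size(); i++) Vn1[i] = tasks[i].get();
  return 0;
}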
10
Example II: random walk Metropolis-Hastings
• Draw θ ∼ P(·).
• How?
1. Given a state of the chain θn−1, we generate a proposal:
   θ∗ = θn−1 + λε,   ε ∼ N(0, 1)
2. We compute:
   α = min {1, P(θ∗)/P(θn−1)}
3. We set:
   θn = θ∗   w.p. α
   θn = θn−1   w.p. 1 − α
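A minimal serial sketch of the algorithm in C++, assuming (purely for illustration) that the target P(·) is a standard normal density and that λ = 0.5:

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

int main() {
  // Illustrative target: P(.) proportional to a standard normal density.
  auto logP = [](double theta) { return -0.5 * theta * theta; };

  std::mt19937 gen(0);
  std::normal_distribution<double> eps(0.0, 1.0);
  std::uniform_real_distribution<double> unif(0.0, 1.0);

  const double lambda = 0.5;   // scale of the proposal (a tuning parameter)
  double theta = 0.0;          // initial state of the chain
  std::vector<double> draws;

  for (int n = 1; n <= 10000; n++) {
    double proposal = theta + lambda * eps(gen);                            // 1. proposal
    double alpha = std::min(1.0, std::exp(logP(proposal) - logP(theta)));   // 2. acceptance probability
    if (unif(gen) < alpha) theta = proposal;                                // 3. accept w.p. alpha
    draws.push_back(theta);
  }
  return 0;
}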
Example III: Life-cycle model
• Households solve:
  V(t, e, x) = max_{c, x'} { c^(1−σ)/(1 − σ) + β E V(t + 1, e', x') }
  s.t.
  c + x' ≤ (1 + r)x + ew
  x' ≥ 0
  t ∈ {1, . . . , T}
12
Computing the model
1. Choose grids for assets X = {x1 , . . . , xnx } and shocks E = {e1 , . . . , ene }.
2. Backwards induction:
2.1 For t = T and every xi ∈ X and ej ∈ E, solve the static problem:
    V(t, ej, xi) = max_{c} u(c)   s.t.   c ≤ (1 + r)xi + ej w
2.2 For t < T and every xi ∈ X and ej ∈ E, solve:
    V(t, ej, xi) = max_{c, x'} u(c) + β E[V(t + 1, e', x')]   s.t.
    c + x' ≤ (1 + r)xi + ej w
    P(e' ∈ E | ej) = Γ(ej)
13
Code Structure
for(age = T:-1:1)
  for(ix = 1:nx)
    for(ie = 1:ne)
      VV = -10^3;
      for(ixp = 1:nx)
        expected = 0.0;
        if(age < T)
          for(iep = 1:ne)
            expected = expected + P[ie, iep]*V[age+1, ixp, iep];
          end
        end
        # consumption and utility for this choice of x' (the grid and parameter
        # names xgrid, egrid, ssigma, bbeta are illustrative):
        cons = (1 + r)*xgrid[ix] + egrid[ie]*w - xgrid[ixp];
        utility = (cons^(1 - ssigma))/(1 - ssigma) + bbeta*expected;
        if(cons <= 0)
          utility = -10^5;
        end
        if(utility >= VV)
          VV = utility;
        end
      end
      V[age, ix, ie] = VV;
    end
  end
end
14
In parallel
1. Set t = T .
2. Given t, the computation of V(t, ej, xi) is independent of the computation of V(t, ej′, xi′), for i ≠ i′,
j ≠ j′.
3. One processor can compute V(t, ej, xi) while another processor computes V(t, ej′, xi′).
4. When the different processors are done computing V(t, ej, xi), ∀xi ∈ X and ∀ej ∈ E, set t = t − 1.
5. Go back to step 3 until t = 1 has been computed.
Note that the problem is not parallelizable on t. The computation of V (t, e, x) depends on V (t + 1, e, x)!
15
Computational features of the model
17
Parallel execution of the code
18
Many workers instead of one
Figure 1: 1 Core Used for Computation
19
Parallelization limits
20
Costs of parallelization
• Amdahl's Law: the speedup of a program using multiple processors in parallel computing is limited by
the time needed for the sequential fraction of the program (see the formula below).
• Costs:
• Starting a thread or a process/worker.
• Synchronizing.
• Load imbalance: for large machines, it is often difficult to use more than 10% of their computing power.
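Formally, if a fraction p of the run time can be parallelized and N processors are used, Amdahl's Law bounds the speedup by

S(N) = 1 / ((1 − p) + p/N) ≤ 1 / (1 − p),

so with p = 0.95, for example, the speedup can never exceed 20, no matter how many processors are added.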
21
Parallelization limits on a laptop
22
Multi-core processors
23
Know your limits!
• Spend some time getting to know your laptop's limits and the problem to parallelize.
• In our life-cycle problem with many grid points, parallelization improves performance almost linearly,
up to the number of physical cores.
• Parallelizing over different threads of the same physical core does not improve speed if each thread
uses 100% of core capacity.
• For computationally heavy problems, adding more threads than cores available may even reduce
performance.
24
Your laptop is not the limit!
• Cluster servers.
• Check: https://aws.amazon.com/ec2/pricing/
25
Running an instance on AWS
• Go to: https://console.aws.amazon.com/
• Click on EC2.
• Click on Launch Instance and follow the steps (for example, choosing Ubuntu Server 18.04).
• Public key:
• Download key.
• Run instance.
26
Working on AWS instance
On Unix/Linux terminal:
$ ssh -i "Harvard_Spring_2018.pem"
[email protected]
27
Parallelization
Programming modes I
1. Implicit parallelization:
1.1 Julia.
1.2 Python.
1.3 R.
1.4 Matlab.
2. Explicit parallelization:
2.1 OpenMP.
2.2 MPI.
28
Programming modes II
2.2 UPC.
2.3 X10.
2.4 Chapel.
6. Hybrids.
29
Flynn’s taxonomy
30
Two ways of parallelizing
1. for loop.
2. Map.
• In either case, create a function that depends on the state variables over which the problem can be parallelized:
• In our example, we have to create a function that computes the value function for a given set of state
variables.
31
Julia
Parallelization in Julia - for loops
32
Parallelization in Julia - for loops
1. Load the Distributed package:
using Distributed
2. Add worker processes:
addprocs(5)
3. Remove workers:
rmprocs(2,3,5)
4. Check the active workers:
workers()
33
Parallelization in Julia - for loops
1. Load the required packages:
using Distributed
using SharedArrays
2. Declare variables that are used inside the parallel for loop but not modified by the parallel iterations
with @everywhere, so they exist on every worker:
@everywhere nx = 1500;
3. Declare variables used inside the parallel for loop that are modified inside parallel iterations as
SharedArray:
tempV = SharedArray{Float64}(ne*nx);
34
Parallelization in Julia - for loops
35
Parallelization in Julia - for loops
@everywhere function value(currentState)
    ind = currentState.ind
    age = currentState.age
    # ...
    VV = -10.0^3;
    ixpopt = 0;
    # ... loop over ixp and update VV ...
    return(VV);
end
36
Parallelization in Julia - for loops
6. To parallelize a for loop, add @distributed before the for statement:
7. To synchronize before the code continues its execution, add @sync before the @distributed for
statement:
37
Parallelization in Julia - for loops
# Option 1: parallelize the inner loop over ix
nx = 350;
ne = 9;
for(ie = 1:ne)
  @sync @distributed for(ix = 1:nx)
    # ...
  end
end

# Option 2: parallelize the inner loop over ie
nx = 350;
ne = 9;
for(ix = 1:nx)
  @sync @distributed for(ie = 1:ne)
    # ...
  end
end
38
Parallelization in Julia - for loops
• OR convert the problem so all state variables are computed by iterating over a one-dimensional loop:
@sync @distributed for(ind = 1:(ne*nx))
  ix = convert(Int, ceil(ind/ne));
  ie = convert(Int, floor(mod(ind-0.05, ne))+1);
  # ...
end
39
Parallelization in Julia - Performance
Figure 3: Julia - 1 core used for computation
40
Parallelization in Julia - Performance
Figure: computation time (s, 0–13) against # of cores (1–8); the number of physical cores is marked.
41
Parallelization in Julia - for loops
42
Parallelization in Julia - Map
43
Parallelization in Julia - Map
for(age = T:-1:1)
  pars = [modelState(ix, age, ..., w, r) for ix in 1:nx];
  s = pmap(value, pars);
  for(ix = 1:nx)
    V[age, ix] = s[ix];
  end
end
44
Parallelization in Julia - Performance
Figure: computation time (s, 0–50) against # of cores (1–8); the number of physical cores is marked.
45
Parallelization in Julia - Final advice
• Wrapping the value function computation for every state in a function might significantly increase speed (even more
than parallelizing).
46
Python
Parallelization in Python - Map
2. Create a class that stores the state variables needed by the value function:
class modelState(object):
    def __init__(self, age, ix, ...):
        self.age = age
        self.ix  = ix
        # ...
47
Parallelization in Python
3. Define a function that computes the value for a given input state of type modelState:
def value_func(states):
    nx  = states.nx
    age = states.age
    # ...
    VV = math.pow(-10, 3)
    for ixp in range(0, nx):
        # ...
    return [VV]
48
Parallelization in Python
4. Use the Parallel and delayed functions from joblib to compute the value at every state:
results = Parallel(n_jobs=num_cores)(delayed(value_func)
    (modelState(ix, age, ..., w, r)) for ix in range(0, nx*ne))
49
Parallelization in Python
5. Life-cycle model:
50
Parallelization in Python - Performance
Figure: computation time (s, 0–2,600) against # of cores (1–8); the number of physical cores is marked.
51
R
Parallelization in R - Map
library("parallel")
2. Create the structure of parameters for the function that computes the value for a given state as a list:
52
Parallelization in R
3. Create the function that computes the value for a given state:
value = function(x){
  age = x$age
  ix  = x$ix
  ...
  VV = -10^3;
  for(ixp in 1:nx){
    # ...
  }
  return(VV);
}
53
Parallelization in R
cl <- makeCluster(no_cores)
5. Use function parLapply(cl, states, value) to compute value at every state in states with cl
cores:
for(age in T:1){
  states = lapply(1:nx, ...)
  s = parLapply(cl, states, value)
  for(ix in 1:nx){
    V[age, ix] = s[[ix]][1]
  }
}
54
Parallelization in R - Performance
Figure: computation time (s, 0–1,200) against # of cores (1–8); the number of physical cores is marked.
55
Matlab
Parallelization in Matlab - for loop
• Open a parallel pool of workers:
parpool(6)
• Replace for with parfor in the loop whose iterations are independent.
56
Parallelization in Matlab - Performance
Figure: computation time (s, 0–55) against # of cores (1–8); the number of physical cores is marked.
57
Parallelization in Matlab
• Extremely easy.
58
OpenMP
OpenMP I
• Tutorial: https://computing.llnl.gov/tutorials/openMP/
• Using OpenMP: Portable Shared Memory Parallel Programming by Barbara Chapman, Gabriele Jost,
and Ruud van der Pas.
• Fast to learn, reduced set of instructions, easy to code, but you need to worry about contention and
cache coherence.
59
OpenMP II
• API for multi-processor/core, shared memory machines defined by a group of major computer
hardware and software vendors.
60
OpenMP III
• Race conditions: you can impose fence conditions and/or make some data private to the thread.
61
Fork-join
62
Parallelization in C++ using OpenMP
1. Add the flag at compilation:
-fopenmp
2. Add #pragma omp directives to the regions of the code to be parallelized.
3. Select the number of threads with the environment variable:
export OMP_NUM_THREADS=32
4. We can always recompile without the flag and compiler directives are ignored.
5. Most implementations (although not the standard!) allow for nested parallelization and dynamic
thread changes.
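For instance, a minimal parallel loop (a hypothetical example; the #pragma is the only OpenMP-specific line, and it is simply ignored if the code is compiled without -fopenmp):

#include <vector>

int main() {
  const int nx = 1500;
  std::vector<double> V(nx);

  // Iterations are independent across grid points, so they can be split across threads.
  #pragma omp parallel for
  for (int ix = 0; ix < nx; ix++) {
    V[ix] = 0.0;  // ... solve the value function at grid point ix ...
  }
  return 0;
}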
Parallelization in C++ using OpenMP - Performance
Figure: computation time (s, 0–6) against # of cores (1–8); the number of physical cores is marked.
64
Parallelization in Rcpp using OpenMP
2. In the C++ code, add the following line to any function that you want to import from R:
// [[Rcpp::export]]
library("Rcpp")
65
Parallelization in Rcpp using OpenMP
Sys.setenv("OMP_NUM_THREADS"="8")
Sys.setenv("PKG_CXXFLAGS"=" -fopenmp")
sourceCpp("my_file.cpp")
66
MPI
MPI I
• Message Passing Interface (MPI) is a standardized and portable message-passing system based on the
consensus of the MPI Forum.
• Tutorial: https://computing.llnl.gov/tutorials/mpi/
• A couple of references:
1. Using MPI : Portable Parallel Programming with the Message Passing Interface (2nd edition) by William
Gropp, Ewing L. Lusk, and Anthony Skjellum.
67
MPI II
• Bindings for C++ and Fortran. Also for Python, Julia, R, and other languages.
• Harder to learn (MPI 3.0 standard has more than 440 routines) and code, but extremely powerful ⇒
used for state-of-the-art computations.
68
Figure: point-to-point communication — processes 1–4 exchanging messages mα, mβ, and mγ through matching Send/Receive calls.
In particular, MPI can broadcast (transfer an array from the master to every thread), scatter (send different parts of an array to different threads), and gather (collect pieces from all the threads back on the master).
69
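A minimal sketch of a broadcast using the C bindings (the array contents are illustrative):

#include "mpi.h"
#include <iostream>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double data[3] = {0.0, 0.0, 0.0};
  if (rank == 0) { data[0] = 1.0; data[1] = 2.0; data[2] = 3.0; }  // master fills the array
  // Every process receives the master's copy of data:
  MPI_Bcast(data, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  std::cout << "Rank " << rank << " got " << data[0] << "\n";

  MPI_Finalize();
  return 0;
}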
70
MPI III
71
Example code
#include "mpi.h"
#include <iostream>
int main( int argc, char *argv[] ){
int rank, size;
MPI::Init(argc, argv);
rank = MPI::COMM_WORLD.Get_rank();
size = MPI::COMM_WORLD.Get_size();
std::cout << "I am " << rank << " of " << size << "\n";
MPI::Finalize();
return 0;
}
72
Routines
• Communication:
3. Compute and move (sum, product, max of, . . . ) data on many processors.
• Synchronization.
• Enquiries:
73
MPI derived types
1. MPI_CHAR
2. MPI_DOUBLE_PRECISION
3. MPI_C_DOUBLE_COMPLEX
74
Parallelization in C++ using MPI - Performance
Figure: computation time (s, 0–7) against # of cores (1–8); the number of physical cores is marked.
75
GPUs
Big difference
• Intermediate: Coprocessors.
76
77
78
79
Floating-point operations per second for the CPU and GPU
80
When to go to the GPU?
Remember: a GPU is attached to the CPU via a PCI (Peripheral Component Interconnect) Express bus.
81
CUDA, OpenCL, and OpenACC
1. CUDA.
2. OpenCL.
3. OpenACC.
82
References
• Tapping the supercomputer under your desk: Solving dynamic equilibrium models with graphics
processors.
• Programming Massively Parallel Processors: A Hands-on Approach (3rd ed.) by David B. Kirk and Wen-mei W. Hwu.
• Heterogeneous Computing with OpenCL 2.0 by David R. Kaeli and Perhaad Mistry.
• OpenACC for Programmers: Concepts and Strategies by Sunita Chandrasekaran and Guido
Juckeland.
83
CUDA
CUDA
84
CUDA: advantages
• Brings C++11 (since version 7) and C++14 (since version 9) language features, albeit only a subset is
available in device code.
• Rapidly expanding third-party libraries: OpenCV machine learning, CULA linear algebra, HIPLAR linear
algebra for R.
• Enter Thrust:
3. A few lines of code to perform GPU-accelerated sort, scan, transform, and reduction operations
85
Example code
86
CUDA: disadvantages
• Limited community, most information comes from Nvidia and third-party developers.
87
Additional resources
• Books:
• References:
1. https://developer.nvidia.com/cuda-zone.
2. https://developer.nvidia.com/thrust.
3. https://devblogs.nvidia.com/
88
Thrust
Thrust I
• Thrust brings the power of GPUs to the masses (at least those familiar with C++).
• Thrust is a parallel algorithms library in the spirit of C++’s Standard Template Library.
• It targets tasks that:
1. “can be implemented efficiently without a detailed mapping to the target architecture,” and
2. “don’t merit or won’t receive (for whatever reason) significant optimization attention from the
programmer.”
• The idea is that the programmer spends more time on the problem, rather than on the
implementation of the algorithms solving the problem.
89
Thrust II
• Thrust incorporates tuned implementations for each backend: CUDA, OpenMP, and TBB.1
• This results in portability across parallel frameworks and hardware architecture without losing
performance.
1 Intel’s TBB – Threading Building Blocks – is a C++ template library for task parallelism.
90
Thrust III
• Thrust is “entirely defined in header files.” Hence, each modification in code requires recompilation.
• Documentation is limited and mostly based on examples. But it has improved over the years.
• Although the last release, version 1.8.1, dates back to 2015, Nvidia seems to be working on an
update.2
2 https://www.reddit.com/r/cpp/comments/7erub1/anybody_still_using_thrust/
91
Example code
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <algorithm>
#include <cstdlib>
int main(void){
// generate random data serially
thrust::host_vector<int> h_vec(100);
std::generate(h_vec.begin(), h_vec.end(), rand);
// transfer to device and compute sum
thrust::device_vector<int> d_vec= h_vec;
int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0,
thrust::plus<int>());
return 0;
}
92
Thrust IV
• Transformations are “algorithms that apply an operation to each element in a set of input ranges
and store the result in a destination range.”
• Compute Y = −X:
• Compute Y = X mod 2:
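A sketch of both transformations with thrust::transform, following the transformations example in the Thrust documentation (the vector contents are illustrative):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>
#include <thrust/functional.h>

int main(void) {
  thrust::device_vector<int> X(10);
  thrust::device_vector<int> Y(10);
  thrust::sequence(X.begin(), X.end());   // X = 0, 1, 2, ..., 9

  // Y = -X
  thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());

  // Y = X mod 2
  thrust::fill(Y.begin(), Y.end(), 2);
  thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), thrust::modulus<int>());
  return 0;
}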
94
Thrust V – Algorithms
3. thrust::max_element,
4. thrust::inner_product,
95
Thrust VI – Algorithms
• Thrust offers many more algorithms for, for example, reordering, sorting, and prefix-sums.
2. transform_iterator,
• A zip_iterator can be used to create “a virtual array of 3d vectors” that can be fed to other algorithms.
96
Additional resources
• Books:
• References:
1. https://devblogs.nvidia.com/expressive-algorithmic-programming-thrust/.
2. https://github.com/thrust/thrust/wiki/Quick-Start-Guide.
3. http://thrust.github.io/.
4. https://docs.nvidia.com/cuda/thrust/index.html
97
OpenACC
OpenACC I
• The main idea is to take existing serial code, say in C++, and give hints to the compiler about what should be
parallelized.
• OpenACC is a model designed to allow parallel programming across different computer architectures
with minimum effort by the developer. Portability means that the code should be independent of
hardware/compiler.
• The OpenACC specification supports C/C++ and Fortran and runs on CPUs and GPUs.
98
OpenACC II
• OpenACC is built around a very simple set of directives, very similar in design to OpenMP: OpenACC
uses the fork-join paradigm.
• The same program can be compiled to be executed in parallel using the CPU or the GPU (or mixing
them), depending on the hardware available.
• Communication between the master and worker threads in the parallel pool is automatically handled,
although the user can state directives to grant explicit access to variables, and to transfer objects
from the CPU to the GPU when required.
• The OpenACC website describes multiple compilers, profilers, and debuggers for OpenACC.
• We use the PGI Community Edition compiler. The PGI compiler can be used with the CPUs and
with NVIDIA Tesla GPUs. In this way, it suffices to select different flags at compilation time to
execute the code in the CPU or the GPU.
99
Using OpenACC I: Analyze
• Use a profiler to check where your code spends a lot of time. Typical bottlenecks are loops.
• Check if there is an optimized library that implements some of your code: cuBLAS, Armadillo.
100
Using OpenACC II: Parallelize
• Expose your code to parallelism starting with functions/operations that are time consuming on CPU.
• To execute a kernel:
• To parallelize a loop (both directives are sketched below):
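A sketch of the two directives on a hypothetical SAXPY-style loop (the function names and data clauses are illustrative, not from the slides):

void saxpy_kernels(int n, double a, const double* x, double* y) {
  // Let the compiler generate kernels for this whole region:
  #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
  }
}

void saxpy_parallel(int n, double a, const double* x, double* y) {
  // Explicitly mark this loop as parallel:
  #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
  }
}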
101
Example
• By choosing the appropriate compilation flag, we can compile the code to be executed in parallel only
in the CPU or in the GPU.
• To compile the code to be executed by the CPU, we must include the -ta=multicore flag at
compilation. In addition, the -acc flag must be added to tell the compiler this is OpenACC code:
• If, instead, we want to execute the program with an NVIDIA GPU, we rely on the -ta=nvidia flag:
103
Using OpenACC III: Optimize
• Give information to the compiler about the parts that can be optimized: data management (minimize copying between
host and device).
104
Additional resources
• Books:
• References:
1. https://devblogs.nvidia.com/tag/openacc/.
2. https://www.openacc.org/.
105
Comparisons
Comparisons I
• Vectorizing in Matlab.
• Etc.
106
Comparisons II
• The comparisons regarding parallelization are specific to the packages used on these slides:
107
Advice
• Short-run:
MATLAB, Python, Julia, or R
• Medium-run:
Rcpp
• Long-run:
C++ with OpenMP, MPI, or GPUs
108
Figure: computation time (s) against # of cores (1–8) for Julia (parallel), Rcpp, OpenMP, MPI, CUDA, and OpenACC; the number of physical cores is marked.
109