Lecture HPC 11 Parallelization
• Moore's Law (1965): the transistor density of semiconductor chips doubles roughly every 18 months.
• Inherent limits on serial machines are imposed by the speed of light (30 cm/ns) and the transmission limit of
copper wire (9 cm/ns): it is virtually impossible to build a serial teraflop machine with the current approach.
• Furthermore, the real bottleneck is often memory access (RAM latency has improved by only around 10% a
year).
1
Figure: number of transistors per microprocessor, 1971–2016 (log scale, from 1,000 to 10,000,000,000).
2
Cray-1, 1975
3
IBM Summit, 2018
4
Parallel programming
• Two issues:
1. Algorithms.
2. Coding.
5
Some references
• Introduction to High Performance Computing for Scientists and Engineers by Georg Hager and
Gerhard Wellein.
• Parallel Computing for Data Science: With Examples in R, C++ and CUDA, by Norman Matloff.
• Parallel Programming: Concepts and Practice by Bertil Schmidt, Jorge González-Domínguez, and
Christian Hundt.
• Structured Parallel Programming: Patterns for Efficient Computation by Michael McCool, James
Reinders, and Arch Robison.
6
When do we parallelize? I
• Scalability: does performance keep improving as we add more processors and increase the size of the problem?
• Granularity: how much work does each parallel task perform between communication/synchronization points (coarse vs. fine grain)?
7
Granularity
8
When do we parallelize? II
• Whether or not the problem is easy to parallelize may depend on the way you set it up.
9
Example I: value function iteration
1. Choose a grid of capital {k1 , . . . , kJ }.
2. Start from a guess V^n(k) of the value function
   V(k) = max_{k'} {u(c) + βV(k')}   s.t.   c = k^α + (1 − δ)k − k'.
3. Send the problem
   V^{n+1}(k1) = max_{k'} {u(c) + βV^n(k')}   s.t.   c = k1^α + (1 − δ)k1 − k'
   to processor 1 to get V^{n+1}(k1).
4. We can send a similar problem for each k to each processor.
5. When all processors are done, we gather the V^{n+1}(k) back.
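A minimal C++ sketch of one parallel Bellman update, assuming (purely for illustration) log utility and the parameter values written in the code; none of these numbers come from the slides:

#include <algorithm>
#include <cmath>
#include <future>
#include <vector>

// One Bellman update at grid point i, given the current guess Vn on the grid.
double bellman(int i, const std::vector<double>& kgrid, const std::vector<double>& Vn) {
  const double alpha = 0.33, beta = 0.96, delta = 0.1;   // illustrative parameters
  double best = -1e10;
  for (std::size_t j = 0; j < kgrid.size(); j++) {
    double c = std::pow(kgrid[i], alpha) + (1.0 - delta) * kgrid[i] - kgrid[j];
    if (c > 0.0) best = std::max(best, std::log(c) + beta * Vn[j]);
  }
  return best;
}

int main() {
  std::vector<double> kgrid = {0.5, 1.0, 1.5, 2.0};               // steps 1-2: grid and guess
  std::vector<double> Vn(kgrid.size(), 0.0), Vn1(kgrid.size());

  // Steps 3-4: send the problem for each k_i to its own asynchronous task.
  std::vector<std::future<double>> tasks;
  for (int i = 0; i < (int)kgrid.size(); i++)
    tasks.push_back(std::async(std::launch::async, bellman, i, std::cref(kgrid), std::cref(Vn)));

  // Step 5: gather the V^{n+1}(k_i) back.
  for (std::size_t i = 0; i < kgrid.size(); i++) Vn1[i] = tasks[i].get();
  return 0;
}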
10
Example II: random walk Metropolis-Hastings
• Draw θ ∼ P(·).
• How?
1. Given a state of the chain θn−1, we generate a proposal:
   θ∗ = θn−1 + λε,   ε ∼ N(0, 1)
2. We compute:
   α = min {1, P(θ∗)/P(θn−1)}
3. We set:
   θn = θ∗   w.p. α
   θn = θn−1   w.p. 1 − α
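A minimal serial sketch of the algorithm in C++, assuming (purely for illustration) that the target P(·) is a standard normal density and that λ = 0.5:

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

int main() {
  // Illustrative target: P(.) proportional to a standard normal density.
  auto logP = [](double theta) { return -0.5 * theta * theta; };

  std::mt19937 gen(0);
  std::normal_distribution<double> eps(0.0, 1.0);
  std::uniform_real_distribution<double> unif(0.0, 1.0);

  const double lambda = 0.5;   // scale of the proposal (a tuning parameter)
  double theta = 0.0;          // initial state of the chain
  std::vector<double> draws;

  for (int n = 1; n <= 10000; n++) {
    double proposal = theta + lambda * eps(gen);                            // 1. proposal
    double alpha = std::min(1.0, std::exp(logP(proposal) - logP(theta)));   // 2. acceptance probability
    if (unif(gen) < alpha) theta = proposal;                                // 3. accept w.p. alpha
    draws.push_back(theta);
  }
  return 0;
}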
Example III: Life-cycle model
• Households solve:
  V(t, e, x) = max_{c, x'} { c^(1−σ)/(1 − σ) + β E V(t + 1, e', x') }
  s.t.
  c + x' ≤ (1 + r)x + ew
  x' ≥ 0
  t ∈ {1, . . . , T}
12
Computing the model
1. Choose grids for assets X = {x1 , . . . , xnx } and shocks E = {e1 , . . . , ene }.
2. Backwards induction:
2.1 For t = T and every xi ∈ X and ej ∈ E, solve the static problem:
    V(t, ej, xi) = max_{c} u(c)   s.t.   c ≤ (1 + r)xi + ej w
2.2 For t < T and every xi ∈ X and ej ∈ E, solve:
    V(t, ej, xi) = max_{c, x'} u(c) + β E[V(t + 1, e', x')]   s.t.
    c + x' ≤ (1 + r)xi + ej w
    P(e' ∈ E | ej) = Γ(ej)
13
Code Structure
for(age = T:-1:1)
  for(ix = 1:nx)
    for(ie = 1:ne)
      VV = -10^3;
      for(ixp = 1:nx)
        expected = 0.0;
        if(age < T)
          for(iep = 1:ne)
            expected = expected + P[ie, iep]*V[age+1, ixp, iep];
          end
        end
        # consumption and utility for this choice of x' (the grid and parameter
        # names xgrid, egrid, ssigma, bbeta are illustrative):
        cons = (1 + r)*xgrid[ix] + egrid[ie]*w - xgrid[ixp];
        utility = (cons^(1 - ssigma))/(1 - ssigma) + bbeta*expected;
        if(cons <= 0)
          utility = -10^5;
        end
        if(utility >= VV)
          VV = utility;
        end
      end
      V[age, ix, ie] = VV;
    end
  end
end
14
In parallel
1. Set t = T .
2. Given t, the computation of V(t, ej, xi) is independent of the computation of V(t, ej′, xi′), for i ≠ i′,
j ≠ j′.
3. One processor can compute V(t, ej, xi) while another processor computes V(t, ej′, xi′).
4. When the different processors are done computing V(t, ej, xi), ∀xi ∈ X and ∀ej ∈ E, set t = t − 1.
5. Go back to step 3 until t = 1 has been computed.
Note that the problem is not parallelizable on t. The computation of V (t, e, x) depends on V (t + 1, e, x)!
15
Computational features of the model
17
Parallel execution of the code
18
Many workers instead of one
Figure 1: 1 Core Used for Computation
19
Parallelization limits
20
Costs of parallelization
• Amdahl's Law: the speedup of a program using multiple processors in parallel computing is limited by
the time needed for the sequential fraction of the program (see the formula below).
• Costs:
• Starting a thread or a process/worker.
• Synchronizing.
• Load imbalance: for large machines, it is often difficult to use more than 10% of their computing power.
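Formally, if a fraction p of the run time can be parallelized and N processors are used, Amdahl's Law bounds the speedup by

S(N) = 1 / ((1 − p) + p/N) ≤ 1 / (1 − p),

so with p = 0.95, for example, the speedup can never exceed 20, no matter how many processors are added.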
21
Parallelization limits on a laptop
22
Multi-core processors
23
Know your limits!
• Spend some time getting to know your laptop's limits and the problem to parallelize.
• In our life-cycle problem with many grid points, parallelization improves performance almost linearly,
up to the number of physical cores.
• Parallelizing over different threads of the same physical core does not improve speed if each thread
uses 100% of core capacity.
• For computationally heavy problems, adding more threads than cores available may even reduce
performance.
24
Your laptop is not the limit!
• Cluster servers.
• Check: https://aws.amazon.com/ec2/pricing/
25
Running an instance on AWS
• Go to: https://console.aws.amazon.com/
• Click on EC2.
• Click on Launch Instance and follow the steps (for example, choosing Ubuntu Server 18.04).
• Public key:
• Download key.
• Run instance.
26
Working on AWS instance
On Unix/Linux terminal:
$ ssh -i "Harvard_Spring_2018.pem"
[email protected]
27
Parallelization
Programming modes I
1. Implicit parallelization:
1.1 Julia.
1.2 Python.
1.3 R.
1.4 Matlab.
2. Explicit parallelization:
2.1 OpenMP.
2.2 MPI.
28
Programming modes II
2.2 UPC.
2.3 X10.
2.4 Chapel.
6. Hybrids.
29
Flynn’s taxonomy
30
Two ways of parallelizing
1. for loop.
2. Map.
• In either case, create a function that depends on the state variables over which the problem can be parallelized:
• In our example, we have to create a function that computes the value function for a given set of state
variables.
31
Julia
Parallelization in Julia - for loops
32
Parallelization in Julia - for loops
1. Load the Distributed package:
using Distributed
2. Add worker processes:
addprocs(5)
3. Remove workers:
rmprocs(2,3,5)
4. Check the active workers:
workers()
33
Parallelization in Julia - for loops
1. Load the required packages:
using Distributed
using SharedArrays
2. Declare variables that are used inside the parallel for loop but not modified by the parallel iterations
with @everywhere, so they exist on every worker:
@everywhere nx = 1500;
3. Declare variables used inside the parallel for loop that are modified inside parallel iterations as
SharedArray:
tempV = SharedArray{Float64}(ne*nx);
34
Parallelization in Julia - for loops
35
Parallelization in Julia - for loops
@everywhere function value(currentState)
    ind = currentState.ind
    age = currentState.age
    # ...
    VV = -10.0^3;
    ixpopt = 0;
    # ... loop over ixp and update VV ...
    return(VV);
end
36
Parallelization in Julia - for loops
6. To parallelize a for loop, add @distributed before the for statement:
7. To synchronize before the code continues its execution, add @sync before the @distributed for
statement:
37
Parallelization in Julia - for loops
# Option 1: parallelize the inner loop over ix
nx = 350;
ne = 9;
for(ie = 1:ne)
  @sync @distributed for(ix = 1:nx)
    # ...
  end
end

# Option 2: parallelize the inner loop over ie
nx = 350;
ne = 9;
for(ix = 1:nx)
  @sync @distributed for(ie = 1:ne)
    # ...
  end
end
38
Parallelization in Julia - for loops
• OR convert the problem so all state variables are computed by iterating over a one-dimensional loop:
@sync @distributed for(ind = 1:(ne*nx))
  ix = convert(Int, ceil(ind/ne));
  ie = convert(Int, floor(mod(ind-0.05, ne))+1);
  # ...
end
39
Parallelization in Julia - Performance
Figure 3: Julia - 1 core used for computation
40
Parallelization in Julia - Performance
Figure: computation time (s, 0–13) against # of cores (1–8); the number of physical cores is marked.
41
Parallelization in Julia - for loops
42
Parallelization in Julia - Map
43
Parallelization in Julia - Map
for(age = T:-1:1)
  pars = [modelState(ix, age, ..., w, r) for ix in 1:nx];
  s = pmap(value, pars);
  for(ix = 1:nx)
    V[age, ix] = s[ix];
  end
end
44
Parallelization in Julia - Performance
Figure: computation time (s, 0–50) against # of cores (1–8); the number of physical cores is marked.
45
Parallelization in Julia - Final advice
• Wrapping the value function computation for every state in a function might significantly increase speed (even more
than parallelizing).
46
Python
Parallelization in Python - Map
2. Create a class that stores the state variables needed by the value function:
class modelState(object):
    def __init__(self, age, ix, ...):
        self.age = age
        self.ix  = ix
        # ...
47
Parallelization in Python
3. Define a function that computes the value for a given input state of type modelState:
def value_func(states):
    nx  = states.nx
    age = states.age
    # ...
    VV = math.pow(-10, 3)
    for ixp in range(0, nx):
        # ...
    return [VV]
48
Parallelization in Python
4. Use the Parallel and delayed functions from joblib to compute the value at every state:
results = Parallel(n_jobs=num_cores)(delayed(value_func)
    (modelState(ix, age, ..., w, r)) for ix in range(0, nx*ne))
49
Parallelization in Python
5. Life-cycle model:
50
Parallelization in Python - Performance
Figure: computation time (s, 0–2,600) against # of cores (1–8); the number of physical cores is marked.
51
R
Parallelization in R - Map
library("parallel")
2. Create the structure of parameters for the function that computes the value for a given state as a list:
52
Parallelization in R
3. Create the function that computes the value for a given state:
value = function(x){
  age = x$age
  ix  = x$ix
  ...
  VV = -10^3;
  for(ixp in 1:nx){
    # ...
  }
  return(VV);
}
53
Parallelization in R
cl <- makeCluster(no_cores)
5. Use function parLapply(cl, states, value) to compute value at every state in states with cl
cores:
for(age in T:1){
  states = lapply(1:nx, ...)
  s = parLapply(cl, states, value)
  for(ix in 1:nx){
    V[age, ix] = s[[ix]][1]
  }
}
54
Parallelization in R - Performance
Figure: computation time (s, 0–1,200) against # of cores (1–8); the number of physical cores is marked.
55
Matlab
Parallelization in Matlab - for loop
• Open a parallel pool of workers:
parpool(6)
• Replace for with parfor in the loop whose iterations are independent.
56
Parallelization in Matlab - Performance
Figure: computation time (s, 0–55) against # of cores (1–8); the number of physical cores is marked.
57
Parallelization in Matlab
• Extremely easy.
58
OpenMP
OpenMP I
• Tutorial: https://computing.llnl.gov/tutorials/openMP/
• Using OpenMP: Portable Shared Memory Parallel Programming by Barbara Chapman, Gabriele Jost,
and Ruud van der Pas.
• Fast to learn, reduced set of instructions, easy to code, but you need to worry about contention and
cache coherence.
59
OpenMP II
• API for multi-processor/core, shared memory machines defined by a group of major computer
hardware and software vendors.
60
OpenMP III
• Race conditions: you can impose fence conditions and/or make some data private to the thread.
61
Fork-join
62
Parallelization in C++ using OpenMP
1. Add the flag at compilation:
-fopenmp
2. Add #pragma omp directives to the regions of the code to be parallelized.
3. Select the number of threads with the environment variable:
export OMP_NUM_THREADS=32
4. We can always recompile without the flag and compiler directives are ignored.
5. Most implementations (although not the standard!) allow for nested parallelization and dynamic
thread changes.
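For instance, a minimal parallel loop (a hypothetical example; the #pragma is the only OpenMP-specific line, and it is simply ignored if the code is compiled without -fopenmp):

#include <vector>

int main() {
  const int nx = 1500;
  std::vector<double> V(nx);

  // Iterations are independent across grid points, so they can be split across threads.
  #pragma omp parallel for
  for (int ix = 0; ix < nx; ix++) {
    V[ix] = 0.0;  // ... solve the value function at grid point ix ...
  }
  return 0;
}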
Parallelization in C++ using OpenMP - Performance
Figure: computation time (s, 0–6) against # of cores (1–8); the number of physical cores is marked.
64
Parallelization in Rcpp using OpenMP
2. In the C++ code, add the following line to any function that you want to import from R:
// [[Rcpp::export]]
library("Rcpp")
65
Parallelization in Rcpp using OpenMP
Sys.setenv("OMP_NUM_THREADS"="8")
Sys.setenv("PKG_CXXFLAGS"=" -fopenmp")
sourceCpp("my_file.cpp")
66
MPI
MPI I
• Message Passing Interface (MPI) is a standardized and portable message-passing system based on the
consensus of the MPI Forum.
• Tutorial: https://computing.llnl.gov/tutorials/mpi/
• A couple of references:
1. Using MPI : Portable Parallel Programming with the Message Passing Interface (2nd edition) by William
Gropp, Ewing L. Lusk, and Anthony Skjellum.
67
MPI II
• Bindings for C++ and Fortran. Also for Python, Julia, R, and other languages.
• Harder to learn (MPI 3.0 standard has more than 440 routines) and code, but extremely powerful ⇒
used for state-of-the-art computations.
68
Figure: point-to-point communication — processes 1–4 exchanging messages mα, mβ, and mγ through matching Send/Receive calls.
In particular, MPI can broadcast (transfer an array from the master to every thread), scatter (send different parts of an array to different threads), and gather (collect pieces from all the threads back on the master).
69
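A minimal sketch of a broadcast using the C bindings (the array contents are illustrative):

#include "mpi.h"
#include <iostream>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double data[3] = {0.0, 0.0, 0.0};
  if (rank == 0) { data[0] = 1.0; data[1] = 2.0; data[2] = 3.0; }  // master fills the array
  // Every process receives the master's copy of data:
  MPI_Bcast(data, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  std::cout << "Rank " << rank << " got " << data[0] << "\n";

  MPI_Finalize();
  return 0;
}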
70
MPI III
71
Example code
#include "mpi.h"
#include <iostream>
int main( int argc, char *argv[] ){
int rank, size;
MPI::Init(argc, argv);
rank = MPI::COMM_WORLD.Get_rank();
size = MPI::COMM_WORLD.Get_size();
std::cout << "I am " << rank << " of " << size << "\n";
MPI::Finalize();
return 0;
}
72
Routines
• Communication:
3. Compute and move (sum, product, max of, . . . ) data on many processors.
• Synchronization.
• Enquiries:
73
MPI derived types
1. MPI_CHAR
2. MPI_DOUBLE_PRECISION
3. MPI_C_DOUBLE_COMPLEX
74
Parallelization in C++ using MPI - Performance
Figure: computation time (s, 0–7) against # of cores (1–8); the number of physical cores is marked.
75
GPUs
Big difference
• Intermediate: Coprocessors.
76
77
78
79
Floating-point operations per second for the CPU and GPU
80
When to go to the GPU?
Remember: a GPU is attached to the CPU via a PCI (Peripheral Component Interconnect) Express bus.
81
CUDA, OpenCL, and OpenACC
1. CUDA.
2. OpenCL.
3. OpenACC.
82
References
• Tapping the supercomputer under your desk: Solving dynamic equilibrium models with graphics
processors.
• Programming Massively Parallel Processors: A Hands-on Approach (3rd ed.) by David B. Kirk and Wen-mei W. Hwu.
• Heterogeneous Computing with OpenCL 2.0 by David R. Kaeli and Perhaad Mistry.
• OpenACC for Programmers: Concepts and Strategies by Sunita Chandrasekaran and Guido
Juckeland.
83
CUDA
CUDA
84
CUDA: advantages
• Brings C++11 (since version 7) and C++14 (since version 9) language features, albeit only a subset is
available in device code.
• Rapidly expanding third-party libraries: OpenCV machine learning, CULA linear algebra, HIPLAR linear
algebra for R.
• Enter Thrust:
3. A few lines of code to perform GPU-accelerated sort, scan, transform, and reduction operations
85
Example code
86
CUDA: disadvantages
• Limited community, most information comes from Nvidia and third-party developers.
87
Additional resources
• Books:
• References:
1. https://developer.nvidia.com/cuda-zone.
2. https://developer.nvidia.com/thrust.
3. https://devblogs.nvidia.com/
88
Thrust
Thrust I
• Thrust brings the power of GPUs to the masses (at least those familiar with C++).
• Thrust is a parallel algorithms library in the spirit of C++’s Standard Template Library.
• It targets tasks that:
1. “can be implemented efficiently without a detailed mapping to the target architecture,” and
2. “don’t merit or won’t receive (for whatever reason) significant optimization attention from the
programmer.”
• The idea is that the programmer spends more time on the problem, rather than on the
implementation of the algorithms solving the problem.
89
Thrust II
• Thrust incorporates tuned implementations for each backend: CUDA, OpenMP, and TBB.1
• This results in portability across parallel frameworks and hardware architecture without losing
performance.
1 Intel’s TBB – Threading Building Blocks – is a C++ template library for task parallelism.
90
Thrust III
• Thrust is “entirely defined in header files.” Hence, each modification in code requires recompilation.
• Documentation is limited and mostly based on examples. But it has improved over the years.
• Although the last release, version 1.8.1, dates back to 2015, Nvidia seems to be working on an
update.2
2 https://www.reddit.com/r/cpp/comments/7erub1/anybody_still_using_thrust/
91
Example code
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <algorithm>
#include <cstdlib>
int main(void){
// generate random data serially
thrust::host_vector<int> h_vec(100);
std::generate(h_vec.begin(), h_vec.end(), rand);
// transfer to device and compute sum
thrust::device_vector<int> d_vec= h_vec;
int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0,
thrust::plus<int>());
return 0;
}
92
Thrust IV
• Transformations are “algorithms that apply an operation to each element in a set of input ranges
and store the result in a destination range.”
• Compute Y = −X:
• Compute Y = X mod 2:
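A sketch of both transformations with thrust::transform, following the transformations example in the Thrust documentation (the vector contents are illustrative):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>
#include <thrust/functional.h>

int main(void) {
  thrust::device_vector<int> X(10);
  thrust::device_vector<int> Y(10);
  thrust::sequence(X.begin(), X.end());   // X = 0, 1, 2, ..., 9

  // Y = -X
  thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());

  // Y = X mod 2
  thrust::fill(Y.begin(), Y.end(), 2);
  thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), thrust::modulus<int>());
  return 0;
}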
94
Thrust V – Algorithms
3. thrust::max_element,
4. thrust::inner_product,
95
Thrust VI – Algorithms
• Thrust offers many more algorithms for, for example, reordering, sorting, and prefix-sums.
2. transform_iterator,
• A zip_iterator can be used to create “a virtual array of 3d vectors” that can be fed to other algorithms.
96
Additional resources
• Books:
• References:
1. https://devblogs.nvidia.com/expressive-algorithmic-programming-thrust/.
2. https://github.com/thrust/thrust/wiki/Quick-Start-Guide.
3. http://thrust.github.io/.
4. https://docs.nvidia.com/cuda/thrust/index.html
97
OpenACC
OpenACC I
• The main idea is to take existing serial code, say in C++, and give hints to the compiler about what should be
parallelized.
• OpenACC is a model designed to allow parallel programming across different computer architectures
with minimum effort by the developer. Portability means that the code should be independent of
hardware/compiler.
• The OpenACC specification supports C/C++ and Fortran and runs on CPUs and GPUs.
98
OpenACC II
• OpenACC is built around a very simple set of directives, very similar in design to OpenMP: OpenACC
uses the fork-join paradigm.
• The same program can be compiled to be executed in parallel using the CPU or the GPU (or mixing
them), depending on the hardware available.
• Communication between the master and worker threads in the parallel pool is automatically handled,
although the user can state directives to grant explicit access to variables, and to transfer objects
from the CPU to the GPU when required.
• The OpenACC website describes multiple compilers, profilers, and debuggers for OpenACC.
• We use the PGI Community Edition compiler. The PGI compiler can be used with the CPUs and
with NVIDIA Tesla GPUs. In this way, it suffices to select different flags at compilation time to
execute the code in the CPU or the GPU.
99
Using OpenACC I: Analyze
• Use a profiler to check where your code spends a lot of time. Typical bottlenecks are loops.
• Check if there is an optimized library that implements some of your code: cuBLAS, Armadillo.
100
Using OpenACC II: Parallelize
• Expose your code to parallelism starting with functions/operations that are time consuming on CPU.
• To execute a kernel:
• To parallelize a loop (both directives are sketched below):
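A sketch of the two directives on a hypothetical SAXPY-style loop (the function names and data clauses are illustrative, not from the slides):

void saxpy_kernels(int n, double a, const double* x, double* y) {
  // Let the compiler generate kernels for this whole region:
  #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
  }
}

void saxpy_parallel(int n, double a, const double* x, double* y) {
  // Explicitly mark this loop as parallel:
  #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
  }
}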
101
Example
• By choosing the appropriate compilation flag, we can compile the code to be executed in parallel only
in the CPU or in the GPU.
• To compile the code to be executed by the CPU, we must include the -ta=multicore flag at
compilation. In addition, the -acc flag must be added to tell the compiler this is OpenACC code:
• If, instead, we want to execute the program with an NVIDIA GPU, we rely on the -ta=nvidia flag:
103
Using OpenACC III: Optimize
• Give information to the compiler about the parts that can be optimized: data management (minimize copying between
host and device).
104
Additional resources
• Books:
• References:
1. https://devblogs.nvidia.com/tag/openacc/.
2. https://www.openacc.org/.
105
Comparisons
Comparisons I
• Vectorizing in Matlab.
• Etc.
106
Comparisons II
• The comparisons regarding parallelization are specific to the packages used on these slides:
107
Advice
• Short-run:
MATLAB, Python, Julia, or R
• Medium-run:
Rcpp
• Long-run:
C++ with OpenMP, MPI, or GPUs
108
Figure: computation time (s) against # of cores (1–8) for Julia (parallel), Rcpp, OpenMP, MPI, CUDA, and OpenACC; the number of physical cores is marked.
109