Shared Memory and Accelerators
ACCELERATOR PROGRAMMING
SHARED MEMORY PROGRAMMING
SHARED-MEMORY PROCESSORS
• General system memory is shared by all the cores on a computer/node.
• The programming model uses lightweight subprocesses known as threads.
• A master thread starts and controls the other threads.
• Threads are created/destroyed as needed.
• Each thread has a set of private variables. Other variables are shared by all threads.
SHARED-MEMORY MODEL
[Diagram: the cores of a node all attached to a single shared memory]
TYPES OF PARALLELISM
• Embarrassingly parallel (high-throughput
computing)
• Independent processes with little (or no) need to
communicate.
• Data parallelism
• Divide the data into smaller parts. Work on each part
individually, then if necessary collect results and go to
next phase.
• Task parallelism
• Perform multiple tasks at the same time on the data.
DATA PARALLELISM
• Independent tasks apply same operation to
different elements of a data set.
• Usually expressed as a loop.
for (i = 0; i < imax; i++) {
   a[i] = b[i] + c[i];
}
OPENMP
• OpenMP: An application programming interface
(API) for parallel programming on multiprocessors
• Compiler directives
• Library of support functions
• OpenMP works in conjunction with Fortran, C, or
C++
• Implemented within the compiler. Must be
activated by a compile-time option.
• Python can use OpenMP indirectly, by compiling C or Fortran code that uses it and wrapping the result.
OPENMP USE CASES
• C/C++/Fortran + OpenMP sufficient to program
shared-memory computers.
• C/C++/Fortran + MPI + OpenMP a good way to
program distributed computers built out of
shared-memory nodes.
• Most modern clusters including Rivanna are of this
type.
• OpenMP is easiest to use with data-parallel
applications.
FORK/JOIN PARALLELISM
• Initially only master thread is active
• Master thread executes sequential code
• Fork: Master thread creates or awakens
additional threads to execute parallel code
• Join: At end of parallel code created threads die
or are suspended
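• For illustration (a minimal sketch, not part of the original slides), the fork/join pattern looks like this in C; it compiles with any OpenMP-enabled compiler:
#include <stdio.h>

int main(void) {
    printf("serial: only the master thread runs this\n");

    #pragma omp parallel        /* fork: a team of threads is created */
    {
        printf("parallel: every thread in the team runs this\n");
    }                           /* join: the team ends; only the master continues */

    printf("serial again: only the master thread runs this\n");
    return 0;
}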
HOW OPENMP WORKS
[Diagram: the master thread forks a team of threads for each parallel region and joins them afterwards]
PRAGMAS AND PSEUDOCOMMENTS
• Pragma: a compiler directive in C or C++
• Stands for “pragmatic information”
• A way for the programmer to communicate with the
compiler
• The compiler is free to ignore pragmas
• Syntax:
#pragma omp <rest of pragma>
• Pseudocomment: the Fortran form of a directive, which otherwise looks like a comment
!$omp <rest of pseudocomment>
COMPILING OPENMP PROGRAMS
• OpenMP is a compiler-level library.
• Gnu Compiler Collection
• module load gcc
• gcc/g++/gfortran
• Use the -fopenmp flag when compiling.
• Intel’s compilers
• module load intel
• the compilers are icc/icpc/ifort
• Use the -qopenmp flag to compile
• Without the flag the pragmas/pseudocomments are ignored (and any OpenMP headers or library routines used may not be found).
RUNNING OPENMP PROGRAMS
• The default number of threads is implementation-dependent, but is usually the number of cores visible on the node.
• The most common way to set the number of threads is the environment variable OMP_NUM_THREADS.
• Example
gcc -fopenmp myomp.c
• or
gfortran -fopenmp myomp.f90
• Run with
./a.out
• or, to choose the number of threads explicitly,
export OMP_NUM_THREADS=4
./a.out
EXERCISE
• Using your choice of compiler, compile and run
• omphello.c
• or
• omphello.f90
• Try setting different values for
OMP_NUM_THREADS
PARALLEL FOR LOOPS
• C programs often express data-parallel operations as for loops; the Fortran equivalent is the do loop. A parallelized C loop is sketched below.
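• As an illustration (a minimal sketch, not from the original slides; the array names are made up), the same kind of loop parallelized with an OpenMP pragma:
#include <stdio.h>
#define IMAX 1000000

int main(void) {
    static double a[IMAX], b[IMAX], c[IMAX];
    int i;

    /* the iterations are divided among the threads */
    #pragma omp parallel for
    for (i = 0; i < IMAX; i++) {
        a[i] = b[i] + c[i];
    }

    printf("a[0] = %f\n", a[0]);
    return 0;
}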
FORTRAN: PARALLEL DO PSEUDOCOMMENT
• Format:
!$omp parallel do
do i=1, N
a(i) = b(i) + c(i)
enddo
!$omp end parallel do
FORTRAN ONLY: WORKSHARE
• OpenMP provides the WORKSHARE option to
support array operations in Fortran.
!$omp PARALLEL WORKSHARE
A=1.
B=42.
C=2.*B
A=B*C+D
!$omp END PARALLEL WORKSHARE
WARNING!!!
• You’re telling the compiler something that had
better be true or else there will be lots of
problems
• You’re telling the compiler there are no inter-iteration loop dependencies (see the counter-example below)
• i.e., the loop iterations are completely independent
• There cannot be statements that end the loop prematurely
• No break, return, exit, or goto
• But continue statements are allowed
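• For example (an illustrative sketch, not from the slides), this loop must not be parallelized, because each iteration reads a value written by the previous one:
/* WRONG to mark with a parallel for pragma: a[i] depends on a[i-1] */
for (i = 1; i < n; i++) {
    a[i] = a[i-1] + b[i];
}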
SHARED AND PRIVATE VARIABLES
• Shared variable: has the same address in the execution context of every thread
• Private variable: has a different address in the execution context of every thread
• A thread cannot access the private variables of another thread
• Default:
• Shared:
• C/C++: static variables; variables declared at file scope
• Fortran: COMMON blocks, SAVE variables, variables in modules
• Private:
• Loop index variables (of the parallelized loop)
• Stack variables in functions (C/C++) or subroutines (Fortran) invoked in a parallel region
• Fortran: automatic variables within a statement block
DECLARING PRIVATE VARIABLES
for (i = 0; i < N; i++)
   for (j = 0; j < N; j++)
      a[i][j] = myMIN(a[i][j]) + myMEAN(a[i][j]);
PRIVATE CLAUSE
• Clause: an optional, additional component to a
pragma
• Private clause: directs compiler to make one or
more variables private
• Same syntax for C/C++ and Fortran
EXAMPLE USE OF PRIVATE CLAUSE
C/C++
#pragma omp parallel for private(j)
for (i = 0; i < N; i++)
   for (j = 0; j < N; j++)
      a[i][j] = myMIN(a[i][j]) + myMEAN(a[i][j]);
Fortran
!$omp parallel do private(i)
do j=1,N
   do i=1,N
      a(i,j) = min(a(i,j), a(i,j)+tmp)
   enddo
enddo
!$omp end parallel do
RACE CONDITIONS
• Consider this C program segment to compute π using the rectangle rule:
double area, pi, x;
int i, n;
...
area = 0.0;
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;
RACE CONDITION (CONT.)
• If we simply parallelize the loop...
double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
x = (i+0.5)/n;
area += 4.0/(1.0 + x*x);
}
pi = area / n;
RACE CONDITION (CONT.)
• ... we set up a race condition in which one thread may “race ahead” of another and not see its change to the shared variable area.
• For example: both threads read area = 11.667. Thread A adds 3.765 and writes back 15.432; Thread B adds 3.563 and writes back 15.230, overwriting Thread A’s update. The answer should have been 18.995.
CRITICAL PRAGMA
• Critical section: a portion of code that only one
thread at a time may execute
• We denote a critical section by putting the
pragma
#pragma omp critical
in front of a block of C code.
For Fortran the equivalent is
!$omp critical
!$omp end critical
CORRECT, BUT INEFFICIENT, CODE
double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
x = (i+0.5)/n;
#pragma omp critical
area += 4.0/(1.0 + x*x);
}
pi = area / n;
SOURCE OF INEFFICIENCY
• Update to area inside a critical section
• Only one thread at a time may execute the
statement; i.e., it is sequential code
• The time to execute this statement is a significant part of the loop
• Speedup will be severely constrained by this
serialization
REDUCTIONS
• This pattern is called a reduction. Reductions are
so common that OpenMP provides support for
them.
• May add reduction clause to parallel for
pragma.
• Specify reduction operation and reduction
variable.
• OpenMP takes care of storing partial results in
private variables and combining partial results
after the loop.
REDUCTION CLAUSE
• The reduction clause has this syntax:
reduction (<op> :<variable>)
• Operators
+   Sum
*   Product
&   Bitwise and
|   Bitwise or
^   Bitwise exclusive or
&&  Logical and
||  Logical or
max Maximum (Fortran; C/C++ with OpenMP 3.1 or greater)
min Minimum (Fortran; C/C++ with OpenMP 3.1 or greater)
π-COMPUTING CODE WITH REDUCTION CLAUSE
double area, pi, x;
int i, n;
...
area = 0.0;
//pragma should be all on one line
#pragma omp parallel for private(x) reduction(+:area)
for (i = 0; i < n; i++) {
x = (i + 0.5)/n;
area += 4.0/(1.0 + x*x);
}
pi = area / n;
PARALLEL PRAGMA
• The parallel pragma launches a team of
threads.
• Execution is replicated among all threads.
• More general than parallel for or parallel do.
Permits break/exit in a loop.
FOR PRAGMA
• The parallel for pragma forks the threads and splits the iterations of the for loop among them
• The parallel pragma forks the threads and executes the same for loop in every thread (i.e. no loop is split into parts)
• But if you have already forked the threads (via a parallel pragma) and want to split a for loop among those existing threads (rather than executing the entire loop in every thread), use a for pragma, as in the sketch below:
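• A minimal sketch (an illustration, not from the slides) of a for pragma inside an existing parallel region:
#include <stdio.h>
#include <omp.h>
#define N 100

int main(void) {
    double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = i;

    #pragma omp parallel
    {
        /* replicated: every thread executes this statement */
        printf("thread %d entering the work-sharing loop\n", omp_get_thread_num());

        /* work-sharing: the iterations are split among the existing threads */
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];
    }
    printf("a[N-1] = %g\n", a[N-1]);
    return 0;
}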
USE OF MASTER PRAGMA
#pragma omp parallel private(i,j)
for (i = 0; i < m; i++) {
   low = a[i];
   high = b[i];
   if (low > high) {
      #pragma omp master
      printf ("Exiting (%d)\n", i);
      break;
   }
   #pragma omp for
   for (j = low; j < high; j++)
      c[j] = (c[j] - a[i])/b[i];
}
ATOMIC OPERATIONS
• What if we can’t use a reduction but need to
update a variable without a race condition?
• The atomic directive ensures that a variable is
accessed by one thread at a time.
#pragma omp atomic
count += 1;
!$omp atomic
count = count + 1
• The Fortran version does not need an end
because atomic applies to only one statement.
USEFUL FUNCTIONS
• C/C++: #include <omp.h>
• Fortran: use omp_lib
• omp_get_num_threads()
• This function returns the number of active threads
• Returns 1 in serial regions
int omp_get_num_threads (void)
integer function omp_get_num_threads()
• omp_get_thread_num()
• This function returns the thread identification number
• If there are t threads, the ID numbers range from 0 to t-1
• The master thread has ID number 0
int omp_get_thread_num (void)
integer function omp_get_thread_num()
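• A hedged illustration (not the course’s omphello.c, just a sketch) using both functions:
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* in a serial region this returns 1 */
    printf("Serial region: %d thread(s)\n", omp_get_num_threads());

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nt = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", id, nt);
    }
    return 0;
}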
TIMING FUNCTION
• C/C++
double t1, etime;
t1=omp_get_wtime();
/* ... compute ... */
etime=omp_get_wtime()-t1;
• Fortran
double precision :: t1, etime
t1=omp_get_wtime()
! ... compute ...
etime=omp_get_wtime()-t1
OPENMP SLURM RESOURCE REQUEST
• You will request a single node and some number of
cores on the node. Be sure that the number of cores
requested matches the OMP_NUM_THREADS
specified in your script.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=20
#SBATCH -p instructional
#SBATCH -A rivanna-training
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./myexec
EXERCISE
• Fortran and C/C++:
• Download omparea.c or omparea.f90
• Compile and run them as is
• Add the correction to get the right answer
• Python:
• Download and run multi.py
SHARED MEMORY PROGRAMMING IN PYTHON
THE GIL
• Standard Python implements a GIL (global interpreter lock).
• Threads within a single interpreter cannot run Python code in parallel.
• It can be faked but it’s slow.
• Better: Just start another process.
• Multiprocessing is standard in Python 2.7 and up.
• The next few examples are from the documentation at
http://docs.python.org/2/library/multiprocessing.html
MULTIPROCESSING
• Import the package
from multiprocessing import Process
• Define a function
def f(name):
    print('hello from '+name)
• Multiprocessing can only run as main
if __name__ == '__main__':
    ncpus=2
    procs=[]
    for i in range(ncpus):
        p=Process(target=f,args=(str(i),))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
WORKER POOLS
• For manager-worker problems, we can start a pool of
workers.
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = pool.apply_async(f, [10])
    print(result.get(timeout=1))
    print(pool.map(f, range(10)))
EXAMPLE
import multiprocessing as MP
import os
from pylab import *
import random

if __name__ == '__main__':
    ncpus=int(os.getenv("NUM_THREADS"))
    pool = MP.Pool(processes=ncpus)
    weights = linspace(0, 3.5, 100)
    volt=0.2
    args = [w * volt for w in weights]
    # launches multiple processes; spikes() is defined elsewhere
    results = pool.map(spikes, args)
    plot(weights, results, '.')
    show()
MORE REALISTIC EXAMPLE
• From http://kmdouglass.github.io/posts/learning-pythons-multiprocessing-module.html
import multiprocessing

def runSimulation(params):
    """This is the main processing function. It will contain
    whatever code should be run on multiple processors."""
    param1, param2 = params
    # Example computation
    processedData = []
    for ctr in range(1000000):
        processedData.append(param1*ctr - param2**2)
    return processedData
REALISTIC EXAMPLE (CONT)
if __name__ == '__main__':
    ncpus=4
    # Define the parameters to test
    param1 = range(100)
    param2 = range(2, 202, 2)
    # Zip the parameters because pool.map() takes only one iterable
    params = zip(param1, param2)
    pool = multiprocessing.Pool(processes=ncpus)
    results = pool.map(runSimulation, params)
    print(results)
ADD TIMING INFORMATION
#add import time to top of file, use time.time() rather than clock
#Run under Python 2 for consistency of map behavior
from __future__ import print_function
if __name__ == '__main__':
    ncpus=4
    # Define the parameters to test
    param1 = range(100)
    param2 = range(2, 202, 2)
    # Zip the parameters because pool.map() takes only one iterable
    params = zip(param1, param2)
    pool = multiprocessing.Pool(processes=ncpus)
    tic=time.time()
    results = pool.map(runSimulation, params)
    toc=time.time()
    print("Parallel time "+str(toc-tic))
    tic=time.time()
    results=map(runSimulation, params)
    toc=time.time()
    print("Serial time "+str(toc-tic))
    pool.close(); pool.join()
RESULT
• On my workstation with Python 2.7.11 this results
in
Parallel time 9.66964006424
Serial time 25.3560800552
• This is a speedup of a factor of 2.6
• The efficiency is thus 0.66
WARNING
• If no processes argument is given to Pool, it starts one process per core it detects on your machine.
• If using a shared resource this can be bad. You
must tell it how many to use and it must match
your request.
• We can use os.getenv('Envvar') to get
the value of an environment variable, if we have
an appropriate variable.
PYTHON MP SLURM RESOURCE REQUEST
• You will request a single node and some number of
cores on the node. You will need to set some
environment variable, then use os.getenv() in your
code to capture it.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=20
#SBATCH -p instructional
#SBATCH -A rivanna-training
export NUM_THREADS=${SLURM_CPUS_PER_TASK}
module load mpi4py/3.0.0-py3.6
python mymulti.py
IN THE CODE
import os
import multiprocessing
…
def f(name):
    print("Greetings from "+str(name))

ncpus=int(os.getenv('NUM_THREADS'))
pool=multiprocessing.Pool(processes=ncpus)
pool.map(f,range(ncpus))
pool.close()
pool.join()
SAMPLE RESULT
Greetings from 0
Greetings from 1
Greetings from 3
Greetings from 2
• Why are they not in order??
• Multiprocessing is not deterministic! The output
depends on in which order the processes complete
and can access the standard output.
• The map method of Pool does force determinism
for the return values. So if we change it a bit:
KEEP IT ORDERED
import os
from multiprocessing import Pool

def f(name):
    return "Greetings from "+str(name)

ncpus=int(os.getenv('SLURM_NTASKS'))
pool=Pool(processes=ncpus)
result=pool.map(f,range(ncpus))
print(';'.join(result))
pool.close()
pool.join()
• Output:
Greetings from 0;Greetings from 1;Greetings from 2;Greetings from 3
EXERCISE
• Write a multiprocessing program that computes
the sum of the cubes of the numbers from 1.0 to
1000.0 by increments of 0.1
• Add the timing routines to compare the parallel
and serial times. You can use the cluster frontend
if you do not have a Python 2 or 3 environment,
with
module load anaconda/5.2.0-py2.7
• or
module load anaconda/5.2.0-py3.6
PROGRAMMING NEW HARDWARE
ACCELERATORS
• Accelerators include
• General-purpose GPUs (GPGPUs)
• Intel MIC
• These are programmed with OpenACC (GPGPUs) or OpenMP (MIC; GPGPU offload extensions to OpenMP appear in newer versions).
GENERAL PURPOSE GRAPHICAL PROCESSING UNITS
GPGPU ARCHITECTURE
• GPUs have
• thousands of ALUs (Arithmetic Logic Units) compared
to 4-8 for a CPU
• very limited instruction sets
• very fast memory bandwidth
• but not that much memory
• Most consumer GPUs have little or no hardware for double-precision floats.
• Data-center GPGPUs such as the K80 and the P100 do provide double-precision hardware.
PROGRAMMING MODELS
• CUDA
• CUDA is a library from NVIDIA that allows general-purpose
computations on devices originally designed for graphics
• OpenMP
• OpenMP will support both NVIDIA and AMD GPGPUs but
compiler support is still lagging
• OpenACC
• Developed by the Portland Group (PGI), a compiler vendor, for NVIDIA devices.
• PGI was bought by NVIDIA; its compilers are becoming the “NVIDIA HPC SDK”.
• Similar to OpenMP (pragma/pseudocomment based)
• Will be our example
PROGRAMMING MODEL
• Copy data to the GPU, then perform the computation on the GPGPU inside the region marked by acc parallel ... acc end parallel.
Fortran (serial)
subroutine saxpy(n, a, x, y)
  integer, intent(in) :: n
  real, intent(in) :: a
  real, dimension(n), intent(in) :: x
  real, dimension(n), intent(inout) :: y
  integer :: i
  do i=1,n
     y(i) = a*x(i)+y(i)
  enddo
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)

C (serial)
void saxpy(int n, float a,
           float *x, float *restrict y){
  //restrict prohibits aliasing
  for (int i=0;i<n;++i)
     y[i]=a*x[i]+y[i];
}
...
//Perform SAXPY on 1M elements
saxpy(1<<20,2.0,x,y);
OPENMP VERSION
Fortran
subroutine saxpy(n, a, x, y)
  integer, intent(in) :: n
  real, intent(in) :: a
  real, dimension(n), intent(in) :: x
  real, dimension(n), intent(inout) :: y
  integer :: i
  !$omp parallel do
  do i=1,n
     y(i) = a*x(i)+y(i)
  enddo
  !$omp end parallel do
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)

C
void saxpy(int n, float a,
           float *x, float *restrict y){
  #pragma omp parallel for
  for (int i=0;i<n;++i)
     y[i]=a*x[i]+y[i];
}
...
//Perform SAXPY on 1M elements
saxpy(1<<20,2.0,x,y);
OPENACC VERSION
Fortran
subroutine saxpy(n, a, x, y)
  integer, intent(in) :: n
  real, intent(in) :: a
  real, dimension(n), intent(in) :: x
  real, dimension(n), intent(inout) :: y
  integer :: i
  !$acc parallel loop
  do i=1,n
     y(i) = a*x(i)+y(i)
  enddo
  !$acc end parallel loop
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)

C
void saxpy(int n, float a,
           float *x, float *restrict y){
  #pragma acc parallel loop
  for (int i=0;i<n;++i)
     y[i]=a*x[i]+y[i];
}
...
//Perform SAXPY on 1M elements
saxpy(1<<20,2.0,x,y);
COMPILING
• With the PGI compiler
module load pgi
pgcc -acc mysaxpy.c
pgf90 -acc mysaxpy.f90
pgCC -acc mysaxpy.cxx
SPECIAL CONSIDERATIONS FOR GPGPUS
• GPGPUs have a small amount of memory with
very high bandwidth
• Management of data movement to/from the
device is critical for performance
• These pragmas do not have (basic) OpenMP
equivalents.
KERNELS AND DATA
• Kernels are implemented on the device
• #pragma acc kernels
• !$acc kernels
• !$acc end kernels
• Data constructs are regions where data is
accessible to the device
• #pragma acc data
• !$acc data
• !$acc end data
KERNELS
• With the kernels directive the compiler determines what can be offloaded to the GPU
!$acc kernels
do i=1,n
a(i) = 0.0
b(i) = 1.0
c(i) = 2.0
end do
do i=1,n
a(i) = b(i)+c(i)
enddo
!$acc end kernels
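• For C, a roughly equivalent sketch (an illustration, not from the slides; the array names follow the Fortran example above):
void init_and_add(int n, float *restrict a, float *restrict b, float *restrict c) {
    #pragma acc kernels
    {
        /* the compiler decides how to offload these loops */
        for (int i = 0; i < n; i++) {
            a[i] = 0.0f;
            b[i] = 1.0f;
            c[i] = 2.0f;
        }
        for (int i = 0; i < n; i++) {
            a[i] = b[i] + c[i];
        }
    }
}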
DATA MOVEMENT
• Enter data
• Data are allocated on and moved to the device
• #pragma acc enter data
• !$acc enter data
• Exit data
• Data moved to the device stays there until the end of the program or until an exit data pragma removes it
• #pragma acc exit data
• !$acc exit data
• Update data
• Copies data between the host thread’s memory and the device (see the sketch below)
• #pragma acc update
• !$acc update
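• A minimal C sketch of these directives working together (an illustration under assumed names, not from the slides):
void scale_on_device(int n, float *restrict x) {
    /* allocate x on the device and copy it in */
    #pragma acc enter data copyin(x[0:n])

    /* x is already present, so no extra copy happens here */
    #pragma acc parallel loop present(x[0:n])
    for (int i = 0; i < n; i++)
        x[i] = 2.0f * x[i];

    /* refresh the host copy from the device */
    #pragma acc update self(x[0:n])

    /* remove x from the device */
    #pragma acc exit data delete(x[0:n])
}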
EXCESSIVE COPIES SLOW DOWN CODE
• Example:
while (error>tol && i<maxIter) {
   #pragma acc parallel loop reduction(max:err)
   // data transferred into the GPU here, on every iteration
   for (something) {do things with A, Anew}
   // data transferred out of the GPU here, on every iteration
}
• Data Regions
• !$acc data
• !$acc end data
• #pragma acc data
DATA DIRECTIVES
• copy (list)
• copy items in the list to the GPU on entry; to host on exit
• copyin (list)
• copy items in list to the GPU
• copyout (list)
• copy items in list from the GPU
• create (list)
• allocate GPU memory but do not copy (good for temporary
variables not needed outside)
• present (list)
• assert that items in list are already present in the GPU's
memory
IMPROVING THE DOUBLE LOOP
#pragma acc data copy(A), create(Anew)
while (error>tol && i<maxIter) {
#pragma acc parallel loop reduction(max:err)
for (something) {do things with A, Anew}
}
PYTHON AND GPGPUS
• Numba
• from numba import cuda
• We use the cuda.jit decorator
• Check for a gpu
• print(cuda.gpus)
• https://numba.pydata.org/numba-doc/latest/cuda/overview.html
NUMBA EXAMPLE
• From their documentation
from numba import cuda, float32

TPB = 16  # threads per block; must be a compile-time constant

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x  # blocks per grid
    # ... the loop over shared-memory tiles that accumulates tmp is
    # omitted here; see the full example in the Numba documentation ...
    C[x, y] = tmp
PYCUDA
• This is a free product
• Requires the CUDA libraries, which must be installed by the user (or whoever administers the system)
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
import numpy
# The kernel source and call arguments were elided on the slide;
# what follows is the standard PyCUDA tutorial kernel.
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400,1,1), grid=(1,1))