Module 6

Uploaded by

singhguma86
MODULE SIX: LOOP OPTIMIZATIONS
Dr. Volker Weinberg | LRZ
GANG WORKER VECTOR

• Gang / Worker / Vector defines the various levels of parallelism we can
  achieve with OpenACC
• This parallelism is most useful when parallelizing multi-dimensional loop
  nests
• OpenACC allows us to define a generic Gang / Worker / Vector model that will
  be applicable to a variety of hardware, but we will focus a little bit on a
  GPU-specific implementation
[Figure: Vector, Workers, Gang]
GANG WORKER VECTOR
• When parallelizing our loops, the highest level of parallelism is gang-level
  parallelism
• When encountering either the kernels or parallel directive, multiple gangs
  will be generated, and loop iterations will be spread across the gangs
• These gangs are completely independent of each other, and there is no way
  for the programmer to know exactly how many gangs are running at a given
  time
• In many architectures, the gangs have completely separate (or private)
  memory
[Figure: Gang]
GANG WORKER VECTOR
• In our code example, we see that we are applying the gang clause to an
  outer loop
• This means that the outer-loop iterations will be split across some number
  of gangs
• These gangs will then execute in parallel with each other
• Whenever a parallel compute region is encountered, some number of gangs will
  be created
• The programmer is able to specify exactly how many gangs to create
[Figure: Gang]

    #pragma acc parallel loop gang
    for( i = 0; i < N; i++ )
        for( j = 0; j < M; j++ )
            < loop code >
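As a minimal sketch of requesting a specific gang count, the example below adds the num_gangs clause to a gang loop. The function name add_one, the sizes N and M, and the value 4 are all hypothetical choices for illustration, not tuned settings. A compiler without OpenACC support should simply ignore the pragma and run the loops serially.

```c
#define N 8   /* hypothetical outer-loop extent */
#define M 16  /* hypothetical inner-loop extent */

/* Adds 1.0f to every element of grid.
   The outer-loop iterations are split across the gangs. */
void add_one(float grid[N][M])
{
    /* num_gangs(4) is an illustrative request, not a recommendation. */
    #pragma acc parallel loop gang num_gangs(4) copy(grid[0:N][0:M])
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            grid[i][j] += 1.0f;
        }
    }
}
```

With 8 outer iterations and 4 gangs, each gang would typically receive 2 rows, although the exact distribution is left to the implementation.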
GANG WORKER VECTOR
• A vector is the lowest level of parallelism
• Every gang will have at least 1 vector
• A vector has the ability to run a single instruction on multiple data
  elements
• Many different architectures can implement vectors in different ways;
  however, OpenACC allows us to define them in a general,
  non-hardware-specific way
[Figure: Vector]
GANG WORKER VECTOR

• In our code example, the inner-loop iterations will be evenly divided across
  a vector
• This means that those loop iterations will be executing in parallel with
  one another
• Any loop that is inside of our vector loop cannot be parallelized further
[Figure: Vector]

    #pragma acc parallel loop gang
    for( i = 0; i < N; i++ )
        #pragma acc loop vector
        for( j = 0; j < M; j++ )
            < loop code >
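For a concrete version of the gang/vector pattern above, here is a small sketch that doubles every element of a matrix. The function name double_matrix and the row-major layout in a flat array are assumptions made for this example only.

```c
/* Doubles every element of a rows x cols matrix stored row-major in data.
   Outer loop: rows are split across gangs.
   Inner loop: the columns of one row are spread across a vector. */
void double_matrix(float *data, int rows, int cols)
{
    #pragma acc parallel loop gang copy(data[0:rows*cols])
    for (int i = 0; i < rows; i++) {
        #pragma acc loop vector
        for (int j = 0; j < cols; j++) {
            data[i * cols + j] *= 2.0f;
        }
    }
}
```

Because the vector loop is the innermost parallel level, any loop nested inside the j loop would have to run sequentially.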
GANG WORKER VECTOR
• The worker clause is a way for the programmer to have multiple vectors
  within a gang
• The primary use of the worker clause is to split up one large vector into
  multiple smaller vectors
• This can be useful when our inner parallel loops are very small, and will
  not benefit from having a large vector
[Figure: 3 Workers]
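To see why a short inner loop may not benefit from a large vector, the sketch below computes what fraction of vector lanes would do useful work. This is a simplified back-of-the-envelope model (the loop is assumed to be processed in whole-vector chunks), and lane_utilization is a hypothetical helper, not part of OpenACC.

```c
/* Fraction of vector lanes doing useful work when an inner loop of
   inner_iters iterations runs on vectors of vector_length lanes.
   Simplified model: the loop is processed in whole-vector chunks. */
double lane_utilization(int inner_iters, int vector_length)
{
    /* Number of whole-vector chunks needed (ceiling division). */
    int chunks = (inner_iters + vector_length - 1) / vector_length;
    return (double)inner_iters / ((double)chunks * vector_length);
}
```

Under this model, a 32-iteration inner loop keeps only a quarter of a 128-lane vector busy (lane_utilization(32, 128) is 0.25). Splitting the same gang into 4 workers of 32 lanes each gives full lanes per worker (lane_utilization(32, 32) is 1.0), with each worker taking a different outer-loop iteration.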
GANG WORKER VECTOR

• In our sample code, we apply both gang- and worker-level parallelism to our
  outer loop
• The main difference this creates for our code is that we can now have
  smaller vectors running the inner loop
• This will most likely improve performance if the inner loop is relatively
  small
[Figure: 3 Workers]

    #pragma acc parallel loop gang worker
    for( i = 0; i < N; i++ )
        #pragma acc loop vector
        for( j = 0; j < M; j++ )
            < loop code >
PARALLEL DIRECTIVE SYNTAX
• When using the parallel directive, you may define the number of gangs, the
  number of workers, and the vector length with num_gangs(N),
  num_workers(M), vector_length(Q)
• Then, you may define where they belong in the loops using gang, worker,
  vector

    #pragma acc parallel num_gangs(2) \
        num_workers(2) vector_length(32)
    {
        #pragma acc loop gang worker
        for(int x = 0; x < 4; x++){
            #pragma acc loop vector
            for(int y = 0; y < 32; y++){
                array[x][y]++;
            }
        }
    }
PARALLEL DIRECTIVE SYNTAX

• You may also apply gang/worker/vector when using the parallel loop
  construct

    #pragma acc parallel loop num_gangs(2) num_workers(2) \
        vector_length(32) gang worker
    for(int x = 0; x < 4; x++){
        #pragma acc loop vector
        for(int y = 0; y < 32; y++){
            array[x][y]++;
        }
    }
KERNELS DIRECTIVE SYNTAX
• When using the kernels directive, the process is somewhat simplified
• You may define the location and number by using gang(N), worker(M),
  vector(Q)
• You may also define gang, worker, and vector using the same method as with
  the parallel directive
• If you do not specify a number, the compiler will decide one

    #pragma acc kernels loop gang(2) worker(2)
    for(int x = 0; x < 4; x++){
        #pragma acc loop vector(32)
        for(int y = 0; y < 32; y++){
            array[x][y]++;
        }
    }
KERNELS DIRECTIVE SYNTAX
• When using the kernels directive, the process is somewhat simplified
• You may define the location and number by using gang(N), worker(M),
  vector(Q)
• You may also define gang, worker, and vector using the same method as with
  the parallel directive
• If you do not specify a number, the compiler will decide one
• Each loop nest can have different values for gang, worker, and vector

    #pragma acc kernels
    {
        #pragma acc loop gang(2) worker(2)
        for(int x = 0; x < 4; x++){
            #pragma acc loop vector(32)
            for(int y = 0; y < 32; y++){
                array[x][y]++;
            }
        }

        #pragma acc loop gang(4) worker(4)
        for(int x = 0; x < 16; x++){
            #pragma acc loop vector(16)
            for(int y = 0; y < 16; y++){
                array2[x][y]++;
            }
        }
    }
WARPS
• So far we have been using a very small number of gangs/workers/vectors,
  simply because they're easier to understand
• When actually programming, the number of gangs/workers/vectors will be much
  larger
• When specifically programming for an NVIDIA GPU, you will always want your
  vectors large enough to fully utilize warps
• A warp, simply put, is a group of 32 threads that the GPU schedules and
  executes together
• To utilize warps in OpenACC, always make sure that your vector length is a
  multiple of 32
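The multiple-of-32 rule can be sketched as follows. The helper round_up_to_warp, the WARP_SIZE macro, and the saxpy function are hypothetical names for this example, and 128 is only an illustrative vector length, not a tuned value.

```c
#define WARP_SIZE 32  /* warp size on NVIDIA GPUs, per the slide above */

/* Rounds a desired vector length up to the next multiple of the warp size. */
int round_up_to_warp(int n)
{
    return ((n + WARP_SIZE - 1) / WARP_SIZE) * WARP_SIZE;
}

/* y = a * x + y, with a vector length of 128 (4 full warps). */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop gang vector vector_length(128) \
        copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

For instance, round_up_to_warp(100) gives 128, so a loop that seems to want about 100 lanes would be given 4 full warps rather than 3 full warps plus a partial one.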
CUDA PROGRAMMING MODEL REVIEW

• A grid is composed of blocks, which are completely independent
• A block is composed of threads, which can communicate within their own
  block
• 32 threads form a warp
• Instructions are issued per warp
• If an operand is not ready, the warp will stall
• The hardware context-switches between warps when one is stalled
GANG WORKER VECTOR

• Gang is a general term that can mean a few different things. In short, it
  depends on your architecture.
• On a multicore CPU, generally gang = thread.
• On a GPU, generally gang = thread block.
• The way I like to think of it is that gang represents my outer-most level of
  parallelism for any architecture I am running on.
LOOP OPTIMIZATION RULES OF THUMB

• It is rarely a good idea to set the number of gangs in your code; let the
  compiler decide.
• Most of the time you can effectively tune a loop nest by adjusting only the
  vector length.
• It is rare to use a worker loop. When the vector length is very short, a
  worker loop can increase the parallelism in your gang.
• When possible, the vector loop should step through your arrays contiguously
  (stride 1).
• Use the device_type clause to ensure that tuning for one architecture
  doesn't negatively affect other architectures.
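Several of these rules can be combined in one short sketch. The function name scale_rows is hypothetical, the device type name nvidia is an assumption (device type names are defined by the compiler), and 128 is an illustrative value rather than a tuned one.

```c
/* Multiplies every element of a rows x cols row-major matrix by factor. */
void scale_rows(float *data, int rows, int cols, float factor)
{
    /* No num_gangs: the compiler picks the gang count.
       vector_length(128) follows device_type(nvidia), so it applies only
       when building for that (assumed) device type; other targets keep
       the compiler's default. */
    #pragma acc parallel loop gang copy(data[0:rows*cols]) \
        device_type(nvidia) vector_length(128)
    for (int i = 0; i < rows; i++) {
        /* j walks contiguous memory, so the vector loop has stride 1. */
        #pragma acc loop vector
        for (int j = 0; j < cols; j++) {
            data[i * cols + j] *= factor;
        }
    }
}
```

Swapping the two loops would make the vector loop jump cols elements at a time through memory, which is the access pattern the stride-1 rule is meant to avoid.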
MODULE REVIEW
KEY CONCEPTS
In this module we discussed…
• The loop directive enables the programmer to give more information to the
  compiler about specific loops
• This information may be used for correctness or to improve performance.
• The device_type clause allows the programmer to optimize for one device
  type without hurting others.
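One common case of loop information that matters for correctness is a reduction. The sketch below is an illustration under that assumption, not something taken from the slides; sum_array is a hypothetical name.

```c
/* Sums n integers. The reduction clause tells the compiler that total is
   accumulated across iterations, so parallel updates are combined safely. */
long sum_array(const int *values, int n)
{
    long total = 0;
    #pragma acc parallel loop reduction(+:total) copyin(values[0:n])
    for (int i = 0; i < n; i++) {
        total += values[i];
    }
    return total;
}
```

Without the reduction clause, parallel iterations could update total at the same time and produce a wrong sum, unless the compiler detects the reduction on its own.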
LAB ASSIGNMENT
In this module’s lab you will…
• Update the code from the previous module in an attempt to improve the
  performance
• Use PGProf to analyze the performance difference when changing your loops
• Experiment with the device_type clause to ensure GPU optimizations don't
  slow down the multicore speed-up, or vice versa
