Module6
LOOP OPTIMIZATIONS
Dr. Volker Weinberg | LRZ
GANG WORKER VECTOR
Gang is a general term that can mean a few different things. In short, it depends on
your architecture.
On a multicore CPU, generally gang=thread.
On a GPU, generally gang=thread block.
The way I like to think of it is that gang represents my outer-most level of parallelism
for any architecture I am running on.
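As a rough illustration of gang as the outermost level of parallelism, the sketch below places gang on the outer loop of a nest and vector on the inner loop. The function name scale_matrix and its parameters are hypothetical, chosen only for this example. How gangs and vector lanes map to hardware is up to the compiler and target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: scale every element of a row-major
 * rows x cols matrix in place.
 * Compilers without OpenACC support ignore the pragmas, in which
 * case the loops simply run serially. */
void scale_matrix(float *a, int rows, int cols, float factor)
{
    assert(a != NULL && rows >= 0 && cols >= 0);

    /* Outer loop: iterations are distributed across gangs. */
    #pragma acc parallel loop gang copy(a[0:rows*cols])
    for (int i = 0; i < rows; ++i) {
        /* Inner loop: vector lanes walk along one row. */
        #pragma acc loop vector
        for (int j = 0; j < cols; ++j) {
            a[i * cols + j] *= factor;
        }
    }
}
```

On a multicore target each gang would typically be a CPU thread, and on a GPU a thread block, as described above.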
LOOP OPTIMIZATION RULES OF THUMB
It is rarely a good idea to set the number of gangs in your code; let the compiler
decide.
Most of the time you can effectively tune a loop nest by adjusting only the vector
length.
It is rare to use a worker loop. When the vector length is very short, a worker loop
can increase the parallelism in your gang.
When possible, the vector loop should step through your arrays contiguously, so that
neighboring vector lanes access neighboring elements.
Use the device_type clause to ensure that tuning for one architecture doesn’t
negatively affect other architectures.
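The rules above might combine as in the following sketch. It assumes a compiler that accepts nvidia as a device_type name (as the PGI/NVIDIA compilers do); the function name saxpy_2d and the vector length of 128 are illustrative choices, not tuned recommendations. The number of gangs is deliberately left unset, the vector loop is the stride-1 inner loop, and the vector length applies only to the NVIDIA target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y = alpha * x + y on row-major
 * rows x cols matrices. */
void saxpy_2d(float *y, const float *x, int rows, int cols, float alpha)
{
    assert(y != NULL && x != NULL && rows >= 0 && cols >= 0);
    int n = rows * cols;

    /* No num_gangs clause: the compiler picks the gang count.
     * vector_length(128) follows device_type(nvidia), so it applies
     * only when compiling for NVIDIA GPUs; other targets keep the
     * compiler defaults. 128 is an assumed starting value. */
    #pragma acc parallel loop gang copyin(x[0:n]) copy(y[0:n]) \
        device_type(nvidia) vector_length(128)
    for (int i = 0; i < rows; ++i) {
        /* Vector loop steps through memory contiguously. */
        #pragma acc loop vector
        for (int j = 0; j < cols; ++j) {
            y[i * cols + j] += alpha * x[i * cols + j];
        }
    }
}
```

Data clauses come before device_type because only a limited set of clauses, such as vector_length, num_gangs, and num_workers, may follow it on a parallel construct.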
MODULE REVIEW
KEY CONCEPTS
In this module we discussed…
The loop directive enables the programmer to give more information to
the compiler about specific loops
This information may be used for correctness or to improve
performance.
The device_type clause allows the programmer to optimize for one
device type without hurting others.
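One case where a loop clause matters for correctness rather than speed is a reduction. In the sketch below (the function name sum_array is hypothetical), every iteration updates the same variable, and the reduction clause tells the compiler to combine the partial sums safely. Many compilers detect simple reductions on their own, so stating the clause explicitly is a conservative choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum the n elements of x. */
double sum_array(const double *x, int n)
{
    assert(x != NULL && n >= 0);
    double total = 0.0;

    /* reduction(+:total) gives each parallel iteration a private
     * partial sum and adds the partial sums together at the end. */
    #pragma acc parallel loop reduction(+:total) copyin(x[0:n])
    for (int i = 0; i < n; ++i) {
        total += x[i];
    }
    return total;
}
```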
LAB ASSIGNMENT
In this module’s lab you will…
Update the code from the previous module in an attempt to improve
performance
Use PGProf to analyze the performance difference when changing
your loops
Experiment with the device_type clause to ensure GPU optimizations
don’t slow down the multicore speed-up, or vice versa