
MODULE SIX:

LOOP OPTIMIZATIONS
Dr. Volker Weinberg | LRZ
GANG WORKER VECTOR

 Gang / Worker / Vector defines the various levels of parallelism we can achieve with OpenACC
 This parallelism is most useful when parallelizing multi-dimensional loop nests
 OpenACC allows us to define a generic Gang / Worker / Vector model that will be applicable to a variety of hardware, but we will focus a little bit on a GPU-specific implementation
GANG WORKER VECTOR
 When parallelizing our loops, the highest level of parallelism is gang-level parallelism
 When encountering either the kernels or parallel directive, multiple gangs will be generated, and loop iterations will be spread across the gangs
 These gangs are completely independent of each other, and there is no way for the programmer to know exactly how many gangs are running at a given time
 On many architectures, the gangs have completely separate (or private) memory
GANG WORKER VECTOR
 In our code example, we see that we are applying the gang clause to an outer loop
 This means that the outer-loop iterations will be split across some number of gangs
 These gangs will then execute in parallel with each other
 Whenever a parallel compute region is encountered, some number of gangs will be created
 The programmer is able to specify exactly how many gangs to create

#pragma acc parallel loop gang
for( i = 0; i < N; i++ )
  for( j = 0; j < M; j++ )
    < loop code >
GANG WORKER VECTOR
 A vector is the lowest level of parallelism
 Every gang will have at least 1 vector
 A vector has the ability to run a single instruction on multiple data elements
 Many different architectures can implement vectors in different ways; however, OpenACC allows us to define them in a general, non-hardware-specific way
GANG WORKER VECTOR

 In our code example, the inner-loop iterations will be evenly divided across a vector
 This means that those loop iterations will be executing in parallel with one another
 Any loop that is inside of our vector loop cannot be parallelized further

#pragma acc parallel loop gang
for( i = 0; i < N; i++ )
  #pragma acc loop vector
  for( j = 0; j < M; j++ )
    < loop code >
GANG WORKER VECTOR
 The worker clause is a way for the programmer to have multiple vectors within a gang
 The primary use of the worker clause is to split up one large vector into multiple smaller vectors
 This can be useful when our inner parallel loops are very small and will not benefit from having a large vector
GANG WORKER VECTOR

 In our sample code, we apply both gang- and worker-level parallelism to our outer loop
 The main difference this creates for our code is that we can now have smaller vectors running the inner loop
 This will most likely improve performance if the inner loop is relatively small

#pragma acc parallel loop gang worker
for( i = 0; i < N; i++ )
  #pragma acc loop vector
  for( j = 0; j < M; j++ )
    < loop code >
PARALLEL DIRECTIVE SYNTAX
 When using the parallel directive, you may define the number of gangs/workers/vectors with num_gangs(N), num_workers(M), vector_length(Q)
 Then, you may define where they belong in the loops using gang, worker, vector

#pragma acc parallel num_gangs(2) \
        num_workers(2) vector_length(32)
{
  #pragma acc loop gang worker
  for(int x = 0; x < 4; x++){
    #pragma acc loop vector
    for(int y = 0; y < 32; y++){
      array[x][y]++;
    }
  }
}
PARALLEL DIRECTIVE SYNTAX

 You may also apply gang/worker/vector when using the parallel loop construct

#pragma acc parallel loop num_gangs(2) num_workers(2) \
        vector_length(32) gang worker
for(int x = 0; x < 4; x++){
  #pragma acc loop vector
  for(int y = 0; y < 32; y++){
    array[x][y]++;
  }
}
KERNELS DIRECTIVE SYNTAX
 When using the kernels directive, the process is somewhat simplified
 You may define the location and number by using gang(N), worker(M), vector(Q)
 You may also define gang, worker, and vector using the same method as with the parallel directive
 If you do not specify a number, the compiler will decide one

#pragma acc kernels loop gang(2) worker(2)
for(int x = 0; x < 4; x++){
  #pragma acc loop vector(32)
  for(int y = 0; y < 32; y++){
    array[x][y]++;
  }
}
KERNELS DIRECTIVE SYNTAX
 When using the kernels directive, the process is somewhat simplified
 You may define the location and number by using gang(N), worker(M), vector(Q)
 You may also define gang, worker, and vector using the same method as with the parallel directive
 If you do not specify a number, the compiler will decide one
 Each loop nest can have different values for gang, worker, and vector

#pragma acc kernels
{
  #pragma acc loop gang(2) worker(2)
  for(int x = 0; x < 4; x++){
    #pragma acc loop vector(32)
    for(int y = 0; y < 32; y++){
      array[x][y]++;
    }
  }

  #pragma acc loop gang(4) worker(4)
  for(int x = 0; x < 16; x++){
    #pragma acc loop vector(16)
    for(int y = 0; y < 16; y++){
      array2[x][y]++;
    }
  }
}
WARPS
 So far we have been using a very small number of gangs/workers/vectors, simply because they're easier to understand
 When actually programming, the number of gangs/workers/vectors will be much larger
 When specifically programming for an NVIDIA GPU, you will always want your vectors large enough to fully utilize warps
 A warp, simply put, is an optimized group of 32 threads
 To utilize warps in OpenACC, always make sure that your vector length is a multiple of 32
CUDA PROGRAMMING MODEL REVIEW

 A grid is composed of blocks which are completely independent
 A block is composed of threads which can communicate within their own block
 32 threads form a warp
 Instructions are issued per warp
 If an operand is not ready, the warp will stall
 Context switch between warps when stalled
GANG WORKER VECTOR

 Gang is a general term that can mean a few different things. In short, it depends on your architecture.
 On a multicore CPU, generally gang=thread.
 On a GPU, generally gang=thread block.
 The way I like to think of it is that gang represents my outer-most level of parallelism for any architecture I am running on.
LOOP OPTIMIZATION RULES OF THUMB

 It is rarely a good idea to set the number of gangs in your code; let the compiler decide.
 Most of the time you can effectively tune a loop nest by adjusting only the vector length.
 It is rare to use a worker loop. When the vector length is very short, a worker loop can increase the parallelism in your gang.
 When possible, the vector loop should step through your arrays contiguously.
 Use the device_type clause to ensure that tuning for one architecture doesn't negatively affect other architectures.
MODULE REVIEW
KEY CONCEPTS
In this module we discussed…
 The loop directive enables the programmer to give more information to
the compiler about specific loops
 This information may be used for correctness or to improve
performance.
 The device_type clause allows the programmer to optimize for one
device type without hurting others.
LAB ASSIGNMENT
In this module’s lab you will…
 Update the code from the previous module in an attempt to improve the performance
 Use PGProf to analyze the performance difference when changing
your loops
 Experiment with the device_type clause to ensure GPU optimizations
don’t slow down the multicore speed-up, or vice versa