
INTRODUCTION TO OPENACC

Lecture 2: OpenACC Optimizations, November 2, 2016


Course Objective:

Enable you to accelerate your applications with OpenACC.

Course Syllabus

Oct 26: Analyzing and Parallelizing with OpenACC
Nov 2: OpenACC Optimizations
Nov 9: Advanced OpenACC

Recordings:
https://developer.nvidia.com/intro-to-openacc-course-2016

OPENACC OPTIMIZATIONS
Lecture 2: Jeff Larkin, NVIDIA
Today’s Objectives

Understand OpenACC data directives
Understand the 3 levels of OpenACC parallelism
Understand how to optimize loop decomposition
Understand other common optimizations to OpenACC codes

3 Steps to Accelerate with OpenACC

Analyze
Parallelize
Optimize

Case Study: Conjugate Gradient

A sample code implementing the conjugate gradient method has been provided in C/C++ and Fortran.
• To save space, only the C will be shown in slides.

You do not need to understand the algorithm to proceed, but should be able to understand C, C++, or Fortran.

For more information on the CG method, see
https://en.wikipedia.org/wiki/Conjugate_gradient_method

Analyze

Obtain a performance profile
Read compiler feedback
Understand the code

Parallelize

Insert OpenACC directives around important loops
Enable OpenACC in the compiler
Run on a parallel platform

Performance after step 2…

[Bar chart: speed-up from serial for the Serial, Multicore, K80 (single), and P100 versions.]
Source: PGI 16.9; Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz


Optimize

Get new performance data from parallel execution
Remove unnecessary data transfer to/from GPU
Guide the compiler to better loop decomposition
Refactor the code to make it more parallel

Optimize Data Movement

Last week we relied on Unified Virtual Memory to expose our data to both the CPU and GPU.
To make our code more portable and give the compiler more information, we will replace UVM with OpenACC data directives.

Case Study: Remove Managed Memory

Remove the "managed" suboption to the –ta compiler flag.

Now the compiler aborts because it doesn't know the sizes of the arrays used in the matvec function:

PGCC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages):
    Could not find allocated-variable index for symbol (main.cpp: 12)
matvec(const matrix &, const vector &, const vector &):
      8, include "matrix_functions.h"
     12, Accelerator kernel generated
         Generating Tesla code
         15, #pragma acc loop gang /* blockIdx.x */
         20, #pragma acc loop vector(128) /* threadIdx.x */
             Generating reduction(+:sum)
     20, Accelerator restriction: size of the GPU copy of Acoefs,cols,xcoefs is unknown
         Loop is parallelizable
PGCC/x86 Linux 16.9-0: compilation completed with severe errors

Data Clauses

copyin ( list )  Allocates memory on GPU and copies data from host to GPU when entering region.

copyout ( list ) Allocates memory on GPU and copies data to the host when exiting region.

copy ( list )    Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region. (Structured Only)

create ( list )  Allocates memory on GPU but does not copy.

delete ( list )  Deallocates memory on the GPU without copying. (Unstructured Only)

present ( list ) Data is already present on GPU from another containing data region.

(!) All of these will check if the data is already present first and reuse it if found.
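
For illustration, here is a minimal sketch combining several of these clauses on one structured data region (the routine and arrays below are made up for this example, not taken from the case study):

#include <stdlib.h>

// a is only read on the GPU, b is only written, and tmp never needs to
// exist on the host at all, so each array gets a different data clause.
void scale_and_shift(const double *a, double *b, int n)
{
    double *tmp = (double*)malloc(n * sizeof(double));

    #pragma acc data copyin(a[0:n]) copyout(b[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            tmp[i] = 2.0 * a[i];      // tmp lives only on the GPU

        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            b[i] = tmp[i] + 1.0;      // b is copied back when the region ends
    }

    free(tmp);
}
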
Array Shaping

Compiler sometimes cannot determine size of arrays
Must specify explicitly using data clauses and array "shape"
Partial arrays must be contiguous

C/C++
#pragma acc data copyin(a[0:nelem]) copyout(b[s/4:3*s/4])

Fortran
!$acc data copyin(a(1:end)) copyout(b(s/4:3*s/4))

Matvec Data Clauses

The compiler needs additional information about several arrays used in matvec. It cannot determine the bounds of the "j" loop, and therefore cannot determine the bounds of these arrays.

Data clauses aren't strictly needed in dot and waxpby because the compiler can determine the array shape from the loop bounds.

#pragma acc parallel loop \
  copyout(ycoefs[:num_rows]) \
  copyin(Acoefs[:A.nnz],xcoefs[:num_rows],cols[:A.nnz])
for(int i=0;i<num_rows;i++) {
  ...
  #pragma acc loop reduction(+:sum)
  for(int j=row_start; j<row_end; j++) {
    ...;
  }
  ycoefs[i]=sum;
}

PGPROF: Data Movement

Manage Data Higher in the Program

Currently data is moved at the beginning and end of each function, in case the data is needed on the CPU.
We know that the data is only needed on the CPU after convergence.
We should inform the compiler when data movement is really needed to improve performance.
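
As a minimal sketch of the idea (a toy loop with made-up arrays, not the course's CG code, using the data directive introduced on the next slide): keep the arrays resident on the GPU across every iteration and copy the result back only once, after the loop finishes.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1 << 20;
    double *x = (double*)malloc(n * sizeof(double));
    double *r = (double*)malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = 1.0; }

    // x and r stay on the GPU for all 100 iterations; x is copied back to
    // the host a single time, at the closing brace of the data region.
    #pragma acc data copyin(r[0:n]) copy(x[0:n])
    {
        for (int iter = 0; iter < 100; iter++) {
            #pragma acc parallel loop present(x[0:n], r[0:n])
            for (int i = 0; i < n; i++)
                x[i] += 0.5 * r[i];   // stand-in for the real solver work
        }
    }

    printf("x[0] = %f\n", x[0]);      // prints 50.000000
    free(x);
    free(r);
    return 0;
}
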
Structured Data Regions

The data directive defines a region of code in which GPU arrays remain on the GPU and are shared among all kernels in that region.

#pragma acc data
{
  #pragma acc parallel loop
  ...
  #pragma acc parallel loop
  ...
}

Arrays used within the data region will remain on the GPU until the end of the data region.

The same data region in Fortran:

!$acc data
!$acc parallel loop
...
!$acc parallel loop
...
!$acc end data

Unstructured Data Directives

Used to define data regions when scoping doesn't allow the use of normal data regions (e.g. the constructor/destructor of a class).

enter data  Defines the start of an unstructured data lifetime
            • clauses: copyin(list), create(list)
exit data   Defines the end of an unstructured data lifetime
            • clauses: copyout(list), delete(list), finalize

#pragma acc enter data copyin(a)
...
#pragma acc exit data delete(a)

Unstructured Data: C++ Classes

Unstructured data regions enable OpenACC to be used in C++ classes.

Unstructured data regions can be used whenever data is allocated and initialized in a different scope than where it is freed (e.g. Fortran modules).

class Matrix {
  Matrix(int n) {
    len = n;
    v = new double[len];
    #pragma acc enter data create(v[0:len])
  }
  ~Matrix() {
    #pragma acc exit data delete(v[0:len])
    delete[] v;
  }
private:
  double* v;
  int len;
};

Explicit Data Movement: Copy In Matrix

After allocating and initializing our matrix, copy it to the device. Copy the structure first and its members second.

void allocate_3d_poisson_matrix(matrix &A, int N) {
  int num_rows=(N+1)*(N+1)*(N+1);
  int nnz=27*num_rows;
  A.num_rows=num_rows;
  A.row_offsets = (unsigned int*)malloc((num_rows+1)*sizeof(unsigned int));
  A.cols = (unsigned int*)malloc(nnz*sizeof(unsigned int));
  A.coefs = (double*)malloc(nnz*sizeof(double));

  // Initialize Matrix

  A.row_offsets[num_rows]=nnz;
  A.nnz=nnz;
  #pragma acc enter data copyin(A)
  #pragma acc enter data \
    copyin(A.row_offsets[:num_rows+1],A.cols[:nnz],A.coefs[:nnz])
}

Explicit Data Movement: Delete Matrix

Before freeing the matrix, remove it from the device. Delete the members first, then the structure. We must do the same in vector.h.

void free_matrix(matrix &A) {
  unsigned int *row_offsets=A.row_offsets;
  unsigned int *cols=A.cols;
  double *coefs=A.coefs;

  #pragma acc exit data delete(A.row_offsets,A.cols,A.coefs)
  #pragma acc exit data delete(A)
  free(row_offsets);
  free(cols);
  free(coefs);
}

Running With Explicit Memory Management

Rebuild the code without managed memory: change –ta=tesla:managed to just –ta=tesla.
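
For reference, the build line for this step might look something like the following (the compiler driver, source file name, and remaining flags are assumptions for illustration, not taken from the course materials):

pgc++ -fast -ta=tesla:managed -Minfo=accel main.cpp -o cg    # last week: managed memory
pgc++ -fast -ta=tesla -Minfo=accel main.cpp -o cg            # this step: explicit data management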

Expected:                               Actual:
Rows: 8120601, nnz: 218535025           Rows: 8120601, nnz: 218535025
Iteration: 0, Tolerance: 4.0067e+08     Iteration: 0, Tolerance: 1.9497e+05
Iteration: 10, Tolerance: 1.8772e+07    Iteration: 10, Tolerance: 1.6919e+02
Iteration: 20, Tolerance: 6.4359e+05    Iteration: 20, Tolerance: 6.2901e+00
Iteration: 30, Tolerance: 2.3202e+04    Iteration: 30, Tolerance: 2.0165e-01
Iteration: 40, Tolerance: 8.3565e+02    Iteration: 40, Tolerance: 7.4122e-03
Iteration: 50, Tolerance: 3.0039e+01    Iteration: 50, Tolerance: 2.5316e-04
Iteration: 60, Tolerance: 1.0764e+00    Iteration: 60, Tolerance: 9.9229e-06
Iteration: 70, Tolerance: 3.8360e-02    Iteration: 70, Tolerance: 3.4854e-07
Iteration: 80, Tolerance: 1.3515e-03    Iteration: 80, Tolerance: 1.2859e-08
Iteration: 90, Tolerance: 4.6209e-05    Iteration: 90, Tolerance: 5.3950e-10
Total Iterations: 100                   Total Iterations: 100
Total Time: 8.458965s                   Total Time: 8.454335s

The code runs, but the tolerances no longer match the expected (managed-memory) run: data initialized on the CPU is never copied to the GPU. The update directive on the next slides fixes this.

OpenACC Update Directive

Programmer specifies an array (or part of an array) that should be refreshed within a data region.

do_something_on_device()
!$acc update host(a)    ! Copy "a" from GPU to CPU
do_something_on_host()
!$acc update device(a)  ! Copy "a" from CPU to GPU

Note: update "host" has been deprecated and renamed "self".

Explicit Data Movement: Update Vector

After we change the vector on the CPU, we need to update it on the GPU.

void initialize_vector(vector &v, double val)
{
  for(int i=0;i<v.n;i++)
    v.coefs[i]=val;
  #pragma acc update device(v.coefs[:v.n])
}

update device    : CPU -> GPU
update self/host : GPU -> CPU

PGPROF: Data Movement Now

Optimize Loops

Now let's look at how our iterations get mapped to hardware.

Compilers give their best guess about how to transform loops into parallel kernels, but sometimes they need more information. This information could be our knowledge of the code or based on profiling.

Optimizing Matvec Loops

Compiler feedback:

matvec(const matrix &, const vector &, const vector &):
      8, include "matrix_functions.h"
     12, Generating copyin(Acoefs[:A->nnz],cols[:A->nnz])
         Generating implicit copyin(row_offsets[:num_rows+1])
         Generating copyin(xcoefs[:num_rows])
         Generating copyout(ycoefs[:num_rows])
         Accelerator kernel generated
         Generating Tesla code
         16, #pragma acc loop gang /* blockIdx.x */
         21, #pragma acc loop vector(128) /* threadIdx.x */
             Generating reduction(+:sum)
     21, Loop is parallelizable

Source:

14 #pragma acc parallel loop \
15   copyout(ycoefs[:num_rows]) copyin(Acoefs[:A.nnz],xcoefs[:num_rows],cols[:A.nnz])
16 for(int i=0;i<num_rows;i++) {
17   double sum=0;
18   int row_start=row_offsets[i];
19   int row_end=row_offsets[i+1];
20   #pragma acc loop reduction(+:sum)
21   for(int j=row_start;j<row_end;j++) {
22     unsigned int Acol=cols[j];
23     double Acoef=Acoefs[j];
24     double xcoef=xcoefs[Acol];
25     sum+=Acoef*xcoef;
26   }
27   ycoefs[i]=sum;
28 }
29 }

The compiler is vectorizing 128 iterations of the inner "j" loop (#pragma acc loop vector(128) at line 21). How many iterations does it really do?

Optimizing Matvec Loops (cont.)

The compiler does not know how many iterations the inner loop will do, so it chooses a default value of 128.

We can see in the initialization routine that it will only iterate 27 times (the number of non-zeros per row). Reducing the vector length should improve hardware utilization.

void allocate_3d_poisson_matrix(matrix &A, int N) {
  int num_rows=(N+1)*(N+1)*(N+1);
  int nnz=27*num_rows;
  A.num_rows=num_rows;
  A.row_offsets=(unsigned int*)malloc((num_rows+1)*sizeof(unsigned int));
  A.cols=(unsigned int*)malloc(nnz*sizeof(unsigned int));
  A.coefs=(double*)malloc(nnz*sizeof(double));
  // ...
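
As a rough back-of-the-envelope check (assuming one inner-loop iteration per vector lane):

With vector length 128 and 27 non-zeros per row: 27 / 128 ≈ 21% of the vector lanes do useful work.
With vector length  32 and 27 non-zeros per row: 27 /  32 ≈ 84% of the vector lanes do useful work.
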
OpenACC: 3 Levels of Parallelism

• Vector threads work in lockstep (SIMD/SIMT parallelism)
• Workers compute a vector
• Gangs have 1 or more workers and share resources (such as cache, the streaming multiprocessor, etc.)
• Multiple gangs work independently of each other

OpenACC gang, worker, vector Clauses

gang, worker, and vector can be added to a loop clause.
A parallel region can only specify one of each gang, worker, vector.
Control the size using the following clauses on the parallel region: num_gangs(n), num_workers(n), vector_length(n).

#pragma acc parallel loop gang
for (int i = 0; i < n; ++i)
  #pragma acc loop vector
  for (int j = 0; j < n; ++j)
    ...

#pragma acc parallel vector_length(32)
#pragma acc loop gang worker
for (int i = 0; i < n; ++i)
  #pragma acc loop vector
  for (int j = 0; j < n; ++j)
    ...

Optimizing Matvec Loops: Vector Length

Use the OpenACC loop directive to force the compiler to vectorize the inner loop. Use vector_length to reduce the vector length closer to the actual loop iteration count.

Note: NVIDIA GPUs need vector lengths that are multiples of 32 (the warp size).

14 #pragma acc parallel loop vector_length(32) \
15   copyout(ycoefs[:num_rows]) copyin(Acoefs[:A.nnz],xcoefs[:num_rows],cols[:A.nnz])
16 for(int i=0;i<num_rows;i++) {
17   double sum=0;
18   int row_start=row_offsets[i];
19   int row_end=row_offsets[i+1];
20   #pragma acc loop vector reduction(+:sum)
21   for(int j=row_start;j<row_end;j++) {
22     unsigned int Acol=cols[j];
23     double Acoef=Acoefs[j];
24     double xcoef=xcoefs[Acol];
25     sum+=Acoef*xcoef;
26   }
27   ycoefs[i]=sum;
28 }

Matvec Performance Limiter

Instruction and memory latency are limiting kernel performance.

Recall: GPUs tolerate latency by having enough parallelism.

PGPROF will guide you to perform a latency analysis to understand why.

Matvec Occupancy

Occupancy is a measure of how well the GPU is being utilized. 100% occupancy means the GPU is running as many simultaneous threads as it can.

We're only keeping the GPU 25% occupied. Why?

We need 64 threadblocks to get 100% occupancy, but the hardware can only manage 16. Why?

We've reduced our vector length so that each block has only 1 warp, so we need at least 4X more parallelism per gang.
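
A rough worked example of where the 25% comes from (assuming the Kepler-generation limits of at most 16 resident thread blocks and 64 resident warps per SM):

vector_length(32) only:            1 warp per block x 16 resident blocks = 16 warps -> 16/64 = 25% occupancy
num_workers(4), vector_length(32): 4 warps per block x 16 resident blocks = 64 warps -> 64/64 = 100% occupancy
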
Increasing per-gang parallelism

• The inner loop lacks sufficient parallelism to occupy the GPU
• We can increase the size of the gang by increasing the number of workers

Optimizing Matvec Loops: Increase Workers

By splitting the iterations of the outer loop among workers and gangs, we'll increase the size of the gangs. The compiler will handle breaking up the outer loop so you don't have to.

Now we should have 4x32 = 128 threads in each GPU threadblock.

14 #pragma acc parallel loop gang worker \
     num_workers(4) vector_length(32) \
15   copyout(ycoefs[:num_rows]) copyin(Acoefs[:A.nnz],xcoefs[:num_rows],cols[:A.nnz])
16 for(int i=0;i<num_rows;i++) {
17   double sum=0;
18   int row_start=row_offsets[i];
19   int row_end=row_offsets[i+1];
20   #pragma acc loop vector reduction(+:sum)
21   for(int j=row_start;j<row_end;j++) {
22     unsigned int Acol=cols[j];
23     double Acoef=Acoefs[j];
24     double xcoef=xcoefs[Acol];
25     sum+=Acoef*xcoef;
26   }
27   ycoefs[i]=sum;
28 }

Increase Workers: Compiler feedback

The compiler will tell you that it has honored your loop directives. If you're familiar with CUDA, it'll also tell you how the loops are mapped to CUDA thread blocks.

matvec(const matrix &, const vector &, const vector &):
      8, include "matrix_functions.h"
     12, Generating copyin(Acoefs[:A->nnz],cols[:A->nnz])
         Generating implicit copyin(row_offsets[:num_rows+1])
         Generating copyin(xcoefs[:num_rows])
         Generating copyout(ycoefs[:num_rows])
         Accelerator kernel generated
         Generating Tesla code
          5, Vector barrier inserted due to potential dependence into a vector loop
         16, #pragma acc loop gang, worker(4) /* blockIdx.x threadIdx.y */
         21, #pragma acc loop vector(32) /* threadIdx.x */
             Generating reduction(+:sum)
             Vector barrier inserted due to potential dependence out of a vector loop
     21, Loop is parallelizable

Tuning Matvec

Speed-up from serial:
Original:          1.00X
Vector Length 32:  1.39X
Num Workers 32:    3.20X

Source: PGI 16.9, NVIDIA Tesla K80


Final Performance

[Bar chart: speed-up from serial for the Serial, Multicore, K80 (single), and P100 versions; axis up to 25X.]
Source: PGI 16.9; Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz


The device_type clause

Use device_type to specialize optimizations to specific hardware. The compiler will choose values for all other targets.

#pragma acc parallel loop \
    device_type(nvidia) vector_length(256) \
    device_type(radeon) vector_length(512)
for(int i = 0; i < n ; i++)
{
  ...;
}

Common optimizations

The collapse Clause

collapse(n): Applies the associated directive to the following n tightly nested loops.

#pragma acc parallel loop collapse(2)
for(int i=0; i<N; i++)
  for(int j=0; j<M; j++)
    ...

is roughly equivalent to manually combining the loops:

#pragma acc parallel loop
for(int ij=0; ij<N*M; ij++)
  ...

Collapse outer loops to enable creating more gangs.
Collapse inner loops to enable longer vector lengths.
Collapse all loops, when possible, to do both.

The tile clause

Operate on smaller blocks of the operation to exploit data locality.

#pragma acc loop tile(4,4)
for(i = 1; i <= ROWS; i++) {
  for(j = 1; j <= COLUMNS; j++) {
    Temp[i][j] = 0.25 *
      (Temp_last[i+1][j] +
       Temp_last[i-1][j] +
       Temp_last[i][j+1] +
       Temp_last[i][j-1]);
  }
}

Stride-1 Memory Accesses

Before:
for(i=0; i<N; i++)
  for(j=0; j<M; j++)
  {
    A[i][j][1] = 1.0f;
    A[i][j][2] = 0.0f;
  }
The fastest dimension is length 2 and the fastest loop strides by 2.

After:
for(i=0; i<N; i++)
  for(j=0; j<M; j++)
  {
    A[1][i][j] = 1.0f;
    A[2][i][j] = 0.0f;
  }
Now the inner loop is the fastest dimension through memory.

Stride-1 Memory Accesses

Before (array of structures):
for(i=0; i<N; i++)
  for(j=0; j<M; j++)
  {
    A[i][j].a = 1.0f;
    A[i][j].b = 0.0f;
  }
If all threads access the "a" element, they will access every other memory element.

After (structure of arrays):
for(i=0; i<N; i++)
  for(j=0; j<M; j++)
  {
    Aa[i][j] = 1.0f;
    Ab[i][j] = 0.0f;
  }
Now all threads access contiguous elements of Aa and Ab.

Using QwikLabs

Getting access

1. Create an account with NVIDIA qwikLABS: https://developer.nvidia.com/qwiklabs-signup
2. Enter the promo code OPENACC16 before submitting the form
3. Free credits will be added to your account
4. Start using OpenACC!

This week’s labs
This week you should complete the "Profiling and Parallelizing with OpenACC" and "Expressing Data Movement and Optimizing Loops with OpenACC" labs in qwiklabs.

CERTIFICATION
Available after November 9th
1. Attend live lectures
2. Complete the test
3. Enter for a chance to win a Titan X or an OpenACC Book
Official rules: http://developer.download.nvidia.com/compute/OpenACC-Toolkit/docs/TITANX-GIVEAWAY-OPENACC-Official-Rules-2016.pdf

OPENACC TOOLKIT
Free for Academia
Download link: https://developer.nvidia.com/openacc-toolkit

NEW OPENACC BOOK
Parallel Programming with OpenACC
Available starting Nov 1st, 2016:
http://store.elsevier.com/Parallel-Programming-with-OpenACC/Rob-Farber/isbn-9780124103979/

Where to find help
• OpenACC Course Recordings - https://developer.nvidia.com/openacc-courses
• PGI Website - http://www.pgroup.com/resources
• OpenACC on StackOverflow - http://stackoverflow.com/questions/tagged/openacc
• OpenACC Toolkit - http://developer.nvidia.com/openacc-toolkit
• Parallel Forall Blog - http://devblogs.nvidia.com/parallelforall/
• GPU Technology Conference - http://www.gputechconf.com/
• OpenACC Website - http://openacc.org/

Questions? Email [email protected]


Course Syllabus

Oct 26: Analyzing and Parallelizing with OpenACC
Nov 2: OpenACC Optimizations
Nov 9: Advanced OpenACC

Recordings: https://developer.nvidia.com/intro-to-openacc-course-2016
Questions? Email [email protected]
