Introduction To OpenACC Course (November 2, 2016, 15:30)
2
Oct 26: Analyzing and Parallelizing with OpenACC
Recordings: https://developer.nvidia.com/intro-to-openacc-course-2016
OPENACC OPTIMIZATIONS
Lecture 2: Jeff Larkin, NVIDIA
Today’s Objectives
Understand OpenACC data directives
Understand the 3 levels of OpenACC parallelism
Understand how to optimize loop decomposition
Understand other common optimizations to OpenACC codes
[Diagram: Analyze / Parallelize / Optimize cycle]
5
3 Steps to Accelerate with OpenACC
Analyze, Parallelize, Optimize
6
8
Analyze
9
Parallelize
10
Parallelize
11
Performance after step 2…
[Chart: speed-up from serial for Serial, Multicore, K80 (single), and P100; axis 0.00X to 10.00X]
13
Optimize
Last week we relied on Unified Virtual Memory to expose our data to both the CPU and GPU.
To make our code more portable and give the compiler more information, we will replace UVM with OpenACC data directives.
15
Case Study: Remove the Managed Memory Flag
[Compiler feedback screenshot for matvec(const matrix &, const vector &, const vector &)]
16
Data Clauses
copyin(list): Allocates memory on the GPU and copies data from the host to the GPU when entering the region.
copyout(list): Allocates memory on the GPU and copies data to the host when exiting the region.
copy(list): Allocates memory on the GPU, copies data from the host to the GPU when entering the region, and copies data to the host when exiting the region. (Structured only)
present(list): Data is already present on the GPU from another containing data region.
(!) All of these will check whether the data is already present first and reuse it if found.
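As an illustration (not from the slides), a minimal sketch of the first three clauses on a compute construct; the function, arrays, and size n are hypothetical:

void clause_example(const double *a, double *b, double *c, int n) {
  // a is read-only (copyin), b is write-only (copyout), c is read then overwritten (copy)
  #pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n]) copy(c[0:n])
  for (int i = 0; i < n; i++) {
    b[i] = a[i] + c[i];
    c[i] = 2.0 * a[i];
  }
}

present(list) would instead assert that an enclosing data region has already placed the arrays on the GPU.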
Array Shaping
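The slide's code is not reproduced here; as a hedged sketch, an OpenACC array shape is written [start:count], so only the named portion of an array is allocated and moved (the function and names below are illustrative):

void shape_example(const double *a, double *b, int n) {
  // Shapes are [start:count]: only elements 1 through n-2 of each array
  // are allocated and copied, not the whole arrays.
  #pragma acc data copyin(a[1:n-2]) copyout(b[1:n-2])
  {
    #pragma acc parallel loop
    for (int i = 1; i < n - 1; i++)
      b[i] = 2.0 * a[i];
  }
}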
18
Matvec Data Clauses

#pragma acc parallel loop \
  copyout(ycoefs[:num_rows]) \
  copyin(Acoefs[:A.nnz], xcoefs[:num_rows], cols[:A.nnz])
for(int i=0;i<num_rows;i++) {
  ...
  #pragma acc loop reduction(+:sum)
  for(int j=row_start; j<row_end; j++) {
    ...;
  }
  ycoefs[i]=sum;
}

The compiler needs additional information about several arrays used in matvec: it cannot determine the bounds of the “j” loop, and therefore cannot determine the bounds of these arrays. Data clauses aren’t strictly needed in dot and waxpby because there the compiler can determine the array shapes from the loop bounds.
19
PGPROF: Data Movement
20
Manage Data Higher in the Program
Currently data is moved at the beginning and end of each function, in case the data is needed on the CPU.
We know that the data is only needed on the CPU after convergence.
We should inform the compiler when data movement is really needed to improve performance.
21
Structured Data Regions
The data directive defines a region of code in which GPU arrays remain on the GPU and are shared among all kernels in that region.

!$acc data
  !$acc parallel loop
  ...
  !$acc parallel loop
  ...
!$acc end data

Arrays used within the data region will remain on the GPU until the end of the data region.
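A hedged C sketch of the same pattern (the function, arrays, and kernels here are illustrative, not the case-study code):

void two_kernels(const double *a, double *b, int n) {
  // Both loops reuse the device copies created by the enclosing data region,
  // so no host/device transfers occur between the two kernels.
  #pragma acc data copyin(a[0:n]) copyout(b[0:n])
  {
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
      b[i] = 2.0 * a[i];

    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
      b[i] += a[i];
  } // b is copied back to the host only once, here
}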
23
Unstructured Data Directives
Used to define data regions when scoping doesn’t allow the use of normal data regions (e.g. the constructor/destructor of a class).
enter data: Defines the start of an unstructured data lifetime
• clauses: copyin(list), create(list)
exit data: Defines the end of an unstructured data lifetime
• clauses: copyout(list), delete(list), finalize
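For the constructor/destructor case mentioned above, a hedged C++ sketch (the class and member names are hypothetical, not the deck's code):

class Vector {
  double *coefs;   // host storage
  int     n;
public:
  Vector(int n_) : coefs(new double[n_]), n(n_) {
    #pragma acc enter data create(coefs[0:n])   // device lifetime begins with the object
  }
  ~Vector() {
    #pragma acc exit data delete(coefs[0:n])    // remove the device copy first
    delete[] coefs;                             // then free the host copy
  }
};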
26
Explicit Data Movement: Delete Matrix

void free_matrix(matrix &A) {
  unsigned int *row_offsets = A.row_offsets;
  unsigned int *cols        = A.cols;
  double       *coefs       = A.coefs;
  #pragma acc exit data delete(A.row_offsets, A.cols, A.coefs)
  #pragma acc exit data delete(A)
  free(row_offsets);
  free(cols);
  free(coefs);
}

Before freeing the matrix, remove it from the device: delete the members first, then the structure. We must do the same in vector.h.
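The allocation side is not shown on this slide; a hedged sketch of the matching enter data directives (the placement and shapes are my assumption, using the member names from the deck), added at the end of allocate_3d_poisson_matrix:

// Copy the structure first, then copy its members, so the device copy
// of A ends up pointing at the device arrays.
#pragma acc enter data copyin(A)
#pragma acc enter data copyin(A.row_offsets[0:A.num_rows+1], \
                              A.cols[0:A.nnz], A.coefs[0:A.nnz])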
27
Running With Explicit Memory Management
Rebuild the code without managed memory: change -ta=tesla:managed to just -ta=tesla.

Expected:
Rows: 8120601, nnz: 218535025
Iteration: 0, Tolerance: 4.0067e+08
Iteration: 10, Tolerance: 1.8772e+07
Iteration: 20, Tolerance: 6.4359e+05
Iteration: 30, Tolerance: 2.3202e+04
Iteration: 40, Tolerance: 8.3565e+02
Iteration: 50, Tolerance: 3.0039e+01
Iteration: 60, Tolerance: 1.0764e+00
Iteration: 70, Tolerance: 3.8360e-02
Iteration: 80, Tolerance: 1.3515e-03
Iteration: 90, Tolerance: 4.6209e-05
Total Iterations: 100 Total Time: 8.458965s

Actual:
Rows: 8120601, nnz: 218535025
Iteration: 0, Tolerance: 1.9497e+05
Iteration: 10, Tolerance: 1.6919e+02
Iteration: 20, Tolerance: 6.2901e+00
Iteration: 30, Tolerance: 2.0165e-01
Iteration: 40, Tolerance: 7.4122e-03
Iteration: 50, Tolerance: 2.5316e-04
Iteration: 60, Tolerance: 9.9229e-06
Iteration: 70, Tolerance: 3.4854e-07
Iteration: 80, Tolerance: 1.2859e-08
Iteration: 90, Tolerance: 5.3950e-10
Total Iterations: 100 Total Time: 8.454335s

The run completes, but the residuals no longer match the expected values, which suggests the host and device copies of the data have fallen out of sync; the update directive introduced next is the tool for fixing this.
28
OpenACC Update Directive
The programmer specifies an array (or part of an array) that should be refreshed within a data region.

do_something_on_device()
!$acc update host(a)      ! Copy "a" from the GPU to the CPU
do_something_on_host()
!$acc update device(a)    ! Copy "a" from the CPU to the GPU
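In C the directive takes the same clauses; a hedged sketch of how it could be used in the case study after initializing a vector on the host (the placement is my assumption, the array names follow the deck):

// The vector was created on the device with "enter data create"; after
// filling it on the host, push the new values to the GPU.
for (int i = 0; i < num_rows; i++)
  xcoefs[i] = 1.0;
#pragma acc update device(xcoefs[0:num_rows])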
30
PGPROF: Data Movement Now
31
Optimize Loops
32
Optimizing Matvec Loops

Compiler feedback:
matvec(const matrix &, const vector &, const vector &):
      8, include "matrix_functions.h"
          12, Generating copyin(Acoefs[:A->nnz],cols[:A->nnz])
              Generating implicit copyin(row_offsets[:num_rows+1])
              Generating copyin(xcoefs[:num_rows])
              Generating copyout(ycoefs[:num_rows])
              Accelerator kernel generated
              Generating Tesla code
              16, #pragma acc loop gang /* blockIdx.x */
              21, #pragma acc loop vector(128) /* threadIdx.x */
                  Generating reduction(+:sum)
              21, Loop is parallelizable

Source:
14 #pragma acc parallel loop \
15   copyout(ycoefs[:num_rows]) copyin(Acoefs[:A.nnz],xcoefs[:num_rows],cols[:A.nnz])
16 for(int i=0;i<num_rows;i++) {
17   double sum=0;
18   int row_start=row_offsets[i];
19   int row_end=row_offsets[i+1];
20   #pragma acc loop reduction(+:sum)
21   for(int j=row_start;j<row_end;j++) {
22     unsigned int Acol=cols[j];
23     double Acoef=Acoefs[j];
24     double xcoef=xcoefs[Acol];
25     sum+=Acoef*xcoef;
26   }
27   ycoefs[i]=sum;
28 }
29 }
33
Optimizing Matvec Loops
The compiler is vectorizing 128 iterations of the inner loop (line 21). How many iterations does it really do?
34
Optimizing Matvec Loops (cont.)
The compiler does not know how many iterations the inner loop will do, so it chooses a default vector length of 128.
We can see in the initialization routine that it will only iterate 27 times (the number of non-zeros per row).
Reducing the vector length should improve hardware utilization.

14 void allocate_3d_poisson_matrix(matrix &A, int N) {
15   int num_rows=(N+1)*(N+1)*(N+1);
16   int nnz=27*num_rows;
17   A.num_rows=num_rows;
18   A.row_offsets=(unsigned int*)malloc((num_rows+1)*sizeof(unsigned int));
19   A.cols=(unsigned int*)malloc(nnz*sizeof(unsigned int));
20   A.coefs=(double*)malloc(nnz*sizeof(double));
35
OpenACC: 3 Levels of Parallelism
• Vector threads work in lockstep (SIMD/SIMT parallelism)
• Workers compute a vector
• Gangs have 1 or more workers and share resources (such as cache, the streaming multiprocessor, etc.)
• Multiple gangs work independently of each other
[Diagram: each gang contains workers, each worker a set of vector threads]
36
OpenACC gang, worker, vector Clauses
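The slide's code is not reproduced here; a hedged sketch of how the three clauses map loops to the levels (the function, arrays, and sizes are illustrative):

void add2d(const double *a, const double *b, double *c, int n, int m) {
  // Outer loop split across gangs and 4 workers; inner loop across 32-wide vectors.
  #pragma acc parallel loop gang worker num_workers(4) vector_length(32) \
                            copyin(a[0:n*m], b[0:n*m]) copyout(c[0:n*m])
  for (int i = 0; i < n; i++) {
    #pragma acc loop vector
    for (int j = 0; j < m; j++)
      c[i*m + j] = a[i*m + j] + b[i*m + j];
  }
}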
38
Optimizing Matvec Loops: Vector Length
Use the OpenACC loop directive to force the compiler to vectorize the inner loop.
Use vector_length to reduce the vector length closer to the actual loop iteration count.
Note: NVIDIA GPUs need vector lengths that are multiples of 32 (the warp size).

14 #pragma acc parallel loop vector_length(32) \
15   copyout(ycoefs[:num_rows]) copyin(Acoefs[:A.nnz],xcoefs[:num_rows],cols[:A.nnz])
16 for(int i=0;i<num_rows;i++) {
17   double sum=0;
18   int row_start=row_offsets[i];
19   int row_end=row_offsets[i+1];
20   #pragma acc loop vector reduction(+:sum)
21   for(int j=row_start;j<row_end;j++) {
22     unsigned int Acol=cols[j];
23     double Acoef=Acoefs[j];
24     double xcoef=xcoefs[Acol];
25     sum+=Acoef*xcoef;
26   }
27   ycoefs[i]=sum;
28 }
39
Matvec Performance Limiter
Instruction and Memory latency are limiting kernel performance.
43
Matvec Occupancy
Occupancy is a measure of how well the GPU is being utilized.
49
Optimizing Matvec Loops: Increase Workers
By splitting the iterations of the outer loop among workers and gangs, we’ll increase the size of the gangs.
The compiler will handle breaking up the outer loop so you don’t have to.
Now we should have 4 x 32 = 128 threads in each GPU threadblock.

14 #pragma acc parallel loop gang worker num_workers(4) vector_length(32) \
15   copyout(ycoefs[:num_rows]) copyin(Acoefs[:A.nnz],xcoefs[:num_rows],cols[:A.nnz])
16 for(int i=0;i<num_rows;i++) {
17   double sum=0;
18   int row_start=row_offsets[i];
19   int row_end=row_offsets[i+1];
20   #pragma acc loop vector reduction(+:sum)
21   for(int j=row_start;j<row_end;j++) {
22     unsigned int Acol=cols[j];
23     double Acoef=Acoefs[j];
24     double xcoef=xcoefs[Acol];
25     sum+=Acoef*xcoef;
26   }
27   ycoefs[i]=sum;
28 }
50
Increase Workers: Compiler feedback
The compiler will tell you that it has honored your loop directives. If you’re familiar with CUDA, it’ll also tell you how the loops are mapped to CUDA thread blocks.

matvec(const matrix &, const vector &, const vector &):
      8, include "matrix_functions.h"
          12, Generating copyin(Acoefs[:A->nnz],cols[:A->nnz])
              Generating implicit copyin(row_offsets[:num_rows+1])
              Generating copyin(xcoefs[:num_rows])
              Generating copyout(ycoefs[:num_rows])
              Accelerator kernel generated
              Generating Tesla code
               5, Vector barrier inserted due to potential dependence into a vector loop
              16, #pragma acc loop gang, worker(4) /* blockIdx.x threadIdx.y */
              21, #pragma acc loop vector(32) /* threadIdx.x */
                  Generating reduction(+:sum)
                  Vector barrier inserted due to potential dependence out of a vector loop
              21, Loop is parallelizable
51
Tuning Matvec
[Chart: speed-up of the Original, Vector Length 32, and Num Workers 32 kernel variants]
[Chart: speed-up from serial for Serial, Multicore, K80 (single), and P100; axis 0.00X to 20.00X]
55
Common optimizations
56
The collapse Clause
collapse(n): Applies the associated directive to the following n tightly nested loops.
In the slide's example, the fastest dimension is length 2 and the fastest loop strides by 2; after collapsing, the inner loop becomes the fastest dimension through memory.
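A hedged sketch of the clause (the function, array, and loop bounds are illustrative, not the slide's code):

void scale2d(double *a, int n, int m) {
  // collapse(2) merges the i and j loops into one n*m iteration space,
  // so the whole nest is distributed across gangs and vector lanes.
  #pragma acc parallel loop collapse(2) copy(a[0:n*m])
  for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
      a[i*m + j] *= 2.0;
}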
59
Stride-1 Memory Accesses
In the slide's example, if all threads access the “a” element, they access every other memory element; after the change, all threads access contiguous elements of Aa and Ab.
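This appears to be the classic array-of-structs versus struct-of-arrays change; a hedged sketch (the types, functions, and names are assumptions based on the callouts, not the slide's code):

// Before: an array of structs; threads reading only the "a" member touch
// every other double in memory (stride 2).
struct ab { double a, b; };
void scale_aos(const struct ab *AB, double *out, int n) {
  #pragma acc parallel loop copyin(AB[0:n]) copyout(out[0:n])
  for (int i = 0; i < n; i++)
    out[i] = 2.0 * AB[i].a;
}

// After: the members live in separate arrays Aa and Ab, so neighboring
// threads read neighboring (contiguous, stride-1) elements.
void scale_soa(const double *Aa, double *out, int n) {
  #pragma acc parallel loop copyin(Aa[0:n]) copyout(out[0:n])
  for (int i = 0; i < n; i++)
    out[i] = 2.0 * Aa[i];
}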
60
Using QwikLabs
61
Getting access
62
This week’s labs
This week you should complete the “Profiling and Parallelizing with OpenACC” and “Expressing Data Movement and Optimizing Loops with OpenACC” labs in qwiklabs.
63
CERTIFICATION: Available after November 9th
OPENACC TOOLKIT: Free for Academia