0% found this document useful (0 votes)

20 views64 pages

Introduction To OpenACC Course 20161102 1530 1

The document outlines a lecture on OpenACC optimizations aimed at accelerating applications. It covers key concepts such as data directives, levels of parallelism, and optimization techniques, including loop decomposition and data movement strategies. A case study on the conjugate gradient method is provided to illustrate the application of these concepts in C/C++ and Fortran code.

Uploaded by

foertter

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

20 views64 pages

Introduction To OpenACC Course 20161102 1530 1

Uploaded by

foertter

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 64

INTRODUCTION TO OPENACC

Lecture 2: OpenACC Optimizations, November 2, 2016

Course Objective:

Enable you to accelerate your applications

with OpenACC.

2
Oct 26: Analyzing and Parallelizing with OpenACC

Course Syllabus Nov 2: OpenACC Optimizations

Nov 9: Advanced OpenACC

Recordings: 3
https://developer.nvidia.com/intro-to-openacc-course-2016
OPENACC OPTIMIZATIONS
Lecture 2: Jeff Larkin, NVIDIA
Today’s Objectives

Analyze
Understand OpenACC data directives
Understand the 3 levels of OpenACC parallelism
Understand how to optimize loop decomposition
Understand other common optimizations to
OpenACC codes Optimize Parallelize

5
3 Steps to Accelerate with OpenACC

Analyze

Optimize Parallelize

6
11/1/2016

Case Study: Conjugate Gradient

A sample code implementing the conjugate

gradient method has been provided in C/C++ and
Fortran.
• To save space, only the C will be shown in slides.

You do not need to understand the algorithm to

proceed, but should be able to understand C, C++,
or Fortran.
For more information on the CG method, see
https://en.wikipedia.org/wiki/Conjugate_gradient
_method
7
Analyze

8
Analyze

Obtain a performance profile

Read compiler feedback
Understand the code.

9
Parallelize

10
Parallelize

Insert OpenACC directives

around important loops
Enable OpenACC in the compiler
Run on a parallel platform

11
Performance after step 2…
Serial Multicore K80 (single) P100
10.00X

5.00X
Speed-up from serial

0.00X
Serial Multicore K80 (single) P100

Source: PGI 16.9, Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz 12

Optimize

13
Optimize

Get new performance data from

parallel execution
Remove unnecessary data
transfer to/from GPU
Guide the compiler to better
loop decomposition
Refactor the code to make it
more parallel
14
Optimize Data Movement

Last week we relied on Unified Virtual Memory to expose our data to both the GPU
and GPU.
To make our code more portable and give the compiler more information, we will
replace UVM with OpenACC data directives

15
Case Study: Remove Managed Memory

Remove the “managed” PGCC-S-0155-Compiler failed to translate accelerator

region (see -Minfo messages): Could not find
suboption to the –ta compiler allocated-variable index for symbol (main.cpp: 12)

flag
matvec(const matrix &, const vector &, const vector
&):
8, include "matrix_functions.h"

Now the compiler aborts 12, Accelerator kernel generated

Generating Tesla code
because it doesn’t know the 15, #pragma acc loop gang /* blockIdx.x */
20, #pragma acc loop vector(128) /*
sizes of the arrays used in threadIdx.x */

matvec function Generating reduction(+:sum)

20, Accelerator restriction: size of the GPU
copy of Acoefs,cols,xcoefs is unknown
Loop is parallelizable
PGCC/x86 Linux 16.9-0: compilation completed with
severe errors

16
Data Clauses
copyin ( list ) Allocates memory on GPU and copies data from host to GPU when
entering region.

copyout ( list ) Allocates memory on GPU and copies data to the host when exiting
region.

copy ( list ) Allocates memory on GPU and copies data from host to GPU when
entering region and copies data to the host when exiting region.
(Structured Only)

create ( list ) Allocates memory on GPU but does not copy.

delete( list ) Deallocate memory on the GPU without copying. (Unstructured

Only)

present ( list ) Data is already present on GPU from another containing data
region.

(!) All of these will check if the data is already present first and reuse if found. 17
Array Shaping

Compiler sometimes cannot determine size of arrays

Must specify explicitly using data clauses and array “shape”
Partial arrays must be contiguous
C/C++
#pragma acc data copyin(a[0:nelem]) copyout(b[s/4:3*s/4])
Fortran
!$acc data copyin(a(1:end)) copyout(b(s/4:3*s/4))

18
Matvec Data Clauses
#pragma acc parallel loop\ The compiler needs additional
copyout(ycoefs[:num_rows]) information about several arrays
copyin(Acoefs[:A.nnz],
xcoefs[:num_rows],
used in matvec.
cols[:A.nnz])
for(int i=0;i<num_rows;i++) { Compiler cannot determine the
... bounds of “j” loop to determine
#pragma acc loop reduction(+:sum) the bounds of these arrays.
for(int j=row_start; j<row_end;
j++) { Data clauses aren’t strictly
...; needed in dot and waxpby
}
because the compiler can
ycoefs[i]=sum;
} determine the array shape from
the loop bounds.
19
PGPROF: Data Movement

20
Manage Data Higher in the Program

Currently data is moved at the beginning and end of each function, in case the data
is needed on the CPU
We know that the data is only needed on the CPU after convergence
We should inform the compiler when data movement is really needed to improved
performance

21
Structured Data Regions

The data directive defines a region of code in which GPU arrays remain on the GPU
and are shared among all kernels in that region.

#pragma acc data

{ Arrays used within the
#pragma acc parallel loop data region will remain
... on the GPU until the
Data Region
#pragma acc parallel loop end of the data region.
...
}

22
Structured Data Regions

The data directive defines a region of code in which GPU arrays remain on the GPU
and are shared among all kernels in that region.

!$acc data
!$acc parallel loop Arrays used within the
... data region will remain
Data Region on the GPU until the
!$acc parallel loop
... end of the data region.
!$acc end data

23
Unstructured Data Directives
Used to define data regions when scoping doesn’t allow the use of normal data
regions (e.g. the constructor/destructor of a class).
enter data Defines the start of an unstructured data lifetime
• clauses: copyin(list), create(list)
exit data Defines the end of an unstructured data lifetime
• clauses: copyout(list), delete(list), finalize

#pragma acc enter data copyin(a)

...
#pragma acc exit data delete(a)
24
Unstructured Data: C++ Classes
class Matrix {
Matrix(int n) {
Unstructured Data Regions len = n;
enable OpenACC to be used in v = new double[len];
C++ classes #pragma acc enter data
create(v[0:len])
}
~Matrix() {
Unstructured data regions can #pragma acc exit data
be used whenever data is delete(v[0:len])
delete[] v;
allocated and initialized in a
}
different scope than where it is
freed (e.g. Fortran modules). private:
double* v;
int len;
}; 25 25
Explicit Data Movement: Copy In Matrix
void allocate_3d_poission_matrix(matrix &A, int N) {
int num_rows=(N+1)*(N+1)*(N+1);
After allocating
int nnz=27*num_rows; and initializing our
A.num_rows=num_rows;
A.row_offsets = (unsigned int*) \
matrix, copy it to
malloc((num_rows+1)*sizeof(unsigned int)); the device.
A.cols = (unsigned int*)malloc(nnz*sizeof(unsigned int));
A.coefs = (double*)malloc(nnz*sizeof(double));
Copy the structure
// Initialize Matrix first and its
A.row_offsets[num_rows]=nnz; members second.
A.nnz=nnz;
#pragma acc enter data copyin(A)
#pragma acc enter data \
copyin(A.row_offsets[:num_rows+1],A.cols[:nnz],A.coefs[:nnz])
}

26
Explicit Data Movement: Delete Matrix
void free_matrix(matrix &A) {
unsigned int *row_offsets=A.row_offsets;
Before freeing the
unsigned int * cols=A.cols; matrix, remove it
double * coefs=A.coefs;
from the device.
#pragma acc exit data delete(A.row_offsets,A.cols,A.coefs)
#pragma acc exit data delete(A) Delete the
free(row_offsets);
free(cols); members first,
free(coefs); then the structure.
}

We must do the
same in vector.h.

27
Running With Explicit Memory Management
Rebuild the code without managed memory. Change –ta=tesla:managed to
just –ta=tesla

Expected: Actual:
Rows: 8120601, nnz: 218535025 Rows: 8120601, nnz: 218535025
Iteration: 0, Tolerance: 4.0067e+08 Iteration: 0, Tolerance: 1.9497e+05
Iteration: 10, Tolerance: 1.8772e+07 Iteration: 10, Tolerance: 1.6919e+02
Iteration: 20, Tolerance: 6.4359e+05 Iteration: 20, Tolerance: 6.2901e+00
Iteration: 30, Tolerance: 2.3202e+04 Iteration: 30, Tolerance: 2.0165e-01
Iteration: 40, Tolerance: 8.3565e+02 Iteration: 40, Tolerance: 7.4122e-03
Iteration: 50, Tolerance: 3.0039e+01 Iteration: 50, Tolerance: 2.5316e-04
Iteration: 60, Tolerance: 1.0764e+00 Iteration: 60, Tolerance: 9.9229e-06
Iteration: 70, Tolerance: 3.8360e-02 Iteration: 70, Tolerance: 3.4854e-07
Iteration: 80, Tolerance: 1.3515e-03 Iteration: 80, Tolerance: 1.2859e-08
Iteration: 90, Tolerance: 4.6209e-05 Iteration: 90, Tolerance: 5.3950e-10
Total Iterations: 100 Total Time: Total Iterations: 100 Total Time:
8.458965s 8.454335s
28
OpenACC Update Directive

Programmer specifies an array (or part of an array) that should be refreshed within a
data region.

do_something_on_device()
Copy “a” from GPU to
!$acc update host(a) CPU
do_something_on_host()
Copy “a” from CPU to
!$acc update device(a) GPU

Note: Update “host” has been deprecated and renamed “self”

29
Explicit Data Movement: Update Vector

void initialize_vector(vector &v,double val)

{
After we change vector on the
CPU, we need to update it on
for(int i=0;i<v.n;i++)
v.coefs[i]=val;
the GPU.
#pragma acc update device(v.coefs[:v.n])
}

Update device : CPU -> GPU

Update self/host: GPU -> CPU

30
PGPROF: Data Movement Now

31
Optimize Loops

Now let’s look at how our iterations get mapped to hardware.

Compilers give their best guess about how to transform loops into
parallel kernels, but sometimes they need more information.
This information could be our knowledge of the code or based on
profiling.

32
Optimizing Matvec Loops
matvec(const matrix &, const vector &, 14 #pragma acc parallel loop \
const vector &): 15 copyout(ycoefs[:num_rows])
8, include "matrix_functions.h" copyin(Acoefs[:A.nnz],xcoefs[:num_rows],cols[:A.n
12, Generating copyin(Acoefs[:A->nnz], nz])
cols[:A->nnz]) 16 for(int i=0;i<num_rows;i++) {
Generating implicit 17 double sum=0;
copyin(row_offsets[:num_rows+1]) 18 int row_start=row_offsets[i];
Generating copyin(xcoefs[:num_rows]) 19 int row_end=row_offsets[i+1];
Generating copyout(ycoefs[:num_rows]) 20 #pragma acc loop reduction(+:sum)
Accelerator kernel generated 21 for(int j=row_start;j<row_end;j++) {
Generating Tesla code 22 unsigned int Acol=cols[j];
16, #pragma acc loop gang /* 23 double Acoef=Acoefs[j];
blockIdx.x */ 24 double xcoef=xcoefs[Acol];
21, #pragma acc loop vector(128) /* 25 sum+=Acoef*xcoef;
threadIdx.x */ 26 }
Generating reduction(+:sum) 27 ycoefs[i]=sum;
21, Loop is parallelizable 28 }
29 }
33
Optimizing Matvec Loops
matvec(const matrix &, const vector &, 14 #pragma acc parallel loop \
const vector &): 15 copyout(ycoefs[:num_rows])
8, include "matrix_functions.h" copyin(Acoefs[:A.nnz],xcoefs[:num_rows],cols[:A.n
12, Generating copyin(Acoefs[:A->nnz], nz])
cols[:A->nnz]) 16 for(int i=0;i<num_rows;i++) {
Generating implicit 17 double sum=0;
copyin(row_offsets[:num_rows+1]) 18 int row_start=row_offsets[i];
Generating copyin(xcoefs[:num_rows]) 19 int row_end=row_offsets[i+1];
Generating copyout(ycoefs[:num_rows]) 20 #pragma acc loop reduction(+:sum)
Accelerator kernel generated 21 for(int j=row_start;j<row_end;j++) {
Generating Tesla code 22 unsigned int Acol=cols[j];
16, #pragma acc loop gang /* 23 double Acoef=Acoefs[j];
blockIdx.x */ 24 double xcoef=xcoefs[Acol];
21, #pragma acc loop vector(128) /* 25 sum+=Acoef*xcoef;
threadIdx.x */ 26 }
Generating reduction(+:sum) 27 ycoefs[i]=sum;
21, Loop is parallelizable
The compiler
28 }
is vectorizing 128
iterations
29 } of this loop. How many
iterations does it really do?
34
Optimizing Matvec Loops (cont.)
The compiler does not know how
many iterations the inner loop
will do, so it chooses a default 14 void allocate_3d_poisson_matrix(matrix
value of 128. &A, int N) {
15 int num_rows=(N+1)*(N+1)*(N+1);
We can see in the initialization 16
17
int nnz=27*num_rows;
A.num_rows=num_rows;
routine that it will only iterate 18 A.row_offsets=(unsigned
27 times (number of non-zeros int*)malloc((num_rows+1)*sizeof(unsigned
int));
per row). 19 A.cols=(unsigned
int*)malloc(nnz*sizeof(unsigned int));
Reducing the vector length 20
should improve hardware A.coefs=(double*)malloc(nnz*sizeof(double)
);
utilization
35
OpenACC: 3 Levels of Parallelism
• Vector threads work in
Vector Workers lockstep (SIMD/SIMT
parallelism)
Gang • Workers compute a vector
• Gangs have 1 or more
workers and share resources
Vector Workers (such as cache, the
streaming multiprocessor,
Gang etc.)
• Multiple gangs work
independently of each other

36
OpenACC gang, worker, vector Clauses

gang, worker, and vector can be added to a loop clause

A parallel region can only specify one of each gang, worker, vector
Control the size using the following clauses on the parallel region
num_gangs(n), num_workers(n), vector_length(n)

#pragma acc parallel loop gang #pragma acc parallel vector_length(32)

for (int i = 0; i < n; ++i) #pragma acc loop gang worker
#pragma acc loop vector for (int i = 0; i < n; ++i)
for (int j = 0; j < n; ++j) #pragma acc loop vector
... for (int j = 0; j < n; ++j)
...

38
Optimizing Matvec Loops: Vector Length
Use the OpenACC loop 14 #pragma acc parallel loop
vector_length(32) \
directive to force the compiler 15 copyout(ycoefs[:num_rows])
to vectorizer the inner loop. copyin(Acoefs[:A.nnz],xcoefs[:num_rows],cols
[:A.nnz])
Use vector_length to reduce 16 for(int i=0;i<num_rows;i++) {
17 double sum=0;
the vector length closer to 18 int row_start=row_offsets[i];
actual loop iterations 19 int row_end=row_offsets[i+1];
20 #pragma acc loop vector reduction(+:sum)
21 for(int j=row_start;j<row_end;j++) {
22 unsigned int Acol=cols[j];
Note: NVIDIA GPUs need 23
24
double Acoef=Acoefs[j];
double xcoef=xcoefs[Acol];
vector lengths that are 25 sum+=Acoef*xcoef;
multiples of 32 (warp size) 26 }
27 ycoefs[i]=sum;
28 }
39
Matvec Performance Limiter

40
Matvec Performance Limiter
Instruction and Memory
latency are limiting kernel
performance.

41
Matvec Performance Limiter
Instruction and Memory
latency are limiting kernel
performance.

Recall: GPU’s tolerate

latency by having
enough parallelism

42
Matvec Performance Limiter

PGPROF will guide you to

performance a latency
analysis to understand why

43
Matvec Occupancy

Occupancy is a measure of how

well the GPU is being utilized.

100% occupancy means the GPU is

running as many simultaneous
threads as it can.

44
Matvec Occupancy

Occupancy is a measure of how

well the GPU is being utilized.

100% occupancy means the GPU is

running as many simultaneous
threads as it can.

We’re only keeping the GPU 25%

occupied, why?

45
Matvec Occupancy
Occupancy is a measure of how
well the GPU is being utilized.

100% occupancy means the GPU is

running as many simultaneous
threads as it can.

We’re only keeping the GPU 25%

occupied, why?

We need 64 threadblocks to get

100% occupancy, but the hardware
can only manage 16. Why?

46
Matvec Occupancy
Occupancy is a measure of how
well the GPU is being utilized.

100% occupancy means the GPU is

running as many simultaneous
threads as it can.

We’re only keeping the GPU 25%

occupied, why?

We need 64 threadblocks to get

100% occupancy, but the hardware
can only manage 16. Why?

We’ve reduced our vector length

so that each block has only 1 warp

47
Matvec Occupancy Occupancy is a measure of how
well the GPU is being utilized.

100% occupancy means the GPU is

running as many simultaneous
threads as it can.

We’re only keeping the GPU 25%

occupied, why?

We need 64 threadblocks to get

100% occupancy, but the hardware
can only manage 16. Why?

We’ve reduced our vector length

so that each block has only 1 warp

So we need at least 4X more

parallelism per gang
48
Increasing per-gang parallelism
• The inner loop lacks
Vector Workers sufficient parallelism to
occupy the GPU
Gang
• We can increase the size of
the gang by increasing the
Vector Workers number of workers

Gang

49
Optimizing Matvec Loops: Increase Workers
By splitting the iterations of 14 #pragma acc parallel loop gang worker \
num_workers(4) vector_length(32) \
the outer loop among workers 15 copyout(ycoefs[:num_rows])
and gangs, we’ll increase the copyin(Acoefs[:A.nnz],xcoefs[:num_rows],cols
size of the gangs. [:A.nnz])
16 for(int i=0;i<num_rows;i++) {
17 double sum=0;
The compiler will handle 18 int row_start=row_offsets[i];
breaking up the outer loop so 19 int row_end=row_offsets[i+1];
you don’t have to. 20 #pragma acc loop vector reduction(+:sum)
21 for(int j=row_start;j<row_end;j++) {
22 unsigned int Acol=cols[j];
Now we should have 4x32 = 23 double Acoef=Acoefs[j];
128 threads in each GPU 24 double xcoef=xcoefs[Acol];
threadblock 25 sum+=Acoef*xcoef;
26 }
27 ycoefs[i]=sum;
28 }
50
Increase Workers: Compiler feedback
matvec(const matrix &, const vector &, const
vector &):
8, include "matrix_functions.h"
The compiler will tell you that it 12, Generating copyin(Acoefs[:A-
has honored your loop directives. >nnz],cols[:A->nnz])
Generating implicit
copyin(row_offsets[:num_rows+1])
Generating copyin(xcoefs[:num_rows])
Generating copyout(ycoefs[:num_rows])
Accelerator kernel generated
If you’re familiar with CUDA, it’ll Generating Tesla code
also tell you how the loops are 5, Vector barrier inserted due to
potential dependence into a vector loop
mapped the CUDA thread blocks 16, #pragma acc loop gang, worker(4)
/* blockIdx.x threadIdx.y */
21, #pragma acc loop vector(32) /*
threadIdx.x */
Generating reduction(+:sum)
Vector barrier inserted due to
potential dependence out of a vector loop
21, Loop is parallelizable
51
Tuning Matvec
Original Vector Length 32 Num Workers 32
5.00X

3.20X
Speed-up from serial

1.39X
1.00X

0.00X
Original Vector Length 32 Num Workers 32

Source: PGI 16.9, NVIDIA Tesla K80 52

Final Performance
Serial Multicore K80 (single) P100
25.00X

20.00X

15.00X

10.00X
Speed-up from serial

5.00X

0.00X
Serial Multicore K80 (single) P100

Source: PGI 16.9, Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz 53

The device_type clause

Use device_type to #pragma acc parallel loop \

specialize device_type(nvidia) vector_length(256)
optimizations to device_type(radeon) vector_length(512)
specific hardware. for(int i = 0; i < n ; i++)
The compiler will {
choose values for ...;
all other targets. }

55
Common optimizations

56
The collapse Clause
collapse(n): Applies the associated directive to the following n
tightly nested loops.

#pragma acc parallel loop \ #pragma acc parallel loop

collapse(2) for(int ij=0; ij<N*N; ij++)
for(int i=0; i<N; i++) ...
for(int j=0; j<M; j++)
...

Collapse outer loops to enable creating more gangs.

Collapse inner loops to enable longer vector lengths.
Collapse all loops, when possible, to do both.
57
The tile clause
Operate on smaller blocks of the operation to exploit data locality

#pragma acc loop tile(4,4)

for(i = 1; i <= ROWS; i++) {
for(j = 1; j <= COLUMNS; j++) {
Temp[i][j] = 0.25 *
(Temp_last[i+1][j] +
Temp_last[i-1][j] +
Temp_last[i][j+1] +
Temp_last[i][j-1]);
}
}
58
Stride-1 Memory Accesses

for(i=0; i<N; i++) for(i=0; i<N; i++)

for(j=0; j<M; j++) for(j=0; j<M; j++)
{ {
A[i][j][1] = 1.0f; A[1][i][j] = 1.0f;
A[i][j][2] = 0.0f; A[2][i][j] = 0.0f;
} }
} }

The fastest dimension is length 2 Now the inner loop is the fastest
and fastest loop strides by 2. dimension through memory.
59
Stride-1 Memory Accesses

for(i=0; i<N; i++) for(i=0; i<N; i++)

for(j=0; j<M; j++) for(j=0; j<M; j++)
{ {
A[i][j].a= 1.0f; Aa[i][j] = 1.0f;
A[i][j].b = 0.0f; Ab[i][j] = 0.0f;
} }
} }

If all threads access the “a” Now all threads are access
element, they will be accesses contiguous elements of Aa and Ab.
every-other memory element.
60
Using QwikLabs

61
Getting access

1. Create an account with NVIDIA qwikLABS

https://developer.nvidia.com/qwiklabs-
signup
2. Enter a promo code OPENACC16 before
submitting the form
3. Free credits will be added to your account
4. Start using OpenACC!

62
This week’s labs
This week you should
complete the “Profiling
and Parallelizing with
OpenACC” and
“Expressing Data
Movement and
Optimizing Loops with
OpenACC” labs in
qwiklabs.

63
CERTIFICATION OPENACC TOOLKIT
Available after November 9th Free for Academia

1. Attend live lectures Download link:

https://developer.nvidia.com/openacc-toolkit
2. Complete the test
3. Enter for a chance to win a
Titan X or an OpenACC Book

NEW OPENACC BOOK

Parallel Programming with OpenACC

Available starting Nov 1st, 2016:

http://store.elsevier.com/Parallel-Programming-
with-OpenACC/Rob-Farber/isbn-9780124103979/
Official rules:
http://developer.download.nvidia.com/compute/OpenACC-
64
Toolkit/docs/TITANX-GIVEAWAY-OPENACC-Official-Rules-2016.pdf
Where to find help
• OpenACC Course Recordings - https://developer.nvidia.com/openacc-courses
• PGI Website - http://www.pgroup.com/resources
• OpenACC on StackOverflow - http://stackoverflow.com/questions/tagged/openacc
• OpenACC Toolkit - http://developer.nvidia.com/openacc-toolkit
• Parallel Forall Blog - http://devblogs.nvidia.com/parallelforall/
• GPU Technology Conference - http://www.gputechconf.com/
• OpenACC Website - http://openacc.org/

Questions? Email [email protected] 65

Oct 26: Analyzing and Parallelizing with OpenACC

Course Syllabus Nov 2: OpenACC Optimizations

Nov 9: Advanced OpenACC

Recordings: Questions? Email [email protected]

66
https://developer.nvidia.com/intro-to-openacc-course-2016

Final Exam Basic Arduino Workshop PDF
100% (3)
Final Exam Basic Arduino Workshop PDF
41 pages
Wiley Sec F 2021 MCQ
No ratings yet
Wiley Sec F 2021 MCQ
378 pages
D3.2 Part1 Guidelines Dependability Hazard Analysis
No ratings yet
D3.2 Part1 Guidelines Dependability Hazard Analysis
340 pages
OpenACC Fundamentals
No ratings yet
OpenACC Fundamentals
38 pages
Module 5
No ratings yet
Module 5
30 pages
Module 4
No ratings yet
Module 4
40 pages
OpenACC 3
No ratings yet
OpenACC 3
23 pages
OpenACC 2017spring
No ratings yet
OpenACC 2017spring
46 pages
Introduction To OpenACC Course 20161026 1550 1
No ratings yet
Introduction To OpenACC Course 20161026 1550 1
68 pages
Web GPU
0% (1)
Web GPU
40 pages
09 ParallelizationRecap PDF
No ratings yet
09 ParallelizationRecap PDF
62 pages
Unit-6 Concurrent and Parallel Programming:: C++ AMP (Accelerated Massive Programming)
No ratings yet
Unit-6 Concurrent and Parallel Programming:: C++ AMP (Accelerated Massive Programming)
21 pages
S3076 Getting Started With OpenACC
No ratings yet
S3076 Getting Started With OpenACC
58 pages
Directives Tips For Fortran
No ratings yet
Directives Tips For Fortran
15 pages
Compilers: Tools For Scientists and Engineers
No ratings yet
Compilers: Tools For Scientists and Engineers
42 pages
Openacc Online Course: Lecture 1: Introduction To Openacc
No ratings yet
Openacc Online Course: Lecture 1: Introduction To Openacc
47 pages
OpenACC Programming Guide 0 0
No ratings yet
OpenACC Programming Guide 0 0
73 pages
Unit 2 Basic Optimization Techniques For Serial Code
No ratings yet
Unit 2 Basic Optimization Techniques For Serial Code
31 pages
OpenACC Advanced Fixed
No ratings yet
OpenACC Advanced Fixed
53 pages
4 1 MWagner GPU Volta
No ratings yet
4 1 MWagner GPU Volta
36 pages
HPC Unit 5 B
No ratings yet
HPC Unit 5 B
31 pages
Icl Utk 1031 2017
No ratings yet
Icl Utk 1031 2017
45 pages
Coursera Lecture 11.1 OpenACC Intro
No ratings yet
Coursera Lecture 11.1 OpenACC Intro
11 pages
Clase de Progrea 555
No ratings yet
Clase de Progrea 555
35 pages
CISCO PACKET TRACER LABS: Best practice of configuring or troubleshooting Network
From Everand
CISCO PACKET TRACER LABS: Best practice of configuring or troubleshooting Network
Mulayam Singh
No ratings yet
Modeling A Non-Uniform Memory Access Architecture For Optimizing
No ratings yet
Modeling A Non-Uniform Memory Access Architecture For Optimizing
79 pages
PDC Lecture 04
No ratings yet
PDC Lecture 04
44 pages
Migdalskiy Sergiy Physics Optimization Strategies
No ratings yet
Migdalskiy Sergiy Physics Optimization Strategies
104 pages
Design of Parallel Algorithm'S: Faculty Guide: Group Members
No ratings yet
Design of Parallel Algorithm'S: Faculty Guide: Group Members
49 pages
Unit Ii Program Design and Analysis: - Software Components. - Representations of Programs. - Assembly and Linking
No ratings yet
Unit Ii Program Design and Analysis: - Software Components. - Representations of Programs. - Assembly and Linking
60 pages
Data-Oriented Design and C++ - Mike Acton - CppCon 2014
No ratings yet
Data-Oriented Design and C++ - Mike Acton - CppCon 2014
201 pages
PLC: Programmable Logic Controller – Arktika.: EXPERIMENTAL PRODUCT BASED ON CPLD.
From Everand
PLC: Programmable Logic Controller – Arktika.: EXPERIMENTAL PRODUCT BASED ON CPLD.
Franco Mario
No ratings yet
PS3 Programming Basics: Week 1. SIMD Programming On PPE Materials Are Adapted From The Textbook
No ratings yet
PS3 Programming Basics: Week 1. SIMD Programming On PPE Materials Are Adapted From The Textbook
37 pages
.Trashed-1650000204-Hpc Prac Exam
No ratings yet
.Trashed-1650000204-Hpc Prac Exam
5 pages
Optimization Techniques Code Optimizations
No ratings yet
Optimization Techniques Code Optimizations
10 pages
OpenMP Workshop Day 3
No ratings yet
OpenMP Workshop Day 3
116 pages
Module 2
No ratings yet
Module 2
50 pages
Lecture17 12
No ratings yet
Lecture17 12
86 pages
CDU5
No ratings yet
CDU5
15 pages
PC Engine / TurboGrafx-16 Architecture: Architecture of Consoles: A Practical Analysis, #16
From Everand
PC Engine / TurboGrafx-16 Architecture: Architecture of Consoles: A Practical Analysis, #16
Rodrigo Copetti
No ratings yet
A Data Structure Optimizing Compiler For tUPL
No ratings yet
A Data Structure Optimizing Compiler For tUPL
102 pages
OpenACC Princeton Bootcamp PDF
No ratings yet
OpenACC Princeton Bootcamp PDF
51 pages
CS 294-73 Software Engineering For Scientific Computing Lecture 14: Development For Performance
No ratings yet
CS 294-73 Software Engineering For Scientific Computing Lecture 14: Development For Performance
40 pages
Written Asst2
No ratings yet
Written Asst2
27 pages
2 DataflowAnalysis
No ratings yet
2 DataflowAnalysis
49 pages
Class 7
No ratings yet
Class 7
36 pages
GPUMod 2
No ratings yet
GPUMod 2
64 pages
High Speed electronics-UoH - 4-Vivado-Presentation
No ratings yet
High Speed electronics-UoH - 4-Vivado-Presentation
66 pages
Cuda Mode Lecture2
No ratings yet
Cuda Mode Lecture2
33 pages
C Programming for the Pc the Mac and the Arduino Microcontroller System
From Everand
C Programming for the Pc the Mac and the Arduino Microcontroller System
Peter D Minns
No ratings yet
Clase01 - Introducción Al Paralelismo
No ratings yet
Clase01 - Introducción Al Paralelismo
30 pages
Clase01 - Introducción Al Paralelismo
No ratings yet
Clase01 - Introducción Al Paralelismo
30 pages
CUDA Optimization Fundamentals
No ratings yet
CUDA Optimization Fundamentals
150 pages
C++ AMP Errata: Chapter 1: Vectorization (Page 8)
No ratings yet
C++ AMP Errata: Chapter 1: Vectorization (Page 8)
3 pages
KokkosTutorial ORNL20
No ratings yet
KokkosTutorial ORNL20
322 pages
Part 1 - Lecture 3 - Parallel Software-1
No ratings yet
Part 1 - Lecture 3 - Parallel Software-1
45 pages
Addressing Modes
No ratings yet
Addressing Modes
4 pages
Aph Data Structures
No ratings yet
Aph Data Structures
72 pages
OpenACC 2
No ratings yet
OpenACC 2
44 pages
Assignment 3 - COMP2129
No ratings yet
Assignment 3 - COMP2129
4 pages
Introduction To CUDA
No ratings yet
Introduction To CUDA
51 pages
Parallel Programming Models: Sathish Vadhiyar
No ratings yet
Parallel Programming Models: Sathish Vadhiyar
32 pages
MIT6 172F09 Lec02
No ratings yet
MIT6 172F09 Lec02
85 pages
Loops
No ratings yet
Loops
26 pages
HPCG sc19
No ratings yet
HPCG sc19
2 pages
Evolving Genechip Correlation Predictors On Parallel Graphics Hardware
No ratings yet
Evolving Genechip Correlation Predictors On Parallel Graphics Hardware
6 pages
Storti Foundation: Haematologica
No ratings yet
Storti Foundation: Haematologica
5 pages
Redux: #React Notes
No ratings yet
Redux: #React Notes
24 pages
Cs301 Solved Subjective Final Term by Junaid
No ratings yet
Cs301 Solved Subjective Final Term by Junaid
39 pages
Art Models 5 Life Nude Photos For The Visual Arts (Art Models Series) (Johnson, Maureen Douglas Johnson) (Z-Library)
No ratings yet
Art Models 5 Life Nude Photos For The Visual Arts (Art Models Series) (Johnson, Maureen Douglas Johnson) (Z-Library)
231 pages
Ol8 Relnotes8 8
No ratings yet
Ol8 Relnotes8 8
111 pages
Introduction To Requirement Engineering Requirements:: 1. Milk
0% (1)
Introduction To Requirement Engineering Requirements:: 1. Milk
2 pages
DS-2CD3121G0-I 2 MP IR Fixed Network Dome Camera: Key Features
No ratings yet
DS-2CD3121G0-I 2 MP IR Fixed Network Dome Camera: Key Features
4 pages
Difference Between SAP Memory and ABAP Memory: Answers 1
No ratings yet
Difference Between SAP Memory and ABAP Memory: Answers 1
2 pages
Understanding Windows Server Administration - Level 100 - Document
No ratings yet
Understanding Windows Server Administration - Level 100 - Document
137 pages
Chapter 4: Automating Active Directory Domain Services Administration
No ratings yet
Chapter 4: Automating Active Directory Domain Services Administration
19 pages
Eller Santé® USB C Hub Hyperion 9IN1 Type C Hub
No ratings yet
Eller Santé® USB C Hub Hyperion 9IN1 Type C Hub
1 page
Getting Started nRF5SDK Ses
No ratings yet
Getting Started nRF5SDK Ses
39 pages
Project Lab
No ratings yet
Project Lab
2 pages
Zeus - Technical Specification - Eng - v101 - PP PDF
No ratings yet
Zeus - Technical Specification - Eng - v101 - PP PDF
70 pages
DAA Lab - Practical No. 01 - AI&DS
No ratings yet
DAA Lab - Practical No. 01 - AI&DS
4 pages
TADM10 - 2 Test Questions
No ratings yet
TADM10 - 2 Test Questions
7 pages
Cisco UCS C240 M4 Server Installation and Service Guide - GPU Card Installation (Cisco UCS C-Series Rack Servers) - Cisco
No ratings yet
Cisco UCS C240 M4 Server Installation and Service Guide - GPU Card Installation (Cisco UCS C-Series Rack Servers) - Cisco
28 pages
Add Games Bios Files To Batocera
No ratings yet
Add Games Bios Files To Batocera
25 pages
ML-2 Quick Start Guide 4189340603 UK
No ratings yet
ML-2 Quick Start Guide 4189340603 UK
23 pages
h14385 Introduction Vnxe1600 WP
No ratings yet
h14385 Introduction Vnxe1600 WP
27 pages
Mcs 023
No ratings yet
Mcs 023
261 pages
Sara T Chandra
No ratings yet
Sara T Chandra
3 pages
CN Lab Manual
75% (4)
CN Lab Manual
34 pages
ChemSpider The Free Chemistry Database F PDF
No ratings yet
ChemSpider The Free Chemistry Database F PDF
42 pages
DSEG8680-Software-Manual DEEP SEA G8 Series
No ratings yet
DSEG8680-Software-Manual DEEP SEA G8 Series
128 pages
MLISc GuessPaper Full ModelAnswers
No ratings yet
MLISc GuessPaper Full ModelAnswers
4 pages
CyberAces Module1-Windows 7 Registry
No ratings yet
CyberAces Module1-Windows 7 Registry
13 pages

Introduction To OpenACC Course 20161102 1530 1

Uploaded by

Introduction To OpenACC Course 20161102 1530 1

Uploaded by

INTRODUCTION TO OPENACC

Lecture 2: OpenACC Optimizations, November 2, 2016

Enable you to accelerate your applications

Course Syllabus Nov 2: OpenACC Optimizations

Case Study: Conjugate Gradient

A sample code implementing the conjugate

You do not need to understand the algorithm to

Obtain a performance profile

Insert OpenACC directives

Source: PGI 16.9, Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz 12

Get new performance data from

Remove the “managed” PGCC-S-0155-Compiler failed to translate accelerator

Now the compiler aborts 12, Accelerator kernel generated

matvec function Generating reduction(+:sum)

create ( list ) Allocates memory on GPU but does not copy.

delete( list ) Deallocate memory on the GPU without copying. (Unstructured

Compiler sometimes cannot determine size of arrays

#pragma acc data

#pragma acc enter data copyin(a)

Note: Update “host” has been deprecated and renamed “self”

void initialize_vector(vector &v,double val)

Update device : CPU -> GPU

Now let’s look at how our iterations get mapped to hardware.

gang, worker, and vector can be added to a loop clause

#pragma acc parallel loop gang #pragma acc parallel vector_length(32)

Recall: GPU’s tolerate

PGPROF will guide you to

Occupancy is a measure of how

100% occupancy means the GPU is

Occupancy is a measure of how

100% occupancy means the GPU is

We’re only keeping the GPU 25%

100% occupancy means the GPU is

We’re only keeping the GPU 25%

We need 64 threadblocks to get

100% occupancy means the GPU is

We’re only keeping the GPU 25%

We need 64 threadblocks to get

We’ve reduced our vector length

100% occupancy means the GPU is

We’re only keeping the GPU 25%

We need 64 threadblocks to get

We’ve reduced our vector length

So we need at least 4X more

Source: PGI 16.9, NVIDIA Tesla K80 52

Source: PGI 16.9, Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz 53

Use device_type to #pragma acc parallel loop \

#pragma acc parallel loop \ #pragma acc parallel loop

Collapse outer loops to enable creating more gangs.

#pragma acc loop tile(4,4)

for(i=0; i<N; i++) for(i=0; i<N; i++)

for(i=0; i<N; i++) for(i=0; i<N; i++)

1. Create an account with NVIDIA qwikLABS

1. Attend live lectures Download link:

NEW OPENACC BOOK

Available starting Nov 1st, 2016:

Questions? Email [email protected] 65

Course Syllabus Nov 2: OpenACC Optimizations

Recordings: Questions? Email [email protected]

You might also like