OpenACC Online Course: Lecture 1: Introduction to OpenACC
int main(){
   <sequential code>

   #pragma acc kernels   // Compiler Directive
   {
      <parallel code>
   }
}

Incremental ● Single Source ● Low Learning Curve

More on this later!
SINGLE CODE FOR MULTIPLE PLATFORMS
OpenACC - Performance Portable Programming Model for HPC
AWE Hydrodynamics CloverLeaf mini-App, bm32 data set
[Chart: PGI OpenACC CloverLeaf speedup (0x to 80x scale, peak 77x) across platforms: Dual Haswell, Dual Broadwell, Dual POWER8, PEZY-SC, 1 Tesla P100, 1 Tesla V100]
Systems: Haswell: 2x16-core Haswell server, four K80s, CentOS 7.2 (perf-hsw10); Broadwell: 2x20-core Broadwell server, eight P100s (dgx1-prd-01); Minsky: POWER8+NVLINK, four P100s, RHEL 7.3 (gsn1).
Compilers: Intel 17.0, IBM XL 13.1.3, PGI 16.10; KNL compiler version: 17.0.1 20161005.
Benchmark: CloverLeaf v1.3 downloaded from http://uk-mac.github.io/CloverLeaf the week of November 7, 2016; CloverLeaf_Serial; CloverLeaf_ref (MPI+OpenMP); CloverLeaf_OpenACC (MPI+OpenACC).
Data compiled by PGI November 2016; Volta data collected June 2017.
TOP HPC APPS ADOPTING OPENACC
ANSYS Fluent ● Gaussian ● VASP ● GTC ● XGC ● ACME ● FLASH ● LSDalton ● COSMO ● ELEPHANT ● RAMSES ● ICON ● ORB5
[Performance chart: y-axis 0 to 15000; x-axis CPU (cores) configurations T4, T8, T14, T28]
Hardware: HPE server with dual Intel Xeon E5-2698 v3 CPUs (2.30 GHz; 16 cores/chip), 256 GB memory and 4 Tesla K80 dual-GPU boards (boost clocks: MEM 2505 and SM 875).
CPU: (Haswell EP) Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30 GHz, 2 sockets, 28 cores. GPU: Tesla K80 12+12 GB, Driver 346.46.
Gaussian source code compiled with PGI Accelerator Compilers (16.5) with OpenACC (2.5 standard).
FAMILIAR TO OPENMP PROGRAMMERS
CPU:
main() {
   double pi = 0.0; long i;
   ...

CPU + Parallel Hardware:
main() {
   double pi = 0.0; long i;
   ...
Applications
Libraries | Compiler Directives | Programming Languages
• Single source. No GPU-specific code. Compile the same program for accelerators or serial.
• Incremental. Developers can port and tune parts of their application as resources and profiling dictate. No wholesale rewrite is required, and individual ports can be quick.
TRUE OPEN STANDARD
• Full OpenACC 1.0, 2.0, and now 2.5 specifications are available at OpenACC.org
(http://www.openacc.org)

[OpenACC.org member logos]
NekCEM (Comp Electromagnetics, Argonne National Lab): 2.5X speedup, 60% less energy
MAESTRO & CASTRO (Astrophysics, Stony Brook University): 4.4X speedup, 4 weeks effort
CloverLeaf (Comp Hydrodynamics, AWE): 4X speedup, single CPU/GPU code
FINE/Turbo (CFD, NUMECA International): 10X faster routines, 2X faster app
OPENACC DIRECTIVES
A SIMPLE EXAMPLE: SAXPY
SAXPY in C:
...
// Somewhere in main
// call SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

SAXPY in Fortran:
...
! From main program
! call SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...
KERNELS: OUR FIRST OPENACC DIRECTIVE
We request that each loop execute as a separate kernel on the GPU. This is an
incredibly powerful directive.
!$acc kernels
do i=1,n
   a(i) = 0.0
   b(i) = 1.0          ! kernel 1
   c(i) = 2.0
end do

do i=1,n
   a(i) = b(i) + c(i)  ! kernel 2
end do
!$acc end kernels

Kernel: a parallel routine to run on the parallel hardware.
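The same two-kernel construct in C (a sketch not shown on this slide, assuming float arrays a, b, and c of length n):

#pragma acc kernels
{
    for (int i = 0; i < n; ++i) {
        a[i] = 0.0f;
        b[i] = 1.0f;          /* kernel 1 */
        c[i] = 2.0f;
    }

    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];   /* kernel 2 */
}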
GENERAL DIRECTIVE SYNTAX AND SCOPE
I may indent the directives at the natural code indentation level for readability. It is also a common practice to always start them in the first column (à la #define/#ifdef). Either is fine with C or Fortran 90 compilers.
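As a concrete sketch of the scoping (reusing the SAXPY loop from the earlier slide), in C the directive applies to the structured block that immediately follows it:

#pragma acc kernels   /* applies to the block below */
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

In Fortran the region is delimited explicitly, with !$acc kernels before the block and !$acc end kernels after it, as in the kernels example a few slides back.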
COMPLETE SAXPY EXAMPLE CODE
#include <stdlib.h>

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)   // "I promise y is not aliased by anything else (esp. x)"
{
#pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
  int N = 1<<20; // 1 million floats

  if (argc > 1)
    N = atoi(argv[1]);

  float *x = (float*)malloc(N * sizeof(float));
  float *y = (float*)malloc(N * sizeof(float));

  for (int i = 0; i < N; ++i) {
    x[i] = 2.0f;
    y[i] = 1.0f;
  }

  saxpy(N, 3.0f, x, y);

  return 0;
}
C DETAIL: THE “RESTRICT” KEYWORD
• Standard C (as of C99).
• Important for optimization of serial as well as OpenACC and OpenMP code.
• Promise given by the programmer to the compiler for a pointer: float *restrict ptr
Meaning: “for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to
which it points”
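As a small illustration (a hypothetical function, not from the slides), restrict is what lets the compiler treat loop iterations as independent:

/* With restrict the compiler may assume src and dst never overlap,
   so the iterations below carry no hidden dependency and can be
   vectorized or parallelized safely. */
void scale(int n, float a, const float *restrict src, float *restrict dst)
{
    for (int i = 0; i < n; ++i)
        dst[i] = a * src[i];
}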
• Compile: pgcc -acc -Minfo=accel -ta=tesla saxpy.c
• Run: a.out

-ta=tesla will only target a GPU
-ta=multicore will only target a multicore CPU
-Minfo=accel turns on helpful compiler reporting

Compiler Output:
saxpy:
      8, Generating copyin(x[:n-1])
         Generating copy(y[:n-1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
             CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy
COMPARE: OPENACC AND CUDA IMPLEMENTATIONS
OpenACC: Complete SAXPY Example Code

#include <stdlib.h>

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
#pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
  int N = 1<<20; // 1 million floats
  if (argc > 1)
    N = atoi(argv[1]);

  float *x = (float*)malloc(N * sizeof(float));
  float *y = (float*)malloc(N * sizeof(float));

  for (int i = 0; i < N; ++i) {
    x[i] = 2.0f;
    y[i] = 1.0f;
  }

  saxpy(N, 3.0f, x, y);
  return 0;
}

CUDA: Partial CUDA C SAXPY Code

__global__ void saxpy_kernel( float a, float* x, float* y, int n ){
  int i;
  i = blockIdx.x*blockDim.x + threadIdx.x;
  if( i <= n ) x[i] = a*x[i] + y[i];
}

void saxpy( float a, float* x, float* y, int n ){
  float *xd, *yd;
  cudaMalloc( (void**)&xd, n*sizeof(float) );
  cudaMalloc( (void**)&yd, n*sizeof(float) );
  cudaMemcpy( xd, x, n*sizeof(float), cudaMemcpyHostToDevice );
  cudaMemcpy( yd, y, n*sizeof(float), cudaMemcpyHostToDevice );
  saxpy_kernel<<< (n+31)/32, 32 >>>( a, xd, yd, n );
  cudaMemcpy( x, xd, n*sizeof(float), cudaMemcpyDeviceToHost );
  cudaFree( xd ); cudaFree( yd );
}

CUDA: Partial CUDA Fortran SAXPY Code

module kmod
  use cudafor
contains
  attributes(global) subroutine saxpy_kernel(A,X,Y,N)
    real(4), device :: A, X(N), Y(N)
    integer, value :: N
    integer :: i
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    if( i <= N ) X(i) = A*X(i) + Y(i)
  end subroutine
end module

subroutine saxpy( A, X, Y, N )
  use kmod
  real(4) :: A, X(N), Y(N)
  integer :: N
  real(4), device, allocatable, dimension(:) :: Xd, Yd
  allocate( Xd(N), Yd(N) )
  Xd = X(1:N)
  Yd = Y(1:N)
  call saxpy_kernel<<<(N+31)/32,32>>>(A, Xd, Yd, N)
  X(1:N) = Xd
  deallocate( Xd, Yd )
end subroutine
BIG DIFFERENCE!
OpenACC vs CUDA implementations
for(index=0; index<1000000; index++)
   Array[index] = 4 * Array[index];

The compiler can split these independent iterations across the parallel hardware, for example:

....

Processor 4:
for(index=3000; index<3999; index++)
   Array[index] = 4*Array[index];

Processor 5:
for(index=4000; index<4999; index++)
   Array[index] = 4*Array[index];

....
WITH DATA DEPENDENCIES
But what if the loops are not entirely independent?
Take, for example, a similar loop like this:
for(index=1; index<1000000; index++)
   Array[index] = 4 * Array[index] - Array[index-1];

Processor 1:
for(index=1; index<999; index++)
   Array[index] = 4*Array[index] - Array[index-1];

Processor 2:
for(index=1000; index<1999; index++)
   Array[index] = 4*Array[index] - Array[index-1];

....

Now Processor 2 needs the value of Array[999] that Processor 1 is still computing, so the chunks can no longer run independently.

As large, complex loops are quite common in HPC, especially around the most important parts of your code, the compiler will often balk exactly when you most need a kernel to be generated. What can you do?
HOW TO MANAGE DATA DEPENDENCIES
• Rearrange your code to make it more obvious to the compiler that there is not
really a data dependency.
• Eliminate a real dependency by changing your code.
• There is a common bag of tricks developed for this as this issue goes back 40
years in HPC. Many are quite trivial to apply.
• The compilers have gradually been learning these themselves.
• Override the compiler's judgment (with the independent clause) at the risk of invalid results, as sketched below. Misuse of restrict has similar consequences.
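A minimal sketch of that last option (hypothetical arrays; the indirect idx[] addressing stands in for anything the compiler cannot analyze):

/* Assert that the iterations are independent even though the compiler
   cannot prove it; if the assertion is wrong (e.g. idx repeats a value),
   the results are invalid. */
#pragma acc kernels loop independent
for (int i = 0; i < n; ++i)
    a[idx[i]] = 2.0f * b[i];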
EXERCISES
FOUNDATION EXERCISE: LAPLACE SOLVER
• I’ve been using this for MPI, OpenMP and now OpenACC. It is a great simulation
problem, not rigged for OpenACC.
• In its most basic form, it solves the Laplace equation: $\nabla^2 f(x,y) = 0$
• The Laplace Equation applies to many physical problems, including: electrostatics,
fluid flow, and temperature
• For temperature, it is the Steady State Heat Equation:
[Figure: Initial Conditions — metal plate with a heating element; Final Steady State — metal plate]
EXERCISE FOUNDATION: JACOBI ITERATION
• The Laplace equation on a grid states that each grid point is the average of its neighbors.
• We can iteratively converge to that state by repeatedly computing new values at
each point from the average of neighboring points.
• We just keep doing this until the difference from one pass to the next is small
enough for us to tolerate.
$A_{k+1}(i,j) = \dfrac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$

[Stencil: A(i,j) and its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)]
SERIAL CODE IMPLEMENTATION
C
for(i = 1; i <= ROWS; i++) {
for(j = 1; j <= COLUMNS; j++) {
Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
Temperature_last[i][j+1] + Temperature_last[i][j-1]);
}
}
Fortran
do j=1,columns
do i=1,rows
temperature(i,j)= 0.25 * (temperature_last(i+1,j)+temperature_last(i-1,j) + &
temperature_last(i,j+1)+temperature_last(i,j-1) )
enddo
enddo
SERIAL C CODE (KERNEL)
while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {    /* Done? */

    /* Calculate */
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);
        }
    }

    dt = 0.0;

    /* Update temp array and find max change */
    for(i = 1; i <= ROWS; i++){
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
            Temperature_last[i][j] = Temperature[i][j];
        }
    }

    iteration++;
}
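As a preview of the exercise (a minimal first-step sketch, not the tuned solution), one could simply wrap each pair of nested loops in a kernels region:

while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);
        }
    }

    dt = 0.0;

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++){
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
            Temperature_last[i][j] = Temperature[i][j];
        }
    }

    iteration++;
}

This naive version works, but it copies the temperature arrays between host and device on every iteration; reducing that data movement is the usual next step.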
SERIAL C CODE SUBROUTINES
void initialize(){ ... }

void track_progress(int iteration) { ... }
WHOLE C CODE
#include <sys/time.h>
// size of plate
#define COLUMNS 1000
#define ROWS 1000
// largest permitted change in temp (This value takes about 3400 steps)
#define MAX_TEMP_ERROR 0.01
SERIAL FORTRAN CODE (KERNEL)

! Calculate
do j=1,columns
   do i=1,rows
      temperature(i,j) = 0.25*(temperature_last(i+1,j)+temperature_last(i-1,j)+ &
                               temperature_last(i,j+1)+temperature_last(i,j-1) )
   enddo
enddo

dt=0.0
...
iteration = iteration+1
enddo
SERIAL FORTRAN CODE SUBROUTINES
subroutine initialize( temperature_last )
   implicit none
   ...
   temperature_last = 0.0

   !these boundary conditions never change throughout run

   !set left side to 0 and right to linear increase
   do i=0,rows+1
      temperature_last(i,0) = 0.0
      temperature_last(i,columns+1) = (100.0/rows) * i
   enddo
   ...

subroutine track_progress(temperature, iteration)
   implicit none
   ...
   print *, '---------- Iteration number: ', iteration, ' ---------------'
   do i=5,0,-1
      write (*,'("("i4,",",i4,"):",f6.2," ")',advance='no'), &
            rows-i,columns-i,temperature(rows-i,columns-i)
   enddo
   print *
   ...
WHOLE FORTRAN CODE

!Size of plate
integer, parameter :: columns=1000
integer, parameter :: rows=1000
double precision, parameter :: max_temp_error=0.01
PROGRAMMING MODELS: OpenACC, CUDA Fortran, OpenMP, C/C++/Fortran Compilers and Tools
PLATFORMS: X86, OpenPOWER, NVIDIA GPU
UPDATES: 1-2 times a year | 6-9 times a year | 6-9 times a year
SUPPORT: User Forums | PGI Support | PGI Professional Services
LICENSE: Annual | Perpetual | Volume/Site
OPENACC RESOURCES
Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow
FREE Compilers
https://www.openacc.org/community#slack
XSEDE MONTHLY WORKSHOP SERIES