Openacc Online Course: Lecture 1: Introduction To Openacc

The document provides an introduction to the OpenACC online course, including: - The course objective is to enable attendees to start accelerating applications with OpenACC. - The course syllabus covers an introduction to OpenACC on October 19th and GPU programming and optimizing with OpenACC on subsequent dates. - OpenACC is a directive-based approach for performance and portability on CPUs and accelerators using compiler directives to generate parallel code.

OPENACC ONLINE COURSE

Lecture 1: Introduction to OpenACC

John Urbanic, Pittsburgh Supercomputing Center


October 19, 2017
COURSE OBJECTIVE:

Enable you to start accelerating your applications with OpenACC.
COURSE SYLLABUS:

October 19: Introduction to OpenACC


October 26: GPU Programming with OpenACC
November 2: Optimizing and Best Practices for OpenACC

Questions? Email [email protected]


WHAT IS OPENACC
OpenACC is a directives-based programming approach to parallel computing designed for performance and portability on CPUs and accelerators for HPC.

Add Simple Compiler Directive

main()
{
  <serial code>
  #pragma acc kernels
  {
    <parallel code>
  }
}


OPENACC DIRECTIVES
1. Simple compiler hints from programmer
2. Compiler generates parallel threaded code
3. Ignorant compiler just sees some comments.

int main(){

  <sequential code>

  #pragma acc kernels    // Compiler Directive
  {
    <parallel code>
  }

}

[Diagram labels: CPU, Parallel Hardware]

• Incremental
• Single Source
• Low Learning Curve

More on this later!
SINGLE CODE FOR MULTIPLE PLATFORMS
OpenACC - Performance Portable Programming Model for HPC

Platforms listed: OpenPOWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU, PEZY-SC

[Figure: AWE Hydrodynamics CloverLeaf mini-App, bm32 data set. Bar chart of speedup vs. single Haswell core (axis 0x to 80x) for PGI OpenACC, Intel OpenMP, and IBM OpenMP on Dual Haswell, Dual Broadwell, Dual POWER8, 1 Tesla P100, and 1 Tesla V100. Bar labels: 9x, 10x, 10x, 11x, 11x, 9x, 52x, 77x.]

Systems: Haswell: 2x16 core Haswell server, four K80s, CentOS 7.2 (perf-hsw10), Broadwell: 2x20 core Broadwell server, eight P100s (dgx1-prd-01), Minsky: POWER8+NVLINK, four P100s, RHEL 7.3 (gsn1).
Compilers: Intel 17.0, IBM XL 13.1.3, PGI 16.10, KNL: Compiler version: 17.0.1 20161005
Benchmark: CloverLeaf v1.3 downloaded from http://uk-mac.github.io/CloverLeaf the week of November 7 2016; CloverLeaf_Serial; CloverLeaf_ref (MPI+OpenMP); CloverLeaf_OpenACC (MPI+OpenACC)
Data compiled by PGI November 2016, Volta data collected June 2017
TOP HPC APPS ADOPTING OPENACC
ANSYS Fluent ● Gaussian ● VASP ● GTC ● XGC ● ACME ● FLASH ● LSDalton ● COSMO ● ELEPHANT ● RAMSES ● ICON ● ORB5

ANSYS Fluent
ANSYS Fluent R18.0 Radiation Solver: 5.15X speedup
[Figure: time (s), axis 0 to 30000, vs. CPU cores (T4, T8, T14, T28) for Fluent Native Solver and Fluent HTC Solver K80 GPU.]
CPU: (Haswell EP) Intel(R) Xeon(R) CPU E5-2695 v3 @2.30GHz, 2 sockets, 28 cores
GPU: Tesla K80 12+12 GB, Driver 346.46

Gaussian 16
Valinomycin wB97xD/6-311+(2d,p) Freq: 2.25X speedup
Hardware: HPE server with dual Intel Xeon E5-2698 v3 CPUs (2.30GHz; 16 cores/chip), 256GB memory and 4 Tesla K80 dual GPU boards (boost clocks: MEM 2505 and SM 875).
Gaussian source code compiled with PGI Accelerator Compilers (16.5) with OpenACC (2.5 standard).
FAMILIAR TO OPENMP PROGRAMMERS

OpenMP (CPU)

main() {
  double pi = 0.0; long i;

  #pragma omp parallel for reduction(+:pi)
  for (i=0; i<N; i++)
  {
    double t = (double)((i+0.05)/N);
    pi += 4.0/(1.0+t*t);
  }

  printf("pi = %f\n", pi/N);
}

OpenACC (CPU, Parallel Hardware)

main() {
  double pi = 0.0; long i;

  #pragma acc kernels
  for (i=0; i<N; i++)
  {
    double t = (double)((i+0.05)/N);
    pi += 4.0/(1.0+t*t);
  }

  printf("pi = %f\n", pi/N);
}

More on this later!


3 WAYS TO ACCELERATE APPLICATIONS

Applications

Libraries           Compiler Directives    Programming Languages
Easy to use         Easy to use            Most Performance
Most Performance    Portable code          Most Flexibility
                                           (CUDA, OpenCL)
OPENACC: KEY ADVANTAGES
• High-level. Minimal modifications to the code. Less than with OpenCL, CUDA, etc. Non-GPU programmers can play along.

• Single source. No GPU-specific code. Compile the same program for accelerators or serial.

• Efficient. Experience shows very favorable comparison to low-level implementations of same algorithms.

• Performance portable. Supports CPUs, GPU accelerators and co-processors from multiple vendors, current and future versions.

• Incremental. Developers can port and tune parts of their application as resources and profiling dictate. No wholesale rewrite required, so porting can be quick.
TRUE OPEN STANDARD
• Full OpenACC 1.0 and 2.0 and now 2.5 specifications available at OpenACC.org: http://www.openacc.org
• Quick reference card also available and useful: https://www.openacc.org/resources
• Implementations available now from PGI, Cray, and GCC: https://www.openacc.org/tools
• GCC version of OpenACC in 5.x and 6.x, but use 7.x: https://www.openacc.org/tools
• Best free option is very probably PGI Community Edition: http://www.pgroup.com/products/community.htm
• LSDalton (Quantum Chemistry, Aarhus University): 12X speedup, 1 week
• PowerGrid (Medical Imaging, University of Illinois): 40 days to 2 hours
• COSMO (Weather and Climate, MeteoSwiss, CSCS): 4X speedup, 3X energy efficiency
• INCOMP3D (CFD, NC State University): 4X speedup
• NekCEM (Comp Electromagnetics, Argonne National Lab): 2.5X speedup, 60% less energy
• MAESTRO, CASTRO (Astrophysics, Stony Brook University): 4.4X speedup, 4 weeks effort
• CloverLeaf (Comp Hydrodynamics, AWE): 4X speedup, single CPU/GPU code
• FINE/Turbo (CFD, NUMECA International): 10X faster routines, 2X faster app
OPENACC DIRECTIVES
A SIMPLE EXAMPLE: SAXPY

SAXPY in C

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
#pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

...
// Somewhere in main
// call SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

SAXPY in Fortran

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
!$acc end kernels
end subroutine saxpy

...
! From main program
! call SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...
KERNELS: OUR FIRST OPENACC DIRECTIVE
We request that each loop execute as a separate kernel on the GPU. This is an incredibly powerful directive.

Kernel: A parallel routine to run on the parallel hardware

!$acc kernels
  do i=1,n          ! kernel 1
    a(i) = 0.0
    b(i) = 1.0
    c(i) = 2.0
  end do

  do i=1,n          ! kernel 2
    a(i) = b(i) + c(i)
  end do
!$acc end kernels
GENERAL DIRECTIVE SYNTAX AND SCOPE

C

#pragma acc kernels [clause …]
{
  structured block
}

Fortran

!$acc kernels [clause …]
  structured block
!$acc end kernels

I may indent the directives at the natural code indentation level for readability. It is a common practice to always start them in the first column (à la #define/#ifdef). Either is fine with C or Fortran 90 compilers.
COMPLETE SAXPY EXAMPLE CODE

#include <stdlib.h>

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)   // "I promise y is not aliased by anything else (esp. x)"
{
#pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
  int N = 1<<20; // 1 million floats

  if (argc > 1)
    N = atoi(argv[1]);

  float *x = (float*)malloc(N * sizeof(float));
  float *y = (float*)malloc(N * sizeof(float));

  for (int i = 0; i < N; ++i) {
    x[i] = 2.0f;
    y[i] = 1.0f;
  }

  saxpy(N, 3.0f, x, y);

  return 0;
}
C DETAIL: THE “RESTRICT” KEYWORD
• Standard C (as of C99).
• Important for optimization of serial as well as OpenACC and OpenMP code.
• Promise given by the programmer to the compiler for a pointer: float *restrict ptr
  Meaning: “for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points”
• Limits the effects of pointer aliasing
• OpenACC compilers often require restrict to determine independence
  • Otherwise the compiler can’t parallelize loops that access ptr
  • Note: if programmer violates the declaration, behavior is undefined
COMPILE AND RUN
• C: pgcc -acc saxpy.c
• Fortran: pgf90 -acc saxpy.f90
• Run: a.out

-acc will enable OpenACC directives. Default here targets serial CPU and GPU.
-ta=tesla will only target a GPU
-ta=multicore will only target a multicore CPU
-Minfo=accel turns on helpful compiler reporting

Compiler Output

pgcc -acc -Minfo=accel -ta=tesla saxpy.c
saxpy:
      8, Generating copyin(x[:n-1])
         Generating copy(y[:n-1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
             CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy
COMPARE: OPENACC AND CUDA IMPLEMENTATIONS

OpenACC: Complete SAXPY Example Code

#include <stdlib.h>

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
#pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
  int N = 1<<20; // 1 million floats
  if (argc > 1)
    N = atoi(argv[1]);

  float *x = (float*)malloc(N * sizeof(float));
  float *y = (float*)malloc(N * sizeof(float));

  for (int i = 0; i < N; ++i) {
    x[i] = 2.0f;
    y[i] = 1.0f;
  }
  saxpy(N, 3.0f, x, y);

  return 0;
}

CUDA: Partial CUDA C SAXPY Code

__global__ void saxpy_kernel( float a, float* x, float* y, int n ){
  int i;
  i = blockIdx.x*blockDim.x + threadIdx.x;
  if( i < n ) x[i] = a*x[i] + y[i];
}

void saxpy( float a, float* x, float* y, int n ){
  float *xd, *yd;
  cudaMalloc( (void**)&xd, n*sizeof(float) );
  cudaMalloc( (void**)&yd, n*sizeof(float) );
  cudaMemcpy( xd, x, n*sizeof(float), cudaMemcpyHostToDevice );
  cudaMemcpy( yd, y, n*sizeof(float), cudaMemcpyHostToDevice );
  saxpy_kernel<<< (n+31)/32, 32 >>>( a, xd, yd, n );
  cudaMemcpy( x, xd, n*sizeof(float), cudaMemcpyDeviceToHost );
  cudaFree( xd ); cudaFree( yd );
}

CUDA Fortran: SAXPY Example Code

module kmod
  use cudafor
contains
  attributes(global) subroutine saxpy_kernel(A,X,Y,N)
    real(4), device :: A, X(N), Y(N)
    integer, value :: N
    integer :: i
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    if( i <= N ) X(i) = A*X(i) + Y(i)
  end subroutine
end module

subroutine saxpy( A, X, Y, N )
  use kmod
  real(4) :: A, X(N), Y(N)
  integer :: N
  real(4), device, allocatable, dimension(:):: &
           Xd, Yd
  allocate( Xd(N), Yd(N) )
  Xd = X(1:N)
  Yd = Y(1:N)
  call saxpy_kernel<<<(N+31)/32,32>>>(A, Xd, Yd, N)
  X(1:N) = Xd
  deallocate( Xd, Yd )
end subroutine
BIG DIFFERENCE!
OpenACC vs CUDA implementations

• CUDA: Hard to Maintain, OpenACC: Easy to Maintain


With CUDA, we changed the structure of the old code. Non-CUDA programmers
can’t understand new code. It is not even ANSI standard code.

• CUDA: Rewrite Original Code, OpenACC: Augment Original Code


We have separate sections for the host code and the GPU code. Different flow of
code. Serial path now gone forever.

• CUDA: Optimized for Specific Hardware, OpenACC: One Source Everywhere


Where did these “32”s and other mystery numbers come from? This is a clue that we
have some hardware details to deal with here.

• CUDA: Assembler-like Programming, OpenACC: Relies on Compiler


Exact same situation as assembly used to be. How much hand-assembled code is
still being written in HPC now that compilers have gotten so efficient?
THIS LOOKS EASY! TOO EASY…
Questions:
1. If it is this simple, why don’t we just throw kernels in front of every loop?
2. Better yet, why doesn’t the compiler do this for me?
Answers:
There are two general issues that prevent the compiler from being able to just
automatically parallelize every loop:
1. Data Dependencies in Loops
2. Data Movement
The compiler needs your higher level perspective (in the form of directive hints) to get
correct results and reasonable performance
DATA DEPENDENCIES
Most directive-based parallelization consists of splitting up big do/for loops into independent chunks that the many processors can work on simultaneously.
Take, for example, a simple for loop like this:

for(index=0; index<1000000; index++)
  Array[index] = 4 * Array[index];

When run on 1000 processors, it will execute something like this…


NO DATA DEPENDENCIES
A run on 1000 processors for the loop below

for(index=0; index<1000000; index++)
  Array[index] = 4 * Array[index];

Processor 1
for(index=0; index<1000; index++)
  Array[index] = 4*Array[index];

Processor 2
for(index=1000; index<2000; index++)
  Array[index] = 4*Array[index];

Processor 3
for(index=2000; index<3000; index++)
  Array[index] = 4*Array[index];

Processor 4
for(index=3000; index<4000; index++)
  Array[index] = 4*Array[index];

Processor 5
for(index=4000; index<5000; index++)
  Array[index] = 4*Array[index];

….
WITH DATA DEPENDENCIES
But what if the loops are not entirely independent?
Take, for example, a similar loop like this:

for(index=1; index<1000000; index++)
  Array[index] = 4 * Array[index] - Array[index-1];

The Array[index-1] term is the added data dependency.

This is a perfectly valid serial code.


WITH DATA DEPENDENCIES

Processor 1
for(index=1; index<1000; index++)
  Array[index] = 4*Array[index] - Array[index-1];

Processor 2
for(index=1000; index<2000; index++)
  Array[index] = 4*Array[index] - Array[index-1];

….

The first iteration on Processor 2 is:

Array[1000] = 4 * Array[1000] - Array[999];

Array[999] is a result from Processor 1, so this needs the result of Processor 1’s last iteration.

If we want the correct (“same as serial”) result, we need to wait until processor 1 finishes. Likewise for processors 3, 4, …
DATA DEPENDENCIES
If the compiler even suspects that there is a data dependency, it will, for the sake of
correctness, refuse to parallelize that loop.

11, Loop carried dependence of 'Array' prevents parallelization

Loop carried backward dependence of 'Array' prevents vectorization

As large, complex loops are quite common in HPC, especially around the most
important parts of your code, the compiler will often balk most when you most need
a kernel to be generated. What can you do?
HOW TO MANAGE DATA DEPENDENCIES
• Rearrange your code to make it more obvious to the compiler that there is not
really a data dependency.
• Eliminate a real dependency by changing your code.
• There is a common bag of tricks developed for this as this issue goes back 40
years in HPC. Many are quite trivial to apply.
• The compilers have gradually been learning these themselves.
• Override the compiler’s judgment (independent clause) at the risk of invalid
results. Misuse of restrict has similar consequences.
EXERCISES
FOUNDATION EXERCISE: LAPLACE SOLVER
• I’ve been using this for MPI, OpenMP and now OpenACC. It is a great simulation
problem, not rigged for OpenACC.
• In this most basic form, it solves the Laplace equation: $\nabla^2 f(x, y) = 0$
• The Laplace Equation applies to many physical problems, including: electrostatics, fluid flow, and temperature
• For temperature, it is the Steady State Heat Equation:

[Figure: two panels, "Initial Conditions" and "Final Steady State", each showing a metal plate; the initial-conditions panel also shows a heating element.]
EXERCISE FOUNDATION: JACOBI ITERATION
• The Laplace equation on a grid states that each grid point is the average of its neighbors.
• We can iteratively converge to that state by repeatedly computing new values at
each point from the average of neighboring points.
• We just keep doing this until the difference from one pass to the next is small
enough for us to tolerate.

$$A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$$

Stencil:

              A(i,j+1)
A(i-1,j)      A(i,j)      A(i+1,j)
              A(i,j-1)
SERIAL CODE IMPLEMENTATION

C

for(i = 1; i <= ROWS; i++) {
    for(j = 1; j <= COLUMNS; j++) {
        Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                    Temperature_last[i][j+1] + Temperature_last[i][j-1]);
    }
}

Fortran

do j=1,columns
   do i=1,rows
      temperature(i,j)= 0.25 * (temperature_last(i+1,j)+temperature_last(i-1,j) + &
                                temperature_last(i,j+1)+temperature_last(i,j-1) )
   enddo
enddo
SERIAL C CODE (KERNEL)

while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {   // Done?

    // Calculate
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);
        }
    }

    dt = 0.0;

    // Update temp array and find max change
    for(i = 1; i <= ROWS; i++){
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
            Temperature_last[i][j] = Temperature[i][j];
        }
    }

    // Output
    if((iteration % 100) == 0) {
        track_progress(iteration);
    }

    iteration++;
}
SERIAL C CODE SUBROUTINES

void initialize(){

    int i,j;

    for(i = 0; i <= ROWS+1; i++){
        for (j = 0; j <= COLUMNS+1; j++){
            Temperature_last[i][j] = 0.0;
        }
    }

    // these boundary conditions never change throughout run

    // set left side to 0 and right to a linear increase
    for(i = 0; i <= ROWS+1; i++) {
        Temperature_last[i][0] = 0.0;
        Temperature_last[i][COLUMNS+1] = (100.0/ROWS)*i;
    }

    // set top to 0 and bottom to linear increase
    for(j = 0; j <= COLUMNS+1; j++) {
        Temperature_last[0][j] = 0.0;
        Temperature_last[ROWS+1][j] = (100.0/COLUMNS)*j;
    }
}

void track_progress(int iteration) {

    int i;

    printf("-- Iteration: %d --\n", iteration);

    for(i = ROWS-5; i <= ROWS; i++) {
        printf("[%d,%d]: %5.2f ", i, i,Temperature[i][i]);
    }
    printf("\n");
}

BCs could run from 0 to ROWS+1 or from 1 to ROWS. We chose the former.
WHOLE C CODE

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/time.h>

// size of plate
#define COLUMNS 1000
#define ROWS 1000

// largest permitted change in temp (This value takes about 3400 steps)
#define MAX_TEMP_ERROR 0.01

double Temperature[ROWS+2][COLUMNS+2];      // temperature grid
double Temperature_last[ROWS+2][COLUMNS+2]; // temperature grid from last iteration

// helper routines
void initialize();
void track_progress(int iter);

int main(int argc, char *argv[]) {

    int i, j;                                            // grid indexes
    int max_iterations;                                  // number of iterations
    int iteration=1;                                     // current iteration
    double dt=100;                                       // largest change in t
    struct timeval start_time, stop_time, elapsed_time;  // timers

    printf("Maximum iterations [100-4000]?\n");
    scanf("%d", &max_iterations);

    gettimeofday(&start_time,NULL); // Unix timer

    initialize();                   // initialize Temp_last including boundary conditions

    // do until error is minimal or until max steps
    while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

        // main calculation: average my four neighbors
        for(i = 1; i <= ROWS; i++) {
            for(j = 1; j <= COLUMNS; j++) {
                Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                            Temperature_last[i][j+1] + Temperature_last[i][j-1]);
            }
        }

        dt = 0.0; // reset largest temperature change

        // copy grid to old grid for next iteration and find latest dt
        for(i = 1; i <= ROWS; i++){
            for(j = 1; j <= COLUMNS; j++){
                dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
                Temperature_last[i][j] = Temperature[i][j];
            }
        }

        // periodically print test values
        if((iteration % 100) == 0) {
            track_progress(iteration);
        }

        iteration++;
    }

    gettimeofday(&stop_time,NULL);
    timersub(&stop_time, &start_time, &elapsed_time); // Unix time subtract routine

    printf("\nMax error at iteration %d was %f\n", iteration-1, dt);
    printf("Total time was %f seconds.\n", elapsed_time.tv_sec+elapsed_time.tv_usec/1000000.0);
}

// initialize plate and boundary conditions
// Temp_last is used to start first iteration
void initialize(){

    int i,j;

    for(i = 0; i <= ROWS+1; i++){
        for (j = 0; j <= COLUMNS+1; j++){
            Temperature_last[i][j] = 0.0;
        }
    }

    // these boundary conditions never change throughout run

    // set left side to 0 and right to a linear increase
    for(i = 0; i <= ROWS+1; i++) {
        Temperature_last[i][0] = 0.0;
        Temperature_last[i][COLUMNS+1] = (100.0/ROWS)*i;
    }

    // set top to 0 and bottom to linear increase
    for(j = 0; j <= COLUMNS+1; j++) {
        Temperature_last[0][j] = 0.0;
        Temperature_last[ROWS+1][j] = (100.0/COLUMNS)*j;
    }
}

// print diagonal in bottom right corner where most action is
void track_progress(int iteration) {

    int i;

    printf("---------- Iteration number: %d ------------\n", iteration);
    for(i = ROWS-5; i <= ROWS; i++) {
        printf("[%d,%d]: %5.2f ", i, i, Temperature[i][i]);
    }
    printf("\n");
}
SERIAL FORTRAN CODE (KERNEL)

do while ( dt > max_temp_error .and. iteration <= max_iterations)   ! Done?

   ! Calculate
   do j=1,columns
      do i=1,rows
         temperature(i,j)=0.25*(temperature_last(i+1,j)+temperature_last(i-1,j)+ &
                                temperature_last(i,j+1)+temperature_last(i,j-1) )
      enddo
   enddo

   dt=0.0

   ! Update temp array and find max change
   do j=1,columns
      do i=1,rows
         dt = max( abs(temperature(i,j) - temperature_last(i,j)), dt )
         temperature_last(i,j) = temperature(i,j)
      enddo
   enddo

   ! Output
   if( mod(iteration,100).eq.0 ) then
      call track_progress(temperature, iteration)
   endif

   iteration = iteration+1

enddo
SERIAL FORTRAN CODE SUBROUTINES

subroutine initialize( temperature_last )
   implicit none

   integer, parameter :: columns=1000
   integer, parameter :: rows=1000
   integer :: i,j

   double precision, dimension(0:rows+1,0:columns+1) :: temperature_last

   temperature_last = 0.0

   !these boundary conditions never change throughout run

   !set left side to 0 and right to linear increase
   do i=0,rows+1
      temperature_last(i,0) = 0.0
      temperature_last(i,columns+1) = (100.0/rows) * i
   enddo

   !set top to 0 and bottom to linear increase
   do j=0,columns+1
      temperature_last(0,j) = 0.0
      temperature_last(rows+1,j) = ((100.0)/columns) * j
   enddo

end subroutine initialize

subroutine track_progress(temperature, iteration)
   implicit none

   integer, parameter :: columns=1000
   integer, parameter :: rows=1000
   integer :: i,iteration

   double precision, dimension(0:rows+1,0:columns+1) :: temperature

   print *, '---------- Iteration number: ', iteration, ' ---------------'
   do i=5,0,-1
      write (*,'("("i4,",",i4,"):",f6.2," ")',advance='no'), &
            rows-i,columns-i,temperature(rows-i,columns-i)
   enddo
   print *
end subroutine track_progress


WHOLE FORTRAN CODE

program serial
   implicit none

   !Size of plate
   integer, parameter :: columns=1000
   integer, parameter :: rows=1000
   double precision, parameter :: max_temp_error=0.01

   integer :: i, j, max_iterations, iteration=1
   double precision :: dt=100.0
   real :: start_time, stop_time

   double precision, dimension(0:rows+1,0:columns+1) :: temperature, temperature_last

   print*, 'Maximum iterations [100-4000]?'
   read*, max_iterations

   call cpu_time(start_time) !Fortran timer

   call initialize(temperature_last)

   !do until error is minimal or until maximum steps
   do while ( dt > max_temp_error .and. iteration <= max_iterations)

      do j=1,columns
         do i=1,rows
            temperature(i,j)=0.25*(temperature_last(i+1,j)+temperature_last(i-1,j)+ &
                                   temperature_last(i,j+1)+temperature_last(i,j-1) )
         enddo
      enddo

      dt=0.0

      !copy grid to old grid for next iteration and find max change
      do j=1,columns
         do i=1,rows
            dt = max( abs(temperature(i,j) - temperature_last(i,j)), dt )
            temperature_last(i,j) = temperature(i,j)
         enddo
      enddo

      !periodically print test values
      if( mod(iteration,100).eq.0 ) then
         call track_progress(temperature, iteration)
      endif

      iteration = iteration+1

   enddo

   call cpu_time(stop_time)

   print*, 'Max error at iteration ', iteration-1, ' was ',dt
   print*, 'Total time was ',stop_time-start_time, ' seconds.'

end program serial

! initialize plate and boundary conditions
! temp_last is used to start first iteration
subroutine initialize( temperature_last )
   implicit none

   integer, parameter :: columns=1000
   integer, parameter :: rows=1000
   integer :: i,j

   double precision, dimension(0:rows+1,0:columns+1) :: temperature_last

   temperature_last = 0.0

   !these boundary conditions never change throughout run

   !set left side to 0 and right to linear increase
   do i=0,rows+1
      temperature_last(i,0) = 0.0
      temperature_last(i,columns+1) = (100.0/rows) * i
   enddo

   !set top to 0 and bottom to linear increase
   do j=0,columns+1
      temperature_last(0,j) = 0.0
      temperature_last(rows+1,j) = ((100.0)/columns) * j
   enddo

end subroutine initialize

!print diagonal in bottom corner where most action is
subroutine track_progress(temperature, iteration)
   implicit none

   integer, parameter :: columns=1000
   integer, parameter :: rows=1000
   integer :: i,iteration

   double precision, dimension(0:rows+1,0:columns+1) :: temperature

   print *, '---------- Iteration number: ', iteration, ' ---------------'
   do i=5,0,-1
      write (*,'("("i4,",",i4,"):",f6.2," ")',advance='no'), &
            rows-i,columns-i,temperature(rows-i,columns-i)
   enddo
   print *
end subroutine track_progress
EXERCISES: GENERAL INSTRUCTIONS

• The exercise is found at https://nvlabs.qwiklab.com
  Create an account or log in (use the same email you registered for the course with)
  Open class: OpenACC Course Oct 2017
• Your objective is to add OpenACC to the serial C or Fortran code to enable it to parallelize
• Hint: look for the significant loops and apply the one tool you have
TODAY WE DISCUSSED

• What OpenACC is used for


• What a Directive is
• The OpenACC kernels directive
• What a dependency is
• How to compile an OpenACC enabled code
NEW BOOK: OPENACC FOR PROGRAMMERS
Edited by: Sunita Chandrasekaran, Guido Juckeland
• Discover how OpenACC makes scalable parallel programming easier and more practical
• Get productive with OpenACC code editors, compilers, debuggers, and performance analysis tools
• Build your first real-world OpenACC programs
• Overcome common performance, portability, and interoperability challenges
• Efficiently distribute tasks across multiple processors
Use code OPENACC during checkout for 35% off*

*Discount taken off list price. Offer only good at informit.com


OPENACC FOR EVERYONE

New PGI Community Edition Now Available (FREE)

PROGRAMMING MODELS: OpenACC, CUDA Fortran, OpenMP, C/C++/Fortran Compilers and Tools
PLATFORMS: X86, OpenPOWER, NVIDIA GPU
UPDATES: 1-2 times a year | 6-9 times a year | 6-9 times a year
SUPPORT: User Forums | PGI Support | PGI Professional Services
LICENSE: Annual | Perpetual | Volume/Site
OPENACC RESOURCES
Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow

Resources: https://www.openacc.org/resources
Success Stories: https://www.openacc.org/success-stories
Compilers and Tools (free compilers): https://www.openacc.org/tools
Events: https://www.openacc.org/events
Slack: https://www.openacc.org/community#slack
XSEDE MONTHLY WORKSHOP SERIES

• NSF Funded for the national scientific community

• Wide Area Classroom (WAC) Format

• Next OpenACC event is November 7th

• Classroom based with 25 remote sites per event

• 50 events, 8000 students and counting

Contact Tom Maiden, [email protected]

