Module5

Uploaded by

singhguma86
MODULE FIVE:

DATA MANAGEMENT
Dr. Volker Weinberg | LRZ
MODULE OVERVIEW
OpenACC Data Management

 Explicit Data Management


 OpenACC Data Regions and Clauses
 Unstructured Data Lifetimes
 Data Synchronization
EXPLICIT MEMORY MANAGEMENT
Requirements

 Data must be visible on the device when we run our parallel code
 Data must be visible on the host when we run our sequential code
 When the host and device don’t share memory, data movement must occur
 To maximize performance, the programmer should avoid all unnecessary data transfers

[Figure: host with host memory, device with device memory]
EXPLICIT MEMORY MANAGEMENT
Key problems

 Many parallel accelerators (such as GPUs) have a separate memory space from the host
 These separate memories can become out-of-sync and contain completely different data
 Transferring between these two memories can be a very time-consuming process

[Figure: host with host memory, device with device memory]
OPENACC DATA DIRECTIVE
Definition

 The data directive defines a lifetime for data on the device
 During the region data should be thought of as residing on the accelerator
 Data clauses allow the programmer to control the allocation and movement of data

C/C++
#pragma acc data clauses
{
    < Sequential and/or Parallel code >
}

Fortran
!$acc data clauses
    < Sequential and/or Parallel code >
!$acc end data
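
As a minimal sketch of a structured data region, the function below wraps two parallel loops in one data directive. The function name scale_then_shift and its parameters are hypothetical and only serve the illustration; the copy clause is one of the data clauses described in this module. Without an OpenACC compiler the pragmas are ignored and the code runs sequentially with the same result.

```c
#include <stddef.h>

/* Hypothetical helper: multiplies every element of x by factor,
   then adds 1.0f to every element. */
void scale_then_shift(float *x, size_t n, float factor)
{
    /* x is allocated on the device and copied in at the opening brace,
       and copied back to the host at the closing brace. */
    #pragma acc data copy(x[0:n])
    {
        /* Both loops work on the same device copy of x. */
        #pragma acc parallel loop
        for (size_t i = 0; i < n; i++) {
            x[i] *= factor;
        }

        #pragma acc parallel loop
        for (size_t i = 0; i < n; i++) {
            x[i] += 1.0f;
        }
    }
}
```

Because the data region spans both loops, x should not need to travel between host and device in between them.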


DATA CLAUSES
copy( list ) Allocates memory on device and copies data from host to device
when entering region and copies data to the host when exiting region.

Principal use: For many important data structures in your code, this is a
logical default to input, modify and return the data.

copyin( list ) Allocates memory on device and copies data from host to device
when entering region.

Principal use: Think of this like an array that you would use as just an
input to a subroutine.

copyout( list ) Allocates memory on device and copies data to the host when exiting
region.

Principal use: A result that isn’t overwriting the input data structure.

create( list ) Allocates memory on device but does not copy.

Principal use: Temporary arrays.
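
The four clauses can be combined on one data directive. The sketch below is an assumed example, not code from the lab: the function accumulate_squares and all of its parameters are hypothetical. It uses copyin for a read-only input, copy for an array that is read and updated, copyout for a pure result, and create for a scratch array.

```c
#include <stddef.h>

/* Hypothetical example:
     tmp[i] = in[i] * in[i]      (scratch, never needed on the host)
     acc[i] = acc[i] + tmp[i]    (input and output)
     out[i] = 2 * tmp[i]         (output only) */
void accumulate_squares(const float *in, float *acc, float *out,
                        float *tmp, size_t n)
{
    #pragma acc data copyin(in[0:n]) copy(acc[0:n]) \
                     copyout(out[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; i++) {
            tmp[i] = in[i] * in[i];
        }

        #pragma acc parallel loop
        for (size_t i = 0; i < n; i++) {
            acc[i] += tmp[i];
            out[i] = 2.0f * tmp[i];
        }
    }
    /* After the region, acc and out hold the results on the host.
       The host contents of tmp should not be relied on, because
       create never copies it back. */
}
```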


IMPLIED DATA REGIONS
Definition

 Every kernels and parallel region has an implicit data region surrounding it
 This allows data to exist solely for the duration of the region
 All data clauses usable on a data directive can be used on the parallel and kernels directives as well

#pragma acc kernels copyin(a[0:100])
{
    for( int i = 0; i < 100; i++ )
    {
        a[i] = 0;
    }
}
IMPLIED DATA REGIONS
Explicit vs Implicit Data Regions

Explicit
#pragma acc data copyin(a[0:100])
{
    #pragma acc kernels
    {
        for( int i = 0; i < 100; i++ )
        {
            a[i] = 0;
        }
    }
}

Implicit
#pragma acc kernels copyin(a[0:100])
{
    for( int i = 0; i < 100; i++ )
    {
        a[i] = 0;
    }
}

These two codes are functionally the same.
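
A data clause placed directly on a compute directive can be sketched as follows. The function array_sum is a hypothetical helper; the reduction clause is standard OpenACC but is not covered in this module. The array exists on the device only for the duration of the loop.

```c
#include <stddef.h>

/* Hypothetical helper: returns the sum of a[0] .. a[n-1]. */
double array_sum(const double *a, size_t n)
{
    double sum = 0.0;

    /* copyin on the compute directive uses the implicit data region:
       a is allocated and copied in before the loop, and released
       right after it. Nothing is copied back because a is read-only. */
    #pragma acc parallel loop copyin(a[0:n]) reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }

    return sum;
}
```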


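EXPLICIT VS. IMPLICIT DATA REGIONS
Limitation

Explicit (1 data copy)
#pragma acc data copyout(a[0:100])
{
    #pragma acc kernels
    {
        for( int i = 0; i < 100; i++ )
            a[i] = i;
    }

    #pragma acc kernels
    {
        for( int i = 0; i < 100; i++ )
            a[i] = 2 * a[i];
    }
}

Implicit (2 data copies)
#pragma acc kernels copyout(a[0:100])
{
    for( int i = 0; i < 100; i++ )
        a[i] = i;
}

#pragma acc kernels copy(a[0:100])
{
    for( int i = 0; i < 100; i++ )
        a[i] = 2 * a[i];
}

The explicit version will perform better than the implicit version, because a stays on the device
between the two kernels regions instead of being transferred at each one.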
UNSTRUCTURED DATA DIRECTIVES
Enter Data Directive

 Data lifetimes aren’t always neatly structured.
 The enter data directive handles device memory allocation
 You may use either the create or the copyin clause for memory allocation
 The enter data directive is not the start of a data region, because you may have multiple enter data directives

C/C++
#pragma acc enter data clauses
< Sequential and/or Parallel code >
#pragma acc exit data clauses

Fortran
!$acc enter data clauses
< Sequential and/or Parallel code >
!$acc exit data clauses


UNSTRUCTURED DATA DIRECTIVES
Exit Data Directive

 The exit data directive handles device memory deallocation
 You may use either the delete or the copyout clause for memory deallocation
 You should have as many exit data directives for a given array as enter data directives
 These can exist in different functions

C/C++
#pragma acc enter data clauses
< Sequential and/or Parallel code >
#pragma acc exit data clauses

Fortran
!$acc enter data clauses
< Sequential and/or Parallel code >
!$acc exit data clauses


UNSTRUCTURED DATA CLAUSES

copyin ( list ) Allocates memory on device and copies data from host to device on enter data.
copyout ( list ) Copies data from device to host and deallocates memory on device on exit data.
create ( list ) Allocates memory on device without data transfer on enter data.
delete ( list ) Deallocates memory on device without data transfer on exit data.
UNSTRUCTURED DATA DIRECTIVES
Basic Example

#pragma acc parallel loop


for(int i = 0; i < N; i++){
c[i] = a[i] + b[i];
}
UNSTRUCTURED DATA DIRECTIVES
Basic Example

#pragma acc enter data copyin(a[0:N],b[0:N]) create(c[0:N])

#pragma acc parallel loop


for(int i = 0; i < N; i++){
c[i] = a[i] + b[i];
}

#pragma acc exit data copyout(c[0:N])


UNSTRUCTURED DATA DIRECTIVES
Basic Example

#pragma acc enter data copyin(a[0:N],b[0:N]) create(c[0:N])

#pragma acc parallel loop
for(int i = 0; i < N; i++){
    c[i] = a[i] + b[i];
}

#pragma acc exit data copyout(c[0:N])

Action
 Allocate A, B, and C on the device
 Copy A and B from CPU to device
 Execute the loop on the device
 Copy C from device to CPU
 Deallocate C from the device

[Figure: CPU memory holds A, B, and the updated C; device memory still holds A and B]
UNSTRUCTURED DATA DIRECTIVES
Basic Example – proper memory deallocation

#pragma acc enter data copyin(a[0:N],b[0:N]) create(c[0:N])

#pragma acc parallel loop
for(int i = 0; i < N; i++){
    c[i] = a[i] + b[i];
}

#pragma acc exit data copyout(c[0:N]) delete(a,b)

Action
 Deallocate A and B from the device

[Figure: CPU memory holds A, B, and the updated C; device memory is released]
UNSTRUCTURED VS STRUCTURED
With a simple code

Unstructured
 Can have multiple starting/ending points
 Can branch across multiple functions
 Memory exists until explicitly deallocated

#pragma acc enter data copyin(a[0:N],b[0:N]) \
                       create(c[0:N])

#pragma acc parallel loop
for(int i = 0; i < N; i++){
    c[i] = a[i] + b[i];
}

#pragma acc exit data copyout(c[0:N]) \
                      delete(a,b)

Structured
 Must have explicit start/end points
 Must be within a single function
 Memory only exists within the data region

#pragma acc data copyin(a[0:N],b[0:N]) \
                 copyout(c[0:N])
{
    #pragma acc parallel loop
    for(int i = 0; i < N; i++){
        c[i] = a[i] + b[i];
    }
}
UNSTRUCTURED DATA DIRECTIVES
Branching across multiple functions

 In this example enter data and exit data are in different functions
 This allows the programmer to put device allocation/deallocation with the matching host versions
 This pattern is particularly useful in C++, where structured scopes may not be possible.

#include <stdlib.h>  /* malloc, free */

int* allocate_array(int N){
    int* ptr = (int *) malloc(N * sizeof(int));
    #pragma acc enter data create(ptr[0:N])
    return ptr;
}

void deallocate_array(int* ptr){
    #pragma acc exit data delete(ptr)
    free(ptr);
}

int main(){
    int* a = allocate_array(100);
    #pragma acc kernels
    {
        a[0] = 0;
    }
    deallocate_array(a);
}
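
The same pairing of host and device allocation can be sketched for a small container type. Everything here is hypothetical (the vec_t struct and the vec_create, vec_fill, vec_sum, and vec_destroy functions); the present and reduction clauses are standard OpenACC but are not covered in this module. The sketch assumes the device lifetime should match the host lifetime of the buffer.

```c
#include <stdlib.h>

/* Hypothetical container: a host buffer that also lives on the device. */
typedef struct {
    double *data;
    size_t  len;
} vec_t;

/* Allocates the host buffer, then the matching device buffer. */
int vec_create(vec_t *v, size_t len)
{
    double *d = (double *) malloc(len * sizeof(double));
    if (d == NULL) {
        return -1;
    }
    v->data = d;
    v->len = len;
    #pragma acc enter data create(d[0:len])
    return 0;
}

/* Sets every element to value, working on the device copy. */
void vec_fill(vec_t *v, double value)
{
    double *d = v->data;
    size_t n = v->len;
    #pragma acc parallel loop present(d[0:n])
    for (size_t i = 0; i < n; i++) {
        d[i] = value;
    }
}

/* Sums the elements on the device; only the scalar returns to the host. */
double vec_sum(const vec_t *v)
{
    const double *d = v->data;
    size_t n = v->len;
    double sum = 0.0;
    #pragma acc parallel loop present(d[0:n]) reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += d[i];
    }
    return sum;
}

/* Releases the device buffer first, then the host buffer. */
void vec_destroy(vec_t *v)
{
    double *d = v->data;
    #pragma acc exit data delete(d[0:v->len])
    free(d);
    v->data = NULL;
    v->len = 0;
}
```

A caller would use vec_create once, call vec_fill and vec_sum as often as needed, and finish with vec_destroy, so the array data itself never has to be copied between host and device.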
DATA SYNCHRONIZATION
OPENACC UPDATE DIRECTIVE
update: Explicitly transfers data between the host and the device
Useful when you want to synchronize data in the middle of a data region
Clauses:
self: makes host data agree with device data
device: makes device data agree with host data

C/C++
#pragma acc update self(x[0:count])
#pragma acc update device(x[0:count])

Fortran
!$acc update self(x(1:end_index))
!$acc update device(x(1:end_index))
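
A possible use of update self in the middle of a data region is sketched below. The function step_and_record and its parameters are hypothetical. The host wants to look at one element after every step, so only that element is refreshed instead of leaving and re-entering the data region.

```c
#include <stddef.h>

/* Hypothetical example: adds 1 to every element of x, steps times,
   and stores the value of x[0] after each step in history. */
void step_and_record(int *x, size_t count, int steps, int *history)
{
    #pragma acc data copy(x[0:count])
    {
        for (int s = 0; s < steps; s++) {
            #pragma acc parallel loop
            for (size_t i = 0; i < count; i++) {
                x[i] += 1;
            }

            /* The device copy changed; refresh only the element
               that the host is about to read. */
            #pragma acc update self(x[0:1])
            history[s] = x[0];
        }
    }
}
```

Without the update directive, an accelerated build would read a stale host value of x[0] inside the loop, because the copy clause only transfers x back at the end of the data region.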
OPENACC UPDATE DIRECTIVE

The data must exist on both the CPU and device for the update directive to work.

#pragma acc update device(A[0:N])
#pragma acc update self(A[0:N])

[Figure: CPU memory and device memory; update device copies the CPU values to the device, update self copies the device values back to the CPU]
SYNCHRONIZE DATA WITH UPDATE

 Inside the initialize function we alter the host copy of ‘A’
 This means that after calling initialize the host and device copies of ‘A’ are out-of-sync
 We use the update directive with the device clause to update the device copy of ‘A’
 Without the update directive, later compute regions will use incorrect data.

int* allocate_array(int N){
    int* A=(int*) malloc(N*sizeof(int));
    #pragma acc enter data create(A[0:N])
    return A;
}

void deallocate_array(int* A){
    #pragma acc exit data delete(A)
    free(A);
}

void initialize_array(int* A, int N){
    for(int i = 0; i < N; i++){
        A[i] = i;
    }
    #pragma acc update device(A[0:N])
}
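
The opposite direction of synchronization can be sketched in one function. The name double_with_host_fix and its parameters are hypothetical, and the present clause is standard OpenACC that this module does not cover. The host overwrites a single element between two device loops and pushes just that element with update device.

```c
#include <stddef.h>

/* Hypothetical example: doubles a on the device, lets the host
   overwrite a[0], then adds 1 to every element on the device. */
void double_with_host_fix(int *a, size_t n, int fixed_value)
{
    #pragma acc enter data copyin(a[0:n])

    #pragma acc parallel loop present(a[0:n])
    for (size_t i = 0; i < n; i++) {
        a[i] *= 2;
    }

    /* The host changes one element, so the device copy of a[0]
       is out-of-sync until the update directive runs. */
    a[0] = fixed_value;
    #pragma acc update device(a[0:1])

    #pragma acc parallel loop present(a[0:n])
    for (size_t i = 0; i < n; i++) {
        a[i] += 1;
    }

    /* Bring the final values back and release the device memory. */
    #pragma acc exit data copyout(a[0:n])
}
```

Only one element is transferred by the update directive; the rest of the array stays on the device for the whole computation.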
COPYING DATA IN DATA REGIONS

#pragma acc enter data copyin(A[:m*n],Anew[:m*n])

#pragma acc parallel loop copy(A[:m*n],Anew[:m*n])
for( int j = 1; j < n-1; j++)

 But wouldn't this code now result in my arrays being copied twice, once by the `enter data`
directive and then again by the `parallel loop`? In fact, the OpenACC runtime is smart
enough to handle exactly this case. Data will be copied _in_ only the first time it's
encountered in a data clause and _out_ only the last time it's encountered in a data
clause. This allows you to create fully-working directives within your functions and
then later _"hoist"_ the data movement to a higher level without changing your code
at all. This is part of incrementally accelerating your code to avoid incorrect results.
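
The hoisting idea can be sketched with a small smoothing routine. The functions smooth_step and smooth_pairs are hypothetical and use a 1D array for brevity. smooth_step carries its own data clauses, so it works when called alone; smooth_pairs adds enter data and exit data around the repeated calls without touching smooth_step.

```c
#include <stddef.h>

/* Hypothetical 1D smoothing step. Its own data clauses make it
   correct when called without any surrounding data directive. */
void smooth_step(const double *in, double *out, size_t n)
{
    #pragma acc parallel loop copyin(in[0:n]) copy(out[0:n])
    for (size_t i = 1; i + 1 < n; i++) {
        out[i] = 0.5 * (in[i - 1] + in[i + 1]);
    }
}

/* Hypothetical driver: runs pairs of steps (a -> b, then b -> a).
   The data movement is hoisted here; smooth_step is unchanged. */
void smooth_pairs(double *a, double *b, size_t n, int pairs)
{
    /* Both arrays are now on the device, so the clauses inside
       smooth_step find them there and transfer nothing. */
    #pragma acc enter data copyin(a[0:n], b[0:n])

    for (int p = 0; p < pairs; p++) {
        smooth_step(a, b, n);
        smooth_step(b, a, n);
    }

    /* Last time the arrays are encountered: copy out and release. */
    #pragma acc exit data copyout(a[0:n], b[0:n])
}
```

Calling smooth_step directly transfers the arrays on every call, while calling it through smooth_pairs should transfer them once in and once out, regardless of the number of pairs.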
MODULE REVIEW
KEY CONCEPTS
In this module we discussed…
 Why explicit data management is necessary for best performance
 Structured and Unstructured Data Lifetimes
 Explicit and Implicit Data Regions
 The data, enter data, exit data, and update directives
 Data Clauses
LAB ASSIGNMENT
In this module’s lab you will…
 Update the code from the previous module to use explicit data
directives
 Analyze the difference between using CUDA Managed Memory and
explicit data management in the lab code.
