OpenMP offload infrastructure in LLVM

Samuel Antao (IBM), Carlo Bertolli (IBM), Andrey Bokhanko (Intel), Alexandre
Eichenberger (IBM), Hal Finkel (Argonne National Laboratory), Sergey Ostanevich (Intel),
Eric Stotzer (Texas Instruments), Guansong Zhang (AMD)

March 1, 2021

1 OpenMP 4.5 offloading overview


The OpenMP 4.5 specification defines offloading directives that can be used to take advantage of
accelerators, or devices in OpenMP terminology (see the OpenMP 4.5 specification at www.openmp.org).

• Device: an implementation-defined logical execution unit. The execution model is
host-centric such that a host device offloads code and data to target devices.

• Target regions are structured code blocks that execute on a target device. This is conditional
on the run-time availability of a device, the ability of the compiler to generate device code,
and other factors.

• Target declarations are used to specify mapping of global variables to a device and create
device specific versions of functions that can be called from target regions.

• Device data environments contain the variables that are currently present on the target
device.

• A mapped variable is a variable in a (host) data environment with a corresponding variable
in a device data environment.

The target directive creates both a device data environment and a target region. It may have
associated clauses to specify further details, like the exact device to use if more than one is present
in the system (device clause) or whether the data should be moved to/from the device or only
allocated in the device memory (map clause). The declare target directive declares that enclosed
global variables and functions should have corresponding device versions.
Example 1 illustrates how offloading can be expressed using the available set of directives.

2 The goal of the design


• The offload infrastructure should support multiple target device types at runtime and be
extensible in the future with minimal or no changes.

• The infrastructure should determine the availability of target devices at runtime and make a
decision to offload depending on the availability and load of a target device.


// foo() will be implemented for the host and device
#pragma omp declare target
int foo(int[1000]);
#pragma omp end declare target

// All declarations outside a declare target region will NOT be implemented
// for the device
...
int device_count = omp_get_num_devices();
int device_no;
int *red = malloc(device_count * sizeof(int));
int c[1000];

// Several host threads are going to be spawned to execute the structured
// block associated with the parallel region (this is unrelated with the
// offloading support)

#pragma omp parallel for
for (i = 0; i < 1000; i++) {

  device_no = i % device_count;

  // The target directive specifies that the execution of associated
  // structured block (target region) should be transferred to the device.
  //
  // The device clause states that device whose ID is device_no should be
  // used.
  //
  // The first map clause specifies that an instance of c has to be allocated
  // in the device and updated with the host content of c prior the execution
  // of the target region (to:).
  //
  // The second map clause specifies that an instance of red[i] has to be
  // allocated in the device and updated with the host content prior the
  // execution of the target region and that the host instance should be
  // updated with the content of the device instance after the target region
  // execution is complete.
  //
  // If for some reason the device is not available, a host version of the
  // target region is executed instead.
  #pragma omp target device(device_no) map(to: c) map(red[i])
  {
    // This code is going to be executed on the device(s)
    red[i] += foo(c);
  }

  // The execution of the target region is blocking unless the nowait clause is used.
  // If blocking, the dedicated host thread will wait for the device to complete execution.
}

for (i = 0; i < device_count; i++)
  total_red += red[i];
Example 1: Offloading expressed with OpenMP directives.

3 Model of use
The proposed offload mechanism implements the OpenMP 4.5 target data, target and declare
target constructs. The compiler generates calls to the runtime library whenever a target data
or target directive is encountered. The declare target construct will result in the generation of
appropriate target code for the target device.
The target code is stored inside the host binaries, creating fat binaries. Target code is stored
in designated binary regions (e.g. ELF sections) with an appropriate naming convention. The
target code is either target assembly in binary form (ELF, PE, etc.) or a higher-level intermediate
representation (IR) such as LLVM IR or any other type of IR. If the target code is stored as IR,
an implementation can support on-the-fly compilation into target assembly.
A target-independent offload runtime library named libomptarget.so supports multiple target
device types. The libomptarget.so library utilizes device-specific target run-time libraries
(RTLs).
At the start of host code execution libomptarget.so will do the following:

1. Search for a target RTL that supports the target device binary.

2. Verify target RTL interface compliance.

3. Add target RTLs into a list of available target device types.

libomptarget.so provides the host with an actual API to map variables and initiate execution of
target regions on a target device. After libomptarget.so has verified that suitable target code is
present and that a target RTL is ready to execute a target region, the target RTL is invoked via
API routines (described below) to execute the region.
This scheme provides flexibility for generating code for multiple heterogeneous device types.
For example, the target region code in Example 1 can be executed in a Phi™ coprocessor, GPU
or DSP if any of these devices is present in the system (see Figure 1). The design of the offloading
interface does not limit the number and type of devices associated to a host processor or the ability
to use these at the same time. If both devices are present, different iterations of the loop can be
executed on both devices at the same time.

Figure 1: libomptarget.so interface

4 Generation of fat binaries

To make code generation consistent and straightforward the following scheme is proposed:

1. For each source file provided, the driver spawns the execution of preprocessor, compiler and
assembler for the host and each available target device type. This results in the generation of
an object file for each target device type. The toolchain of a given target may be modified so
that it uses the same definitions (header files) as the host toolchain if that suits the system
constraints.

2. Target linkers combine dedicated target objects into target shared libraries, one for each
target device type. The commands passed to the target frontend by the compiler driver
always assume the creation of a shared library even if the commands passed to the driver by
the user specify otherwise. The driver performs the translation of the host frontend commands
to target frontend commands to assure that a target shared library is generated.
3. The host linker combines host object files into an executable/shared library and incorporates
each target shared library as is (no actual linking is done between host and target objects)
into a designated section within the host binary. The format of a binary section for offloading
to a specific device is target-dependent and will thereafter be handled by the target RTL at
runtime.
4. A new driver command-line group option -target-offload=Ti, where Ti is a valid target
triple, specifies which target device types the user wants to support in the execution of
OpenMP target regions. All options following -target-offload=Ti are forwarded to that
device toolchain. The user can specify as many -target-offload=Ti options as there are
device types to support. An example invocation of the compiler would be:

clang -fopenmp -target powerpc64-ibm-linux-gnu
-target-offload=nvptx64-nvidia-cuda -fopenmp -target-offload=x86-pc-linux-gnu
-fopenmp foo.c bar.c -o foobar.bin

for a hypothetical system where the host is a PowerPC processor and the available target
device types are an NVIDIA GPU and an x86 processor.
5. For each source file, the compiler driver will issue commands to create intermediate files
for each possible compilation phase (LLVM IR, assembly, object) and target (host or device).
However, that is not exposed to the user as the driver has the ability to bundle multiple
(related) files generated by different toolchains into a single one. Therefore, when using
separate compilation, the user should invoke the compiler in the same way (except for the
device target specification) as if no offloading support was required. For example:

clang -fopenmp -target powerpc64-ibm-linux-gnu
-target-offload=nvptx64-nvidia-cuda -fopenmp -target-offload=x86-pc-linux-gnu
-fopenmp foo.c bar.c -c

clang -fopenmp -target powerpc64-ibm-linux-gnu
-target-offload=nvptx64-nvidia-cuda -fopenmp -target-offload=x86-pc-linux-gnu
-fopenmp foo.o bar.o -o foobar.bin
The resulting host executable/shared library will depend on the offload runtime library,
libomptarget.so. This library will handle the initialization of target RTLs and translate the
offload interface from compiler-generated code to the target RTL during program execution.
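The initialization of target RTLs can be pictured as a small registry that only accepts RTLs exposing the full interface. The following is a minimal sketch under stated assumptions, not the actual libomptarget.so code: the `rtl_interface` struct, its three function pointers, `register_rtl` and the `demo_*` stubs are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical RTL interface: entry points a plugin is assumed to export. */
typedef struct {
  const char *triple;                        /* target triple served by the RTL */
  int (*number_of_devices)(void);
  int (*is_valid_binary)(const void *image);
  int (*run_region)(int device, void *entry);
} rtl_interface;

#define MAX_RTLS 8

typedef struct {
  const rtl_interface *rtls[MAX_RTLS];       /* available target device types */
  size_t count;
} rtl_list;

/* Interface compliance check: every entry point must be present. */
static int rtl_is_compliant(const rtl_interface *rtl) {
  return rtl != NULL && rtl->triple != NULL &&
         rtl->number_of_devices != NULL && rtl->is_valid_binary != NULL &&
         rtl->run_region != NULL;
}

/* Adds a compliant RTL to the list; returns 1 if added, 0 otherwise. */
static int register_rtl(rtl_list *list, const rtl_interface *rtl) {
  assert(list != NULL);
  if (!rtl_is_compliant(rtl) || list->count == MAX_RTLS)
    return 0;
  list->rtls[list->count++] = rtl;
  return 1;
}

/* Demo stubs standing in for a real device-specific RTL. */
static int demo_number_of_devices(void) { return 1; }
static int demo_is_valid_binary(const void *image) { return image != NULL; }
static int demo_run_region(int device, void *entry) {
  (void)device;
  (void)entry;
  return 0;
}
```

An RTL that lacks any entry point is rejected and never appears in the list, so later offload requests for that device type would fall back to host execution.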

5 Native Runtime
The native runtime implements the following OpenMP 4.5 routines.
• void omp_set_default_device(int device_num) sets the default device ID in the ICV of
the encountering task.

• int omp_get_default_device(void) retrieves the default device ID in the ICV of the
encountering task.

• int omp_get_num_teams(void) returns the number of teams for a target teams construct
executing on the host.

• int omp_get_team_num(void) returns the team number associated with the encountering
task for a target teams construct executing on the host.

Several target constructs have nowait and depend clauses. These clauses enable asynchronous
execution and at the same time enforce the required data dependences. These dependences can
be handled by the native runtime like any other dependences; however, some devices may enforce
dependences between target tasks more efficiently. The offloading infrastructure supports offloading
the enforcement of these dependences directly to the devices. Native runtimes are not forced to
use this infrastructure, as they can always elect to enforce dependences natively.
When electing to offload the enforcement of dependences among target tasks, a native runtime
will need to add two data fields to its target-task data structures: first, a field that contains
the device ID associated with the target task; and second, a field that contains a reference to a
completion synchronization event associated with the completion of the target task. This event
is an abstract data type, whose content is specific to a given device type. A native runtime also
needs to communicate with the offloading infrastructure. This is accomplished with callback
functions that enable the offloading infrastructure to inform the native runtime that a given target
task has completed. This callback enables a native runtime to clean up the dependence data structure
of the completed target task, and to activate other host tasks that may be dependent on the completion
of that target task.
An overview of the flow of operations is as follows. A native runtime obtains a reference
to a synchronization event when launching an asynchronous target task via an asynchronous
libomptarget.so interface call. When handling dependences for a new target task B, if the native
runtime detects that B is dependent on another target task, say A, and that this specific
dependence (A → B) can be handled by their respective devices, then the native runtime will provide
the synchronization event associated with the completion of A to the asynchronous libomptarget.so
interface call that launches B.
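The flow above can be sketched with a hypothetical target-task record that carries the two extra fields (device ID and completion event). This is only an illustration: `target_task`, `completion_event`, `event_for_dependence` and `on_target_task_done` are invented names, and the rule that a device only enforces dependences between its own tasks is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the device-specific, abstract completion event. */
typedef struct completion_event {
  int signalled;
} completion_event;

/* Hypothetical target-task record with the two additional fields. */
typedef struct target_task {
  int device_id;            /* device associated with the target task */
  completion_event *event;  /* completion synchronization event */
  int completed;            /* set by the completion callback */
} target_task;

/* Returns the event to pass to the asynchronous launch of succ, or NULL when
 * the native runtime has to enforce the dependence pred -> succ itself.
 * Assumption: a device can only enforce dependences among its own tasks. */
static completion_event *event_for_dependence(const target_task *pred,
                                              const target_task *succ) {
  assert(pred != NULL && succ != NULL);
  if (pred->completed || pred->event == NULL)
    return NULL; /* nothing left for the device to wait on */
  if (pred->device_id != succ->device_id)
    return NULL; /* fall back to native enforcement */
  return pred->event;
}

/* Callback the offloading infrastructure would invoke when a task is done. */
static void on_target_task_done(target_task *task) {
  assert(task != NULL);
  task->completed = 1; /* the native runtime may now release dependent tasks */
  task->event = NULL;
}
```

Once the callback has fired for A, a later query for the dependence A → B returns NULL, which matches the idea that the native runtime cleans up its dependence records on completion.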

6 The libomptarget.so synchronous interface


The offload library implements the OpenMP 4.5 user-level runtime library routines:

• int omp_get_num_devices(void)

• int omp_get_initial_device(void)

• int omp_is_initial_device(void)

The offload library implements the following compiler-level runtime library routines:

• void __tgt_register_lib(__tgt_bin_desc *desc)

Register the libomptarget.so library and initialize target state (i.e. global variables and
target entry points) for the current host shared library/executable and the corresponding
target execution images that have those entry points implemented. This does not trigger
any execution in any target as any real work with the target device can be postponed until
the first target region is encountered during execution. This function is expected to appear
only once per host shared library/executable in the .init section and is called before
any constructors or static initializers are called for the host. The name of the caller of
__tgt_register_lib follows the same pattern of the C++ initializers in Clang and is set to
_GLOBAL__A_000000_OPENMPTGT.

• void __tgt_target_data_begin(int32_t device_id, int32_t num_args, void **args_base,
void **args, int64_t *args_size, int32_t *args_maptype)

Initiate a device data environment. It maps variables from the host data environment to
the device data environment by recording the mapping between the references of variables
used in the host and target into the libomptarget.so internal structures. The associated
variables in the target device data environment are initialized according to the map-type. In
the event a given associated variable has already been mapped in another enclosing device data
environment, no action is taken for that variable.

• void __tgt_target_data_end(int32_t device_id, int32_t num_args, void **args_base,
void **args, int64_t *args_size, int32_t *args_maptype)

Close a device data environment. It removes mapped variables from the current device data
environment, releases target memory and destroys the mappings created by the
__tgt_target_data_begin() call that initiated the current device data environment. It assigns
host variables the value of the corresponding device data environment variable according to
the map-type.

• void __tgt_target_data_update(int32_t device_id, int32_t num_args, void **args_base,
void **args, int64_t *args_size, int32_t *args_maptype)

Make the value of a set of variables consistent between the host device and a target device.
If the variable’s map type is from, use the value of the variable on the target device. If a
variable’s map type is to, use the value of the variable on the host device.

• int32_t __tgt_target(int32_t device_id, void *host_addr, int32_t num_args,
void **args_base, void **args, int64_t *args_size, int32_t *args_maptype)

Perform the same actions as __tgt_target_data_begin if num_args is non-zero and launch
the execution of the target region on the target device; if num_args is non-zero, after the
region execution is done it also performs the same actions as __tgt_target_data_end above. If
offloading fails, an error code is returned, which notifies the caller that the associated target
region has to be executed by the host. The return code can be used as an error code which
will give the compiler and run-time the freedom to implement optimized behaviors.

• int32_t __tgt_target_teams(int32_t device_id, void *host_addr, int32_t num_args,
void **args_base, void **args, int64_t *args_size, int32_t *args_maptype,
int32_t num_teams, int32_t thread_limit)

This is an extension of __tgt_target where the caller is able to specify the (maximum) number
of teams and threads in each team that should be used by libomptarget.so when it launches
the execution of the region. It reflects the common nesting of target and teams directives
in OpenMP 4.5.

All the __tgt_target_... calls presented above perform an initial check to determine whether the
target specified by device_id was already initialized and, if not, trigger that initialization. The
information registered by __tgt_register_lib is used to accomplish that.
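The host fallback implied by the return code of the target launch call can be sketched as follows. This is a hedged illustration, not compiler output: `launch_region`, `region_args`, the function pointer types and the `demo_*` stubs are hypothetical, the argument arrays are bundled into a struct only for brevity, and the convention that zero means success is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Argument arrays of the interface, bundled for brevity (illustrative). */
typedef struct {
  int32_t num_args;
  void **args_base;
  void **args;
  int64_t *args_size;
  int32_t *args_maptype;
} region_args;

/* Stand-in for the offload call and for the host version of the region. */
typedef int32_t (*offload_fn)(int32_t device_id, void *host_addr,
                              const region_args *ra);
typedef void (*host_fn)(const region_args *ra);

/* Returns 1 if the region ran on a device, 0 if the host version ran.
 * Assumption: the offload call returns zero on success. */
static int launch_region(offload_fn offload, host_fn host_version,
                         int32_t device_id, void *host_addr,
                         const region_args *ra) {
  assert(host_version != NULL && ra != NULL);
  if (offload != NULL && offload(device_id, host_addr, ra) == 0)
    return 1;
  host_version(ra); /* offloading failed: execute the region on the host */
  return 0;
}

/* Demo stubs: one offload call that always fails, one that succeeds. */
static int32_t demo_offload_fails(int32_t device_id, void *host_addr,
                                  const region_args *ra) {
  (void)device_id; (void)host_addr; (void)ra;
  return -1;
}
static int32_t demo_offload_ok(int32_t device_id, void *host_addr,
                               const region_args *ra) {
  (void)device_id; (void)host_addr; (void)ra;
  return 0;
}
/* Host version of a region that increments its first (int) argument. */
static void demo_host_region(const region_args *ra) {
  *(int *)ra->args[0] += 1;
}
```

Keeping the host version next to the offload call means the program still produces a result when no device or RTL is available.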

6.1 Arguments for the libomptarget.so calls


The following describes the arguments used by the libomptarget.so interface and how they are
used.

__tgt_bin_desc *desc points to a constant data struct statically defined by the compiler:
struct __tgt_bin_desc {
  uint32_t NumDevices;
  __tgt_device_image *DeviceImages;
  __tgt_offload_entry *EntriesBegin;
  __tgt_offload_entry *EntriesEnd;
};
NumDevices is the number of device types whose execution image was generated by the compiler
in order to implement some of the offloading entry points. The device types are specified by the
user during the invocation of the compiler by passing the group option -target-offload=Ti, where
Ti is the triple of a target the user wants to support. If n group options -target-offload=Ti
associated with different device triples are specified when the compiler is invoked, NumDevices
will be n.
DeviceImages is a pointer to an array of NumDevices elements, whose element type is:
struct __tgt_device_image {
  void *ImageStart;
  void *ImageEnd;
  __tgt_offload_entry *EntriesBegin;
  __tgt_offload_entry *EntriesEnd;
};
where ImageStart and ImageEnd contain the addresses where a target image associated to the cur-
rent host executable/shared library for a given device type starts and ends, respectively. ImageEnd
is non-inclusive, i.e. it points to the byte immediately after the target image ends.
EntriesBegin and EntriesEnd point to the first and last element of an array that contains the
information of each global variable and target entry point that require a map between host and
target. These pointers are present in both __tgt_bin_desc and __tgt_device_image so that the
target dependent runtime (see Section 5) can use this information as well to more easily retrieve
the entries from the target image (e.g. to retrieve the symbol names of the entries; see below).
Each element of the array pointed to by EntriesBegin has type:
struct __tgt_offload_entry {
  void *addr;
  char *name;
  int64_t size;
};
where addr is the address of that global variable or entry point in the host, name is the name of
the symbol that refers to that global variable or entry point, and size is the size in bytes of the
global variable or zero if it is an entry point. If the address field is set to NULL, the corresponding
entry in the table is not supported by the target device associated to this table.
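As a rough illustration of how a runtime might consume these descriptors, the sketch below mirrors the two structs locally and walks an entry table. The type and function names (`offload_entry`, `device_image`, `image_size`, `supported_entries`, `is_entry_point`) are made up for the example. It also assumes the entry range is half-open, in the same way ImageEnd is non-inclusive; if EntriesEnd instead points at the last element, the loop bound would have to include it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the descriptor layouts (names are illustrative). */
typedef struct {
  void *addr;    /* host address, NULL if unsupported by this device */
  char *name;    /* symbol name */
  int64_t size;  /* size in bytes, zero for a target entry point */
} offload_entry;

typedef struct {
  void *ImageStart;
  void *ImageEnd;               /* non-inclusive */
  offload_entry *EntriesBegin;
  offload_entry *EntriesEnd;    /* assumed one past the final entry */
} device_image;

/* Size of the embedded target image in bytes. */
static size_t image_size(const device_image *img) {
  assert(img != NULL);
  return (size_t)((const char *)img->ImageEnd - (const char *)img->ImageStart);
}

/* Counts the entries the device supports, i.e. those with a non-NULL addr. */
static size_t supported_entries(const offload_entry *begin,
                                const offload_entry *end) {
  size_t n = 0;
  for (const offload_entry *e = begin; e != end; ++e)
    if (e->addr != NULL)
      ++n;
  return n;
}

/* An entry with size zero denotes a target entry point, not a variable. */
static int is_entry_point(const offload_entry *e) {
  assert(e != NULL);
  return e->size == 0;
}
```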

libomptarget.so has to be able to map the host entries to the corresponding device entries.
There are different strategies that can be used for mapping the entries. The simplest one is
to use the names of the entries for the mapping by performing a name-based search using the
name field of the __tgt_offload_entry. The name field is useful for targets whose runtime requires
access to the symbol names in order to locate the corresponding address in the target image. The
array that starts at EntriesBegin is built by the compiler in conjunction with the linker, which
forwards sequences of entries of different compilation units to the same binary section. In the event
target and host toolchains can provide strict ordering for both target and host tables, the
mapping can be done by the sequence number of the entry. The frontend ensures that the global
variables/entries follow the same order for both host and device. If the toolchain of a given target
does not preserve the order, that target runtime may consult the host entries and obtain the same
order of the symbols based on the name.
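The two mapping strategies can be combined in a few lines. The following sketch is not taken from libomptarget.so: `entry` is a simplified stand-in for the offload entry struct and `find_device_entry` is a hypothetical helper. It first tries the same sequence number, which succeeds when both toolchains preserve the order, and otherwise falls back to a name-based search.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified entry: only the fields needed for the mapping. */
typedef struct {
  void *addr;
  const char *name;
} entry;

/* Finds the device entry that corresponds to host entry `index`.
 * Returns NULL if the device table has no entry with that name. */
static const entry *find_device_entry(const entry *host, size_t index,
                                      const entry *dev, size_t dev_count) {
  assert(host != NULL && dev != NULL);
  const char *wanted = host[index].name;

  /* Fast path: same position in both tables (order was preserved). */
  if (index < dev_count && strcmp(dev[index].name, wanted) == 0)
    return &dev[index];

  /* Slow path: name-based search over the whole device table. */
  for (size_t i = 0; i < dev_count; ++i)
    if (strcmp(dev[i].name, wanted) == 0)
      return &dev[i];
  return NULL;
}
```

The positional check costs a single string comparison, so a toolchain that preserves ordering pays almost nothing for the name-based safety net.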

Static initializers and global destructors associated with device entries are implemented as
target regions and executed in the same order they are required in the program. The callers of the
constructors and destructors are always launched with a single thread and team. Example 2 shows
a code snippet that would require the creation of these target regions.

#pragma omp declare target
class C {
public:
  C() { /* ctor of C */ }
  ~C() { /* dtor of C */ }
};
C a;
#pragma omp end declare target

void foo() {
#pragma omp target
  { /* target region 1 */ }
}

#pragma omp declare target
C b;
#pragma omp end declare target

void bar() {
#pragma omp target
  { /* target region 2 */ }
}

Example 2: Example requiring the creation of target regions to implement the device variables
global initializers and destructors.

int32_t device_id is an integer that uniquely identifies a given target. In the first call of
__tgt_register_lib, libomptarget.so detects the available target-dependent RTLs in the system
and uses them to query the number of devices of each type that are ready to be used. If A devices
of type T1 and B devices of type T2 are found, where T1 and T2 are the types specified in that
exact order by the user with -target-offload=Ti, device_id values in [0, A) will map to devices
of type T1 and device_id values in [A, A + B) will map to devices of type T2. If using separate
compilation, the user must use the same order of triples in the separate compiler runs; otherwise
a unique map between device IDs and types cannot be assumed. If device_id is greater than or
equal to A + B, the call where it is used will fail (the region would be executed by the host) and
no further action will be taken by libomptarget.so. On top of the non-negative values used for
device_id, the compiler also employs three reserved values:

• device_id = −1: informs the runtime that the user has not specified any device ID, and
therefore the default must be used, which may be specified through an environment variable
as specified in the OpenMP 4.5 specification.

• device_id = −2: informs the runtime that the target action must be performed on all
available devices and can be delayed until the first time the device action is invoked with a
device ID greater than or equal to −1. This is mainly used to call C++ global initializers, which
only need to be called if the device is eventually used for executing at least one target region.

• device_id = −3: informs the runtime that the target action must be performed on all
available devices that were ever used in the current library. This is mainly used to call C++
destructors, which are only required if that device was used before.
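The device ID numbering described above amounts to a prefix-sum lookup over the per-type device counts. The sketch below is only an illustration with invented names (`resolve_device`, `device_slot`, and the `DEVICE_*` constants for the reserved values); it handles −1 by substituting a caller-supplied default and reports the set-valued IDs −2 and −3 as not resolvable to a single device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reserved device IDs (constant names are illustrative). */
enum { DEVICE_DEFAULT = -1, DEVICE_ALL_LAZY = -2, DEVICE_ALL_USED = -3 };

typedef struct {
  int type_index;       /* position of the device type in command-line order */
  int32_t local_index;  /* index of the device within its type */
} device_slot;

/* counts[t] is the number of ready devices of type t, in the order in which
 * the types were given on the command line. Returns 1 and fills *slot on
 * success, 0 if device_id does not name a single available device. */
static int resolve_device(int32_t device_id, int32_t default_id,
                          const int32_t *counts, size_t num_types,
                          device_slot *slot) {
  assert(counts != NULL && slot != NULL);
  if (device_id == DEVICE_DEFAULT)
    device_id = default_id;
  if (device_id < 0)
    return 0; /* -2 and -3 address sets of devices, not one device */

  int32_t first = 0; /* first global ID that belongs to type t */
  for (size_t t = 0; t < num_types; ++t) {
    if (device_id < first + counts[t]) {
      slot->type_index = (int)t;
      slot->local_index = device_id - first;
      return 1;
    }
    first += counts[t];
  }
  return 0; /* ID beyond the last device: the host executes the region */
}
```

With counts {2, 3}, for example, IDs 0 and 1 select the first type and IDs 2 to 4 select the second, which is why the order of triples must match across separate compiler runs.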

int32_t num_args is the number of data pointers that require a mapping.

void **args is a pointer to an array with num_args elements whose elements point to the first
byte of the array section that needs to be mapped.

int64_t *args_size is a pointer to an array with num_args elements whose elements contain
the size in bytes of the array section to be mapped.

void **args_base is a pointer to an array with num_args elements whose elements point to
the base address of the declaration the mapping refers to. args_base differs from args if an array
section does not start at zero. libomptarget.so needs to know the base addresses in order to
relate mapped data with target region arguments that are dereferenced in the target region body.
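To make the difference between args_base and args concrete, the sketch below fills the three per-argument values for a section of an int array. `map_arg`, `describe_int_section` and `section_offset` are hypothetical helpers written for this illustration; they are not part of the libomptarget.so interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One argument of a mapping call (illustrative bundling of three arrays). */
typedef struct {
  void *base;    /* would go into args_base */
  void *begin;   /* would go into args */
  int64_t size;  /* would go into args_size */
} map_arg;

/* Describes the section of `length` elements starting at array[first]. */
static map_arg describe_int_section(int *array, size_t first, size_t length) {
  assert(array != NULL);
  map_arg m;
  m.base = array;                            /* base of the declaration */
  m.begin = array + first;                   /* first byte to be mapped */
  m.size = (int64_t)(length * sizeof(int));  /* bytes to be mapped */
  return m;
}

/* Byte distance between the base address and the start of the section;
 * zero when the section starts at element zero. */
static ptrdiff_t section_offset(const map_arg *m) {
  assert(m != NULL);
  return (const char *)m->begin - (const char *)m->base;
}
```

For a section that starts at element 10, args points 10 * sizeof(int) bytes past args_base, and a runtime can use that offset to relate the copied bytes to the pointer the region body dereferences.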

int32_t *args_maptype is a pointer to an array with num_args elements whose elements
contain the required map attributes as specified in the enum:
enum tgt_map_type {
  OMP_TGT_MAPTYPE_TO          = 0x0001,
  OMP_TGT_MAPTYPE_FROM        = 0x0002,
  OMP_TGT_MAPTYPE_ALWAYS      = 0x0004,
  OMP_TGT_MAPTYPE_DELETE      = 0x0008,
  OMP_TGT_MAPTYPE_MAP_PTR     = 0x0010,
  OMP_TGT_MAPTYPE_FIRST_REF   = 0x0020,
  OMP_TGT_MAPTYPE_RETURN_PTR  = 0x0040,
  OMP_TGT_MAPTYPE_PRIVATE_PTR = 0x0080,
  OMP_TGT_MAPTYPE_PRIVATE_VAL = 0x0100
};
OMP_TGT_MAPTYPE_TO instructs the runtime to copy host data to the device in a
__tgt_target_data_begin, __tgt_target_data_update, or before the kernel execution is triggered by
__tgt_target. OMP_TGT_MAPTYPE_FROM instructs the runtime to copy device data to
the host in a __tgt_target_data_end, __tgt_target_data_update, or when finalizing __tgt_target.
OMP_TGT_MAPTYPE_ALWAYS forces the copying regardless of the reference count associated
with the map. OMP_TGT_MAPTYPE_DELETE forces the unmapping of the object in a target
data end or when completing a target construct. OMP_TGT_MAPTYPE_MAP_PTR forces the
runtime to map the pointer variable as well as the pointee variable. This attribute also instructs the
runtime to initialize the value of the pointer on the device with the base device address of the pointee
mapped variable. The attributes OMP_TGT_MAPTYPE_TO, OMP_TGT_MAPTYPE_FROM,
OMP_TGT_MAPTYPE_ALWAYS, and OMP_TGT_MAPTYPE_DELETE only apply to the
pointee variable.
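How the first four attributes might interact with a reference count is sketched below. This is an interpretation, not the documented behavior of libomptarget.so: the helper names and the `MAPTYPE_*` constants are local to the example, and the policy (copy in only on the first mapping unless ALWAYS is set, copy out only when the count drops to zero unless ALWAYS is set, DELETE drops the count to zero) is an assumption modeled on the OpenMP 4.5 map semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the first four attribute values from the enum above. */
enum {
  MAPTYPE_TO = 0x0001,
  MAPTYPE_FROM = 0x0002,
  MAPTYPE_ALWAYS = 0x0004,
  MAPTYPE_DELETE = 0x0008
};

/* Host-to-device copy when entering a data environment: requires TO, and
 * either this is the first mapping of the variable or ALWAYS is set. */
static int copies_to_device(int32_t maptype, int32_t refcount_before) {
  assert(refcount_before >= 0);
  return (maptype & MAPTYPE_TO) &&
         (refcount_before == 0 || (maptype & MAPTYPE_ALWAYS));
}

/* Reference count after leaving a data environment: DELETE forces the
 * unmapping, otherwise the count is decremented by one. */
static int32_t refcount_after_exit(int32_t maptype, int32_t refcount_before) {
  assert(refcount_before > 0);
  return (maptype & MAPTYPE_DELETE) ? 0 : refcount_before - 1;
}

/* Device-to-host copy when leaving: requires FROM, and either the variable
 * is about to be unmapped or ALWAYS is set. */
static int copies_to_host(int32_t maptype, int32_t refcount_before) {
  int last = refcount_after_exit(maptype, refcount_before) == 0;
  return (maptype & MAPTYPE_FROM) && (last || (maptype & MAPTYPE_ALWAYS));
}
```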
OMP_TGT_MAPTYPE_FIRST_REF informs the runtime that this is the first occurrence of
this mapped variable within this construct, or that the associated map information is the first one
that relates to that variable (the map of a variable can result in multiple sets of map information
being passed to the runtime). When used with OMP_TGT_MAPTYPE_MAP_PTR, it informs the
runtime that this is the first occurrence of the pointer mapped variable; a pointee mapped variable
is considered by definition as its first occurrence. These attributes are used to determine when to
update the reference count associated with a mapped variable, and when to include the base device
address of a mapped variable in the device instance of a pointer variable.
OMP_TGT_MAPTYPE_RETURN_PTR instructs the runtime to return the base device
address of the mapped variable in the corresponding args_base entry. When used with the
OMP_TGT_MAPTYPE_MAP_PTR attribute, the runtime returns the base device address of
the pointer mapped variable. This attribute is used to implement the use_device_ptr clause.
The next two flags indicate to the runtime that it is not dealing with mapped variables. First,
OMP_TGT_MAPTYPE_PRIVATE_PTR informs the runtime that the described variable is a
private variable. First-private variables are indicated by using the OMP_TGT_MAPTYPE_TO
attribute in conjunction with the private attribute. The runtime is responsible for allocating
the device memory associated with the variable, but it is not mapped as it is private. Second,
OMP_TGT_MAPTYPE_PRIVATE_VAL instructs the runtime to simply forward the value of the
args_base parameter to the target construct. This attribute is used to implement the is_device_ptr
clause. This attribute can also be used by the compiler to forward small first-private values directly
to the device via the kernel parameters. No allocation of memory occurs with this attribute.
The fields in the interface are used in three different ways depending on the presence of the
OMP_TGT_MAPTYPE_MAP_PTR and OMP_TGT_MAPTYPE_PRIVATE_VAL attributes.

• Variables without the OMP_TGT_MAPTYPE_MAP_PTR and
OMP_TGT_MAPTYPE_PRIVATE_VAL attributes. args_base contains the references
to the variables that are being mapped. args points to the beginning of the data as requested
by the user. For scalars, args_base and args are identical. For arrays, the user may request a
subset of an array, in which case args points to the beginning of that array subset. Similarly,
for structs, the user may request to map only a subset of the struct fields, in which case args
points to the first mapped field of the struct. args_size contains the sizes of the variables, or of the
appropriate subsets of the variables, that are being mapped. args_maptype contains the
attributes that control the data movement between host and device.

• Variables with the OMP_TGT_MAPTYPE_MAP_PTR attribute and without the
OMP_TGT_MAPTYPE_PRIVATE_VAL attribute. Arguments for mapped pointers describe
both the pointer and pointee variables. args_base contains the references to the pointer
variables to be mapped. If needed, the references to the pointee variables are found by
dereferencing the args_base pointers. args contains the beginning of the data of the pointee
variables. For arrays and structs, args points to the beginnings of the subsets of the pointee
variables that are being mapped. args_size contains the sizes of the pointee variables, or of
the appropriate subsets of the variables, that are being mapped. args_maptype contains the
attributes that control the data movement of the pointee variable between host and device.
By definition, the values of the mapped pointer variables on the device are never copied back
to the host.

• Variables with the OMP_TGT_MAPTYPE_PRIVATE_VAL attribute. The values to forward
are found in the args_base arguments; the args and args_size values are unused.

Zero-length array references are identified by their args_size values of zero; since they
cannot have an offset, their args values are unused.
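The three interpretations above can be sketched as a small classification routine. This is only an illustration: the bit values assigned to the attributes and the names entry_kind_t, classify_entry and is_zero_length are assumptions introduced here, since the text names the attributes but does not give their encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values: the document names these attributes but does
   not specify how they are encoded. */
enum {
  OMP_TGT_MAPTYPE_TO          = 0x001,
  OMP_TGT_MAPTYPE_FROM        = 0x002,
  OMP_TGT_MAPTYPE_MAP_PTR     = 0x020,
  OMP_TGT_MAPTYPE_PRIVATE_VAL = 0x100
};

/* The three ways in which the interface fields are interpreted. */
typedef enum {
  ENTRY_PLAIN,         /* args_base: variable; args: start of mapped data */
  ENTRY_MAPPED_PTR,    /* args_base: pointer variable; args: pointee data */
  ENTRY_FORWARD_VALUE  /* args_base holds the value; args, args_size unused */
} entry_kind_t;

/* Decide how one entry must be read, based only on the presence of the
   MAP_PTR and PRIVATE_VAL attributes. */
static entry_kind_t classify_entry(int32_t maptype) {
  if (maptype & OMP_TGT_MAPTYPE_PRIVATE_VAL)
    return ENTRY_FORWARD_VALUE;
  if (maptype & OMP_TGT_MAPTYPE_MAP_PTR)
    return ENTRY_MAPPED_PTR;
  return ENTRY_PLAIN;
}

/* A zero-length array reference is recognised by a size of zero. Entries
   that forward a value are excluded because their size field is unused. */
static int is_zero_length(int32_t maptype, int64_t size) {
  return classify_entry(maptype) != ENTRY_FORWARD_VALUE && size == 0;
}
```

Testing PRIVATE_VAL first reflects the third case, which ignores args and args_size whatever else is set; a real implementation may order these checks differently.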
Table 1 shows the content of the arrays passed to __tgt_target_data_begin() and __tgt_target()
as a result of the target data and target pragmas in Example 3.
Variable i is also passed to __tgt_target() because it is captured in the body of the target region
and is therefore an argument to the target region. Variable pA is also captured and, as a captured
pointer, it is mapped as a zero-length array. The libomptarget.so implementation will detect
that s1 was mapped before and will not take any action to map this variable again. Variable C
illustrates the use of use_device_ptr and is_device_ptr.
The arguments that are used to invoke the target kernel (see void **target_vars_ptr in Section 8)
consist of the mapped base addresses of all elements. In the example above the arguments would be
&di, &dpA, &dA[0], &ds1, &dC, where dX is the map of X in the device.

// N, M and S are constants
void foo() {
  int A[N], D[N], *pA, i;
  struct S1 { int x, y, B[M], *pB, u, v; } s1;
  int C[N];

  pA = (int *)malloc(N * sizeof(int));
  s1.pB = (int *)malloc(M * sizeof(int));

  #pragma omp target data map(pA[0:M], s1, C) use_device_ptr(C)
  {
    /* C now corresponds to the device pointer associated with mapped
       variable C, and can be used in a CUDA call, for example */
    #pragma omp target map(to: A[S:M], s1.B[0:M]) map(from: s1.pB[0:M]) \
        is_device_ptr(C) firstprivate(D[0:M])
    {
      for (i = S; i < M; ++i) {
        ++A[i];
        --pA[i];
        s1.pB[i-S] = s1.B[i-S] + A[i] * pA[i] - C[i] * D[i];
      }
    }
  }
}

Example 3: Example requiring mapping of pointer.

7 The libomptarget.so asynchronous interface


This section describes the libomptarget.so interface dedicated to handling asynchronous target
task constructs. We first introduce an abstract data type that hides the implementation details of the
RTL-specific synchronization events from the RTL interface.


args_base  args        args_size                     args_maptype                        comment

#pragma omp target data: __tgt_target_data_begin()
pA         &pA[0]      M*sizeof(int)                 TO | FROM | FIRST_REF               mapped array subset
&s1        &s1         sizeof(S1)                    TO | FROM | FIRST_REF               mapped struct
&C         &C          N*sizeof(int)                 TO | FROM | FIRST_REF | RETURN_PTR  mapped array for which the compiler wants the device address

#pragma omp target: __tgt_target()
pA         undefined   0                             TO | FROM | FIRST_REF               zero-length first-private pointer to a mapped array
&A         &A[S]       M*sizeof(int)                 TO | FIRST_REF                      mapped array
&s1        &s1.B[0]    M*sizeof(int)+sizeof(void*)   FIRST_REF                           allocation of mapped struct subset
&s1        &s1.B[0]    M*sizeof(int)                 TO                                  memory move for subset of mapped struct
&s1.pB     &s1.pB[0]   M*sizeof(int)                 FROM | MAP_PTR                      mapped pointer and pointee array subset
C          undefined   undefined                     PRIVATE_VAL                         device pointer forwarded as value (is_device_ptr)
i          undefined   undefined                     PRIVATE_VAL                         first-private scalar passed as value
&D         &D[0]       M*sizeof(int)                 PRIVATE_PTR | TO                    first-private array

Table 1: Contents of the arrays passed through the interface for Example 3.
OMP_TGT_MAPTYPE_ prefixes are omitted for conciseness.
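
As an illustration of the first three rows of Table 1, the following sketch fills the four parallel arrays that a compiler could pass for the target data pragma of Example 3. The flag encodings (MT_*), the values chosen for N and M, and the helper names target_data_args_t, fill_target_data_args and demo_total_mapped_bytes are assumptions made for this sketch; they are not part of the interface.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { N = 8, M = 4 };  /* stand-ins for the constants of Example 3 */

/* Hypothetical encodings of the map-type attributes used in Table 1. */
enum { MT_TO = 0x1, MT_FROM = 0x2, MT_FIRST_REF = 0x10, MT_RETURN_PTR = 0x40 };

struct S1 { int x, y, B[M], *pB, u, v; };

/* The four parallel arrays of the interface, for three mapped variables. */
typedef struct {
  void   *args_base[3];
  void   *args[3];
  int64_t args_size[3];
  int32_t args_maptype[3];
} target_data_args_t;

/* Fill the arrays as in the "target data" rows of Table 1. */
static void fill_target_data_args(target_data_args_t *a, int *pA,
                                  struct S1 *s1, int (*C)[N]) {
  const int32_t tofrom_first = MT_TO | MT_FROM | MT_FIRST_REF;

  /* pA[0:M]: the base is the pointer value, the data starts at element 0. */
  a->args_base[0] = pA;
  a->args[0] = pA;
  a->args_size[0] = (int64_t)(M * sizeof(int));
  a->args_maptype[0] = tofrom_first;

  /* s1: the whole struct is mapped, so base and begin coincide. */
  a->args_base[1] = s1;
  a->args[1] = s1;
  a->args_size[1] = (int64_t)sizeof(struct S1);
  a->args_maptype[1] = tofrom_first;

  /* C: whole array, with the device address requested (use_device_ptr). */
  a->args_base[2] = C;
  a->args[2] = C;
  a->args_size[2] = (int64_t)sizeof(*C);
  a->args_maptype[2] = tofrom_first | MT_RETURN_PTR;
}

/* Build the arrays for local variables and return the total mapped size. */
static int64_t demo_total_mapped_bytes(void) {
  int C[N];
  struct S1 s1 = {0};
  int *pA = malloc(N * sizeof(int));
  if (pA == NULL)
    return -1;

  target_data_args_t a;
  fill_target_data_args(&a, pA, &s1, &C);

  int64_t total = 0;
  for (int i = 0; i < 3; ++i)
    total += a.args_size[i];
  free(pA);
  return total;
}
```

No data is moved here; the sketch only shows which address and size land in which array for each row.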


• typedef void *__tgt_event_t;

We also introduce the signature of a callback function that will be used by libomptarget.so
to indicate to the native runtime that a given asynchronous target task has fully completed.

• typedef void *__tgt_callback_param_t;

• typedef void (*__tgt_callback_fct)(__tgt_callback_param_t param);

If a native runtime elects to offload the enforcement of target-task-to-target-task dependences
to a device, it can use the following functions to enable device-enforced dependences.

• void __tgt_init_dep(__tgt_callback_fct completion_callback)

This call indicates to the offloading infrastructure the callback function that must be called
once an asynchronous target task has completed. This initialization call must be performed once,
prior to processing any asynchronous operations.

• int32_t __tgt_enforce_dep(int32_t from_device_id, int32_t to_device_id,
  __tgt_event_t completion_event)

This call returns an integer that indicates whether a dependence from a first target task (with device
ID from_device_id and associated with a completion synchronization event completion_event)
to a second target task (scheduled to execute on the device with device ID to_device_id) can
be satisfied by the devices. If the call returns zero, this dependence must be enforced
on the host. If the call returns a nonzero value, this dependence must now be enforced by
the devices.

• void __tgt_release_dep(__tgt_sync_event_t completion_event)

This call indicates to the offloading libraries that the native runtime will issue no further
requests to enforce dependences tagged with the synchronization event completion_event.
This call enables the offloading libraries to manage the live ranges of the data structures
associated with synchronization events.

• int32_t __tgt_process_dep()

This call enables the offloading libraries to process dependences. The call returns zero if no
further dependence processing is required; a nonzero value indicates that a further call may
be required in the future. The native runtime must call this function while a host task is
waiting for some target-task dependences to be resolved by the devices. This function enables
the offloading libraries to poll for completed synchronization events that require
host-side processing, for example to clean up the host-side dependence data structures.
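
The contract of the last call can be illustrated with a host-side polling loop. The stubs below (stub_process_dep, stub_enforce_dep) and the counter pending_cleanups are hypothetical stand-ins written for this sketch; they only mimic the return-value conventions described above and say nothing about how libomptarget.so implements them. The rule that only same-device dependences can be enforced by the devices is likewise an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state of the offloading library: number of device-enforced
   dependences that still need host-side clean-up. */
static int32_t pending_cleanups = 3;

/* Stub following the contract described for __tgt_process_dep(): process
   one pending item and return zero once nothing further is required. */
static int32_t stub_process_dep(void) {
  if (pending_cleanups > 0)
    --pending_cleanups;
  return pending_cleanups;
}

/* Stub following the contract described for __tgt_enforce_dep(): nonzero
   means the devices enforce the dependence, zero means the host must.
   Assumption of this sketch: only same-device dependences are enforceable. */
static int32_t stub_enforce_dep(int32_t from_device_id, int32_t to_device_id,
                                void *completion_event) {
  (void)completion_event;  /* unused in this simplified model */
  return from_device_id == to_device_id;
}

/* What a native runtime might do while a host task waits on target-task
   dependences: keep polling until the library reports no remaining work.
   Returns the number of polls that were needed. */
static int wait_for_device_dependences(void) {
  int polls = 0;
  do {
    ++polls;
  } while (stub_process_dep() != 0);
  return polls;
}
```

A production runtime would presumably interleave this polling with other useful host work instead of spinning.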

The following calls are asynchronous versions of libomptarget.so calls seen in Section 6.

• void __tgt_target_async_data_begin(int32_t device_id, int32_t num_args, void **args_base,
  void **args, int64_t *args_size, int32_t *args_maptype, int32_t num_dep_events,
  __tgt_event_t *dep_events, __tgt_event_t *completion_event, __tgt_callback_param_t param)


• void __tgt_target_async_data_end(int32_t device_id, int32_t num_args, void **args_base,
  void **args, int64_t *args_size, int32_t *args_maptype, int32_t num_dep_events,
  __tgt_event_t *dep_events, __tgt_event_t *completion_event, __tgt_callback_param_t param)

• void __tgt_target_async_data_update(int32_t device_id, int32_t num_args, void **args_base,
  void **args, int64_t *args_size, int32_t *args_maptype, int32_t num_dep_events,
  __tgt_event_t *dep_events, __tgt_event_t *completion_event, __tgt_callback_param_t param)

• int32_t __tgt_target_async(int32_t device_id, void *host_addr, int32_t num_args,
  void **args_base, void **args, int64_t *args_size, int32_t *args_maptype,
  int32_t num_dep_events, __tgt_event_t *dep_events, __tgt_event_t *completion_event,
  __tgt_callback_param_t param)

• int32_t __tgt_target_async_teams(int32_t device_id, void *host_addr, int32_t num_args,
  void **args_base, void **args, int64_t *args_size, int32_t *args_maptype,
  int32_t num_teams, int32_t thread_limit, int32_t num_dep_events, __tgt_event_t *dep_events,
  __tgt_event_t *completion_event, __tgt_callback_param_t param)

Operations for the above calls will only occur after the num_dep_events synchronization events
listed in dep_events are satisfied. Each call returns in completion_event the synchronization event
indicating the completion of its operations. Upon completion, the offloading interface will call back
the runtime, passing the callback parameter param provided in the above calls.
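
The ordering rule in the paragraph above (wait for all dep_events, then signal completion_event, then invoke the registered callback with param) can be modelled in a few lines. Everything in this sketch is hypothetical: sketch_event stands in for the opaque event type, and sketch_init_dep and sketch_try_async_op are simplified, single-threaded stand-ins, not the libomptarget.so entry points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int done; } sketch_event;  /* hypothetical event model */
typedef void *callback_param_t;
typedef void (*callback_fct)(callback_param_t);

/* Callback registered once, in the spirit of __tgt_init_dep(). */
static callback_fct completion_callback = NULL;

static void sketch_init_dep(callback_fct cb) { completion_callback = cb; }

/* Attempt an asynchronous operation. It runs only when every listed
   dependence event is satisfied. Returns 1 if it ran, 0 if it must wait. */
static int sketch_try_async_op(int32_t num_dep_events,
                               sketch_event **dep_events,
                               sketch_event *completion_event,
                               callback_param_t param) {
  for (int32_t i = 0; i < num_dep_events; ++i)
    if (!dep_events[i]->done)
      return 0;
  /* The data transfer or kernel launch would take place here. */
  completion_event->done = 1;     /* signal the completion event */
  if (completion_callback)
    completion_callback(param);   /* call back the native runtime */
  return 1;
}

/* Example callback: count completed asynchronous target tasks. */
static void count_completion(callback_param_t param) { ++*(int *)param; }

/* Returns 1 when the operation is held back until its dependence is
   satisfied and the callback fires exactly once afterwards. */
static int demo_async_order(void) {
  int completed = 0;
  sketch_event e1 = {0}, e2 = {0};
  sketch_event *deps[] = { &e1 };

  sketch_init_dep(count_completion);
  int early = sketch_try_async_op(1, deps, &e2, &completed);
  e1.done = 1;  /* the dependence is now satisfied */
  int late = sketch_try_async_op(1, deps, &e2, &completed);
  return early == 0 && late == 1 && e2.done == 1 && completed == 1;
}
```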

8 Target RTL synchronous interface


As can be derived from the previous sections, a target RTL must provide the following
capabilities:

• int32_t __tgt_rtl_number_of_devices() – return the number of available devices of the type
supported by the target RTL.
• int32_t __tgt_rtl_init_device(int32_t device_id) – initialize the specified device. In case of
success, return 0; otherwise, return an error code.
• __tgt_target_table *__tgt_rtl_load_binary(int32_t device_id, __tgt_device_image *image)
– pass an executable image section described by image to the specified device and prepare an
address table of target entities. In case of error, return NULL. Otherwise, return a pointer
to the built address table. Individual entries in the table may also be NULL when the
corresponding offload region is not supported on the target device (see previous Section).
• int32_t __tgt_rtl_is_valid_binary(__tgt_device_image *image) – return a nonzero
integer if the provided device image can be supported by the runtime. The functionality
is similar to comparing the result of __tgt_rtl_load_binary() to NULL. However, this is meant
to be a lightweight query to determine if the RTL is suitable for an image without having to
load the library, which can be expensive.
• void *__tgt_rtl_data_alloc(int32_t device_id, int64_t size) – allocate data of the specified
size on the particular target device. Return the address of the allocated data on the target,
which will be used by libomptarget.so to initialize the target data mapping structures. These
addresses are used to generate a table of target variables to pass to __tgt_rtl_run_target_region().
__tgt_rtl_data_alloc() returns NULL if an error occurred on the target device.


• int32_t __tgt_rtl_data_submit(int32_t device_id, void *target_ptr, void *host_ptr, int64_t
size) – pass the data content to the target device using the target address. In case of success,
return zero. Otherwise, return an error code.

• int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *host_ptr, void *target_ptr, int64_t
size) – retrieve the data content from the target device using its address. In case of success,
return zero. Otherwise, return an error code.

• int32_t __tgt_rtl_data_delete(int32_t device_id, void *target_ptr) – de-allocate the data
referenced by target_ptr on the device. In case of success, return zero. Otherwise, return an
error code.

• int32_t __tgt_rtl_run_target_region(int32_t device_id, void *target_entry_ptr, void
**target_vars_ptr, int32_t arg_num) – transfer control to the offloaded entry on the target
device; target_vars_ptr is a table that stores the target addresses of all variables used in the
target entry code. Entries in target_vars_ptr match the order of the variables passed in the
arg_host_ptr argument passed to the target region. In case of success, return zero. Otherwise,
return an error code.

• int32_t __tgt_rtl_run_target_team_region(int32_t device_id, void *target_entry_ptr, void
**target_vars_ptr, int32_t arg_num, int32_t num_teams, int32_t thread_limit) – transfer
control to the offloaded entry on the target device; target_vars_ptr is a table that stores the
target addresses of all variables used in the target team entry code. Entries in target_vars_ptr
match the order of the variables passed in the arg_host_ptr argument passed to the target
team region. In case of success, return zero. Otherwise, return an error code.

For each platform there can be a dedicated set of error numbers defined; libomptarget.so
only assumes that zero is returned in case of success, independently of the target device. If a given
system supports shared memory, the target RTL implementation of the __tgt_rtl_data_*() functions can
optionally take no action, as host and target device can operate on the same data.
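
To make the division of labour concrete, the sketch below implements a toy RTL whose "device" is ordinary host memory, and drives it in the order in which libomptarget.so would plausibly use the calls: allocate, submit, run, retrieve, delete. All sketch_rtl_* functions are hypothetical stand-ins modelled on the descriptions above. Two simplifications are assumptions of this sketch: the entry point is passed as a typed function pointer (sketch_entry_fn) where the interface uses a void *, and the kernel works on a fixed COUNT of elements.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { COUNT = 4 };

/* Typed entry point used by this sketch in place of void *. */
typedef void (*sketch_entry_fn)(void **target_vars, int32_t arg_num);

static void *sketch_rtl_data_alloc(int32_t device_id, int64_t size) {
  (void)device_id;
  return size > 0 ? malloc((size_t)size) : NULL;  /* NULL signals an error */
}

static int32_t sketch_rtl_data_submit(int32_t device_id, void *target_ptr,
                                      void *host_ptr, int64_t size) {
  (void)device_id;
  if (!target_ptr || !host_ptr || size < 0) return -1;
  memcpy(target_ptr, host_ptr, (size_t)size);     /* host to device */
  return 0;
}

static int32_t sketch_rtl_data_retrieve(int32_t device_id, void *host_ptr,
                                        void *target_ptr, int64_t size) {
  (void)device_id;
  if (!target_ptr || !host_ptr || size < 0) return -1;
  memcpy(host_ptr, target_ptr, (size_t)size);     /* device to host */
  return 0;
}

static int32_t sketch_rtl_data_delete(int32_t device_id, void *target_ptr) {
  (void)device_id;
  free(target_ptr);
  return 0;
}

static int32_t sketch_rtl_run_target_region(int32_t device_id,
                                            sketch_entry_fn entry,
                                            void **target_vars_ptr,
                                            int32_t arg_num) {
  (void)device_id;
  if (!entry) return -1;
  entry(target_vars_ptr, arg_num);  /* transfer control to the entry */
  return 0;
}

/* Offloaded entry: increments COUNT integers found at target_vars[0]. */
static void increment_entry(void **target_vars, int32_t arg_num) {
  if (arg_num < 1) return;
  int *data = (int *)target_vars[0];
  for (int i = 0; i < COUNT; ++i) ++data[i];
}

/* Run the full sequence; returns the sum of the updated host array,
   or -1 if any step reported an error. */
static int demo_offload_sum(void) {
  int host[COUNT] = {1, 2, 3, 4};
  int64_t bytes = (int64_t)sizeof(host);
  void *dev = sketch_rtl_data_alloc(0, bytes);
  if (!dev) return -1;

  void *vars[1] = { dev };  /* table of target addresses */
  int32_t rc = sketch_rtl_data_submit(0, dev, host, bytes);
  if (rc == 0) rc = sketch_rtl_run_target_region(0, increment_entry, vars, 1);
  if (rc == 0) rc = sketch_rtl_data_retrieve(0, host, dev, bytes);
  sketch_rtl_data_delete(0, dev);
  if (rc != 0) return -1;
  return host[0] + host[1] + host[2] + host[3];
}
```

On a shared-memory system the submit and retrieve steps could degenerate to no-ops, as noted above, because the entry would operate directly on the host data.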

9 Target RTL asynchronous interface


The asynchronous RTL interface augments the calls seen in Section 7 and Section 8. First, it
implements the RTL-specific calls related to offloading dependences.

• void __tgt_rtl_init_dep(__tgt_callback_fct completion_callback)

• int32_t __tgt_rtl_enforce_dep(int32_t from_device_id, int32_t to_device_id,
  __tgt_event_t completion_event)

• void __tgt_rtl_release_dep(__tgt_sync_event_t completion_event)

• int32_t __tgt_rtl_process_dep()

Second, the asynchronous RTL interface provides the following asynchronous data transfer and
offloading calls.

• int32_t __tgt_rtl_async_dep(int32_t device_id, int32_t num_dep_events,
  __tgt_event_t *dep_events, __tgt_event_t *completion_event)


• int32_t __tgt_rtl_async_data_submit(int32_t device_id, void *target_ptr, void *host_ptr,
  int64_t size, int32_t num_dep_events, __tgt_event_t *dep_events,
  __tgt_event_t *completion_event)

• int32_t __tgt_rtl_async_data_retrieve(int32_t device_id, void *host_ptr, void *target_ptr,
  int64_t size, int32_t num_dep_events, __tgt_event_t *dep_events,
  __tgt_event_t *completion_event)

• int32_t __tgt_rtl_async_run_target_region(int32_t device_id, void *target_entry_ptr,
  void **target_vars_ptr, int32_t arg_num, int32_t num_dep_events,
  __tgt_event_t *dep_events, __tgt_event_t *completion_event)

• int32_t __tgt_rtl_async_run_target_team_region(int32_t device_id, void *target_entry_ptr,
  void **target_vars_ptr, int32_t arg_num, int32_t num_teams, int32_t thread_limit,
  int32_t num_dep_events, __tgt_event_t *dep_events, __tgt_event_t *completion_event)

In the above list, the first call requires no computation or communication; it just enables
libomptarget.so to summarize all of the synchronization events (listed in dep_events) into a
single event, completion_event, which must occur after all listed events are completed. Note that
the dep_events synchronization events may now also include dependences on the completion of
data transfers initiated by libomptarget.so. For devices that support the concept of streams,
it is expected that the new operation will be queued up in the stream associated with the first
synchronization event in dep_events.
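
The summarizing behaviour of the first call can be modelled as follows. The event representation (sketch_event with an explicit dependence list), the MAX_DEPS limit and the helper names are assumptions made for illustration; an actual RTL would map events onto whatever primitive its device runtime provides.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_DEPS = 8 };  /* arbitrary limit chosen for this sketch */

/* Hypothetical event: signaled when its own operation has finished, and
   complete only when every event it summarizes is complete as well. */
typedef struct sketch_event {
  int signaled;
  int32_t num_deps;
  struct sketch_event *deps[MAX_DEPS];
} sketch_event;

/* Modelled on the description of the first call: no computation or
   communication is issued, the listed events are only recorded so that
   completion_event occurs after all of them. */
static int32_t sketch_async_dep(int32_t num_dep_events,
                                sketch_event **dep_events,
                                sketch_event *completion_event) {
  if (num_dep_events < 0 || num_dep_events > MAX_DEPS)
    return -1;
  completion_event->signaled = 1;  /* there is no work of its own */
  completion_event->num_deps = num_dep_events;
  for (int32_t i = 0; i < num_dep_events; ++i)
    completion_event->deps[i] = dep_events[i];
  return 0;
}

/* An event is complete once it is signaled and all its deps are complete. */
static int sketch_event_complete(const sketch_event *e) {
  if (!e->signaled)
    return 0;
  for (int32_t i = 0; i < e->num_deps; ++i)
    if (!sketch_event_complete(e->deps[i]))
      return 0;
  return 1;
}

/* Returns 1 when the summary event completes only after both inputs do. */
static int demo_summary_event(void) {
  sketch_event a = {0}, b = {0}, all = {0};
  sketch_event *deps[] = { &a, &b };
  if (sketch_async_dep(2, deps, &all) != 0)
    return -1;
  a.signaled = 1;
  int before = sketch_event_complete(&all);  /* b is still pending */
  b.signaled = 1;
  int after = sketch_event_complete(&all);
  return before == 0 && after == 1;
}
```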
