OpenMP Application Programming Interface
Examples
Version 5.0.1 – June 2020
Source codes for the OpenMP 5.0.1 Examples can be downloaded from GitHub.
Contents

Foreword
Introduction
Examples
1. Parallel Execution
   1.1. A Simple Parallel Loop
   1.2. The parallel Construct
   1.3. teams Construct on Host
   1.4. Controlling the Number of Threads on Multiple Nesting Levels
   1.5. Interaction Between the num_threads Clause and omp_set_dynamic
   1.6. Fortran Restrictions on the do Construct
   1.7. The nowait Clause
   1.8. The collapse Clause
   1.9. linear Clause in Loop Constructs
   1.10. The parallel sections Construct
   1.11. The firstprivate Clause and the sections Construct
   1.12. The single Construct
   1.13. The workshare Construct
   1.14. The master Construct
   1.15. The loop Construct
   1.16. Parallel Random Access Iterator Loop
   1.17. The omp_set_dynamic and omp_set_num_threads Routines
   1.18. The omp_get_num_threads Routine
2. OpenMP Affinity
   2.1. The proc_bind Clause
      2.1.1. Spread Affinity Policy
      2.1.2. Close Affinity Policy
      2.1.3. Master Affinity Policy
   2.2. Task Affinity
   2.3. Affinity Display
   2.4. Affinity Query Functions
3. Tasking
   3.1. The task and taskwait Constructs
   3.2. Task Priority
   3.3. Task Dependences
      3.3.1. Flow Dependence
      3.3.2. Anti-dependence
      3.3.3. Output Dependence
      3.3.4. Concurrent Execution with Dependences
      3.3.5. Matrix Multiplication
      3.3.6. taskwait with Dependences
      3.3.7. Mutually Exclusive Execution with Dependences
      3.3.8. Multidependences Using Iterators
      3.3.9. Dependence for Undeferred Tasks
   3.4. The taskgroup Construct
   3.5. The taskyield Construct
   3.6. The taskloop Construct
   3.7. The parallel master taskloop Construct
4. Devices
   4.1. target Construct
      4.1.1. target Construct on parallel Construct
      4.1.2. target Construct with map Clause
      4.1.3. map Clause with to/from map-types
      4.1.4. map Clause with Array Sections
      4.1.5. target Construct with if Clause
      4.13.2. nowait Clause on target Construct
      4.13.3. Asynchronous target with nowait and depend Clauses
   4.14. Device Routines
      4.14.1. omp_is_initial_device Routine
      4.14.2. omp_get_num_devices Routine
      4.14.3. omp_set_default_device and omp_get_default_device Routines
      4.14.4. Target Memory and Device Pointers Routines
5. SIMD
   5.1. simd and declare simd Constructs
   5.2. inbranch and notinbranch Clauses
   5.3. Loop-Carried Lexical Forward Dependence
   5.4. ref, val, uval Modifiers for linear Clause
6. Synchronization
   6.1. The critical Construct
   6.2. Worksharing Constructs Inside a critical Construct
   6.3. Binding of barrier Regions
   6.4. The atomic Construct
   6.5. Restrictions on the atomic Construct
   6.6. The flush Construct without a List
   6.7. Synchronization Based on Acquire/Release Semantics
   6.8. The ordered Clause and the ordered Construct
   6.9. The depobj Construct
   6.10. Doacross Loop Nest
   6.11. Lock Routines
      6.11.1. The omp_init_lock Routine
      6.11.2. The omp_init_lock_with_hint Routine
      6.11.3. Ownership of Locks
      6.11.4. Simple Lock Routines
      6.11.5. Nestable Lock Routines
   9.9. Restrictions on Nesting of Regions
   9.10. Target Offload
Foreword

The OpenMP Examples document has been updated with new features found in the OpenMP 5.0 Specification. The additional examples and updates are referenced in the Document Revision History appendix.

Text describing an example with a 5.0 feature explicitly states that support for the feature begins in the OpenMP 5.0 Specification. Also, an omp_5.0 keyword has been added to the metadata in the source code. These distinctions remind readers that a 5.0-compliant OpenMP implementation is necessary to use these features in code.

Examples for most of the 5.0 features are included in this document, and incremental releases will become available as more feature examples and updates are submitted and approved by the OpenMP Examples Subcommittee.
Introduction
This collection of programming examples supplements the OpenMP API for Shared Memory Parallelization specifications, and is not part of the formal specifications. It assumes familiarity with the OpenMP specifications, and shares the typographical conventions used in that document.

The OpenMP API specification provides a model for parallel programming that is portable across shared memory architectures from different vendors. Compilers from numerous vendors support the OpenMP API.

The directives, library routines, and environment variables demonstrated in this document allow users to create and manage parallel programs while permitting portability. The directives extend the C, C++ and Fortran base languages with single program multiple data (SPMD) constructs, tasking constructs, device constructs, worksharing constructs, and synchronization constructs, and they provide support for sharing and privatizing data. Library routines and environment variables provide the functionality to control the runtime environment. Compilers that support the OpenMP API typically include a command line option that enables interpretation of all OpenMP directives.

The latest source codes for OpenMP Examples can be downloaded from the sources directory at https://github.com/OpenMP/Examples. The codes for this OpenMP 5.0.1 Examples document have the tag v5.0.1.

Complete information about the OpenMP API and a list of the compilers that support the OpenMP API can be found at the OpenMP.org web site: http://www.openmp.org
Examples

The following are examples of the OpenMP API directives, constructs, and routines.
C / C++
A statement following a directive is compound only when necessary, and a non-compound statement is indented with respect to a directive preceding it.
C / C++
Each example is labeled as ename.seqno.ext, where ename is the example name, seqno is the sequence number in a section, and ext is the source file extension that indicates the code type and source form. ext is one of the following:

• c – C code,
• cpp – C++ code,
• f – Fortran code in fixed form, and
• f90 – Fortran code in free form.

Some example labels may include version information (omp_verno) to indicate features that are illustrated by an example for a specific OpenMP version, such as "scan.1.c (omp_5.0)".
CHAPTER 1
Parallel Execution
A single thread, the initial thread, begins sequential execution of an OpenMP-enabled program, as if the whole program were enclosed in an implicit parallel region consisting of an implicit task executed by the initial thread.

A parallel construct encloses code, forming a parallel region. An initial thread encountering a parallel region forks (creates) a team of threads at the beginning of the parallel region, and joins them (removes from execution) at the end of the region. The initial thread becomes the master thread of the team in a parallel region, with a thread number equal to zero; the other threads are numbered from 1 to the number of threads minus 1. A team may consist of just a single thread.

Each thread of a team is assigned an implicit task consisting of the code within the parallel region. The task that creates a parallel region is suspended while the tasks of the team are executed. A thread is tied to its task; that is, only the thread assigned to the task can execute that task. After completion of the parallel region, the master thread resumes execution of the generating task.

Any task within a parallel region is allowed to encounter another parallel region to form a nested parallel region. The parallelism of a nested parallel region (whether it forks additional threads, or is executed serially by the encountering task) can be controlled by the OMP_NESTED environment variable or the omp_set_nested() API routine with arguments indicating true or false.

The number of threads of a parallel region can be set by the OMP_NUM_THREADS environment variable, the omp_set_num_threads() routine, or on the parallel directive with the num_threads clause. The routine overrides the environment variable, and the clause overrides both. Use the OMP_DYNAMIC environment variable or the omp_set_dynamic() function to specify that the OpenMP implementation may dynamically adjust the number of threads for parallel regions. The default setting for dynamic adjustment is implementation defined. When dynamic adjustment is on and the number of threads is specified, the specified number becomes an upper limit on the number of threads to be provided by the OpenMP runtime.
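The following minimal sketch (not one of the official examples in this document) illustrates this precedence: with dynamic adjustment disabled, a call to omp_set_num_threads() determines the team size, and a num_threads clause overrides the routine for one particular region.

#include <stdio.h>
#include <omp.h>

int main(void)
{
   omp_set_dynamic(0);      // disable dynamic adjustment
   omp_set_num_threads(4);  // routine overrides OMP_NUM_THREADS

   #pragma omp parallel     // team of 4 threads, from the routine
   {
      #pragma omp master
      printf("team size = %d\n", omp_get_num_threads());
   }

   #pragma omp parallel num_threads(2) // clause overrides the routine
   {
      #pragma omp master
      printf("team size = %d\n", omp_get_num_threads());
   }
   return 0;
}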
WORKSHARING CONSTRUCTS

A worksharing construct distributes the execution of the associated region among the members of the team that encounter it. There is an implied barrier at the end of the worksharing region (there is no barrier at the beginning). The worksharing constructs are:

• loop constructs: for and do
• sections
• single
• workshare

The for and do constructs (loop constructs) create a region consisting of a loop. A loop controlled by a loop construct is called an associated loop. Nested loops can form a single region when the collapse clause (with an integer argument) designates the number of associated loops to be executed in parallel, by forming a "single iteration space" for the specified number of nested loops. The ordered clause can also control multiple associated loops.

An associated loop must adhere to a "canonical form" (specified in the Canonical Loop Form section of the OpenMP Specifications document), which allows the iteration count (of all associated loops) to be computed before the (outermost) loop is executed. Most common loops comply with the canonical form, including loops over C++ random access iterators.

A single construct forms a region in which only one thread (any one of the team) executes the region. The other threads wait at the implied barrier at the end, unless the nowait clause is specified.

The sections construct forms a region that contains one or more structured blocks. Each block of a sections directive is constructed with a section construct, and executed once by one of the threads (any one) in the team. (If only one block is formed in the region, the section construct, which is used to separate blocks, is not required.) The other threads wait at the implied barrier at the end, unless the nowait clause is specified.

The workshare construct is a Fortran feature that consists of a region with a single structured block (section of code). Statements in the workshare region are divided into units of work and executed (once) by threads of the team.

MASTER CONSTRUCT

The master construct is not a worksharing construct. The master region is executed only by the master thread. There is no implicit barrier (or flush) at the end of the master region; hence the other threads of the team continue execution past the statements of the master region.
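A minimal sketch (not one of the official examples in this document) of the master construct: only the master thread executes the block, and since there is no implied barrier the other threads continue without waiting.

#include <stdio.h>
#include <omp.h>

int main(void)
{
   #pragma omp parallel
   {
      #pragma omp master
      printf("only the master thread (%d) prints this\n",
             omp_get_thread_num());

      // No implied barrier above: the other threads reach this
      // statement without waiting for the master region.
      printf("thread %d continues\n", omp_get_thread_num());
   }
   return 0;
}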
The following example is non-conforming because the matching do directive for the end do does not precede the outermost loop:

Example fort_do.2.f

      SUBROUTINE WORK(I, J)
      INTEGER I,J
      END SUBROUTINE WORK

      SUBROUTINE DO_WRONG
        INTEGER I, J

        DO 100 I = 1,10
!$OMP     DO
          DO 100 J = 1,10
            CALL WORK(I,J)
100     CONTINUE
!$OMP   ENDDO
      END SUBROUTINE DO_WRONG
In the following example, the barrier at the end of the first workshare region is eliminated with a nowait clause. Threads doing CC = DD immediately begin work on EE = FF when they are done with CC = DD.

Example workshare.2.f

      SUBROUTINE WSHARE2(AA, BB, CC, DD, EE, FF, N)
      INTEGER N
      REAL AA(N,N), BB(N,N), CC(N,N)
      REAL DD(N,N), EE(N,N), FF(N,N)

!$OMP PARALLEL
!$OMP WORKSHARE
      AA = BB
      CC = DD
!$OMP END WORKSHARE NOWAIT
!$OMP WORKSHARE
      EE = FF
!$OMP END WORKSHARE
!$OMP END PARALLEL
      END SUBROUTINE WSHARE2
The following example shows the use of an atomic directive inside a workshare construct. The computation of SUM(AA) is workshared, but the update to R is atomic.

Example workshare.3.f

      SUBROUTINE WSHARE3(AA, BB, CC, DD, N)
      INTEGER N
      REAL AA(N,N), BB(N,N), CC(N,N), DD(N,N)
      REAL R
      R=0
!$OMP PARALLEL
!$OMP WORKSHARE
      AA = BB
!$OMP ATOMIC UPDATE
      R = R + SUM(AA)
      CC = DD
!$OMP END WORKSHARE
!$OMP END PARALLEL
      END SUBROUTINE WSHARE3
Fortran WHERE and FORALL statements are compound statements, made up of a control part and a statement part. When workshare is applied to one of these compound statements, both the control and the statement parts are workshared. The following example shows the use of a WHERE statement in a workshare construct.

Each task gets worked on in order by the threads:

AA = BB, then
CC = DD, then
EE .ne. 0, then
FF = 1 / EE, then
GG = HH

Example workshare.4.f

      SUBROUTINE WSHARE4(AA, BB, CC, DD, EE, FF, GG, HH, N)
      INTEGER N
      REAL AA(N,N), BB(N,N), CC(N,N)
      REAL DD(N,N), EE(N,N), FF(N,N)
      REAL GG(N,N), HH(N,N)

!$OMP PARALLEL
!$OMP WORKSHARE
      AA = BB
      CC = DD
      WHERE (EE .ne. 0) FF = 1 / EE
      GG = HH
!$OMP END WORKSHARE
!$OMP END PARALLEL

      END SUBROUTINE WSHARE4
In the following example, an assignment to a shared scalar variable is performed by one thread in a workshare while all other threads in the team wait.

Example workshare.5.f

      SUBROUTINE WSHARE5(AA, BB, CC, DD, N)
      INTEGER N
      REAL AA(N,N), BB(N,N), CC(N,N), DD(N,N)

      INTEGER SHR

!$OMP PARALLEL SHARED(SHR)
!$OMP WORKSHARE
      AA = BB
      SHR = 1
      CC = DD * SHR
!$OMP END WORKSHARE
!$OMP END PARALLEL

      END SUBROUTINE WSHARE5
The following example contains an assignment to a private scalar variable, which is performed by one thread in a workshare while all other threads wait. It is non-conforming because the private scalar variable is undefined after the assignment statement.

Example workshare.6.f

      SUBROUTINE WSHARE6_WRONG(AA, BB, CC, DD, N)
      INTEGER N
      REAL AA(N,N), BB(N,N), CC(N,N), DD(N,N)

      INTEGER PRI

!$OMP PARALLEL PRIVATE(PRI)
!$OMP WORKSHARE
      AA = BB
      PRI = 1
      CC = DD * PRI
!$OMP END WORKSHARE
!$OMP END PARALLEL
      END SUBROUTINE WSHARE6_WRONG
CHAPTER 2
OpenMP Affinity

OpenMP Affinity consists of a proc_bind policy (thread affinity policy) and a specification of places ("location units" or processors that may be cores, hardware threads, sockets, etc.). OpenMP Affinity enables users to bind computations to specific places. The placement will hold for the duration of the parallel region. However, the runtime is free to migrate the OpenMP threads to different cores (hardware threads, sockets, etc.) within a given place, if two or more cores (hardware threads, sockets, etc.) have been assigned to that place.

Often the binding can be managed without resorting to explicitly setting places. Without the specification of places in the OMP_PLACES variable, the OpenMP runtime will distribute and bind threads using the entire range of processors for the OpenMP program, according to the OMP_PROC_BIND environment variable or the proc_bind clause. When places are specified, the OpenMP runtime binds threads to the places according to a default distribution policy, or according to the OMP_PROC_BIND environment variable or the proc_bind clause.

In the OpenMP Specifications document a processor refers to an execution unit that is enabled for an OpenMP thread to use. A processor is a core when there is no SMT (Simultaneous Multi-Threading) support or SMT is disabled. When SMT is enabled, a processor is a hardware thread (HW-thread). (This is the usual case; but actually, the execution unit is implementation defined.) Processors are numbered sequentially from 0 to the number of cores less one (without SMT), or from 0 to the number of HW-threads less one (with SMT). OpenMP places use the processor number to designate binding locations (unless an "abstract name" is used).

The processors available to a process may be a subset of the system's processors. This restriction may be the result of a wrapper process controlling the execution (such as numactl on Linux systems), compiler options, library-specific environment variables, or default kernel settings. For instance, multiple MPI processes launched on a single compute node will each have a subset of processors, as determined by the MPI launcher or set by MPI affinity environment variables for the MPI library.

Threads of a team are positioned onto places in a compact manner, a scattered distribution, or onto the master's place, by setting the OMP_PROC_BIND environment variable or the proc_bind clause to close, spread, or master, respectively. When OMP_PROC_BIND is set to FALSE no binding is enforced; and when the value is TRUE, the binding is implementation defined to a set of places in the OMP_PLACES variable, or to places defined by the implementation if the OMP_PLACES variable is not set.

The OMP_PLACES variable can also be set to an abstract name (threads, cores, sockets) to specify that a place is either a single hardware thread, a core, or a socket, respectively. This description of OMP_PLACES is most useful when the number of threads is equal to the number of hardware threads, cores, or sockets. It can also be used with a close or spread distribution policy when the equality does not hold.
The following equivalent place list declarations consist of eight places (which we designate as p0 to p7):

OMP_PLACES="{0,1},{2,3},{4,5},{6,7},{8,9},{10,11},{12,13},{14,15}"

or

OMP_PLACES="{0:2}:8:2"
CHAPTER 3
Tasking

Tasking constructs provide units of work to a thread for execution. Worksharing constructs do this, too (e.g. the for, do, sections, and single constructs); but the work units are tightly controlled by an iteration limit and limited scheduling, or a limited number of sections or single regions. Worksharing was designed with "data parallel" computing in mind. Tasking was designed for "task parallel" computing and often involves non-locality or irregularity in memory access.

The task construct can be used to execute work chunks: in a while loop; while traversing nodes in a list; at nodes in a tree graph; or in a normal loop (with a taskloop construct). Unlike the statically scheduled loop iterations of worksharing, a task is often enqueued, and then dequeued for execution by any of the threads of the team within a parallel region. The generation of tasks can be from a single generating thread (creating sibling tasks), or from multiple generators in a recursive tree traversal. A taskloop construct bundles iterations of an associated loop into tasks, and provides controls similar to those found in the task construct.

Sibling tasks are synchronized by the taskwait construct, and tasks and their descendent tasks can be synchronized by containing them in a taskgroup region. Ordered execution is accomplished by specifying dependences with a depend clause. Also, priorities can be specified as hints to the scheduler through a priority clause.
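As a minimal sketch (not one of the official examples in this document; the work routine is hypothetical), the following C code uses a taskgroup region to wait for a task and all of its descendant tasks, in contrast to taskwait, which waits only for child tasks.

#include <stdio.h>

void work(int n)
{
   if (n > 0) {
      #pragma omp task    // descendant task
      work(n - 1);
   }
   printf("work(%d)\n", n);
}

int main(void)
{
   #pragma omp parallel
   #pragma omp single
   {
      #pragma omp taskgroup
      {
         #pragma omp task
         work(3);
      } // waits here for work(3) and all of its descendants

      printf("all tasks of the taskgroup have completed\n");
   }
   return 0;
}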
Various clauses can be used to manage and optimize task generation, to reduce the overhead of execution, and to relinquish control of threads for work balance and forward progress.

Once a thread starts executing a task, it is the designated thread for executing the task to completion, even though it may leave the execution at a scheduling point and return later; the thread is tied to the task. Scheduling points can be introduced with the taskyield construct. With an untied clause any other thread is allowed to continue the task. An if clause with an expression that evaluates to false results in an undeferred task, which instructs the runtime to suspend the generating task until the undeferred task completes its execution. Task generation overhead can be reduced by merging the data environment of the generating task into the generated task, with the mergeable and final clauses.
A complete list of the tasking constructs and details of their clauses can be found in the Tasking Constructs chapter of the OpenMP Application Programming Interface specification.
!$OMP TASK ! P is firstprivate by default
         CALL traverse(P%right)
!$OMP END TASK
      ENDIF
      CALL process ( P )

   END SUBROUTINE
Fortran
In the next example, we force a postorder traversal of the tree by adding a taskwait directive. Now, we can safely assume that the left and right sons have been executed before we process the current node.
C / C++
Example tasking.2.c (omp_3.0)
struct node {
   struct node *left;
   struct node *right;
};
extern void process(struct node *);
void postorder_traverse( struct node *p ) {
   if (p->left)
      #pragma omp task // p is firstprivate by default
      postorder_traverse(p->left);
   if (p->right)
      #pragma omp task // p is firstprivate by default
      postorder_traverse(p->right);
   #pragma omp taskwait
   process(p);
}
C / C++
   #pragma omp single
   {
      node * p = head;
      while (p) {
         #pragma omp task
         // p is firstprivate by default
         process(p);
         p = p->next;
      }
   }
}
}
C / C++
Fortran
Example tasking.3.f90 (omp_3.0)
MODULE LIST
   TYPE NODE
      INTEGER :: PAYLOAD
      TYPE (NODE), POINTER :: NEXT
   END TYPE NODE
CONTAINS

   SUBROUTINE PROCESS(p)
      TYPE (NODE), POINTER :: P
      ! do work here
   END SUBROUTINE

   SUBROUTINE INCREMENT_LIST_ITEMS (HEAD)

      TYPE (NODE), POINTER :: HEAD
      TYPE (NODE), POINTER :: P
!$OMP PARALLEL PRIVATE(P)
!$OMP SINGLE
      P => HEAD
      DO
!$OMP TASK
         ! P is firstprivate by default
         CALL PROCESS(P)
!$OMP END TASK
         P => P%NEXT
         IF ( .NOT. ASSOCIATED (P) ) EXIT
      END DO
!$OMP END SINGLE
!$OMP END PARALLEL

   END SUBROUTINE

END MODULE LIST
Note: There are more efficient algorithms for computing Fibonacci numbers. This classic recursion algorithm is for illustrative purposes.

The following example demonstrates a way to generate a large number of tasks with one thread and execute them with the threads in the team. While generating these tasks, the implementation may reach its limit on unassigned tasks. If it does, the implementation is allowed to cause the thread executing the task generating loop to suspend its task at the task scheduling point in the task directive, and start executing unassigned tasks. Once the number of unassigned tasks is sufficiently low, the thread may resume execution of the task generating loop.
C / C++
Example tasking.5.c (omp_3.0)
#define LARGE_NUMBER 10000000
double item[LARGE_NUMBER];
extern void process(double);

int main()
{
   #pragma omp parallel
   {
      #pragma omp single
      {
         int i;
         for (i=0; i<LARGE_NUMBER; i++)
            #pragma omp task // i is firstprivate, item is shared
            process(item[i]);
      }
   }
}
C / C++
Fortran
Example tasking.5.f (omp_3.0)
      real*8 item(10000000)
      integer i

!$omp parallel
!$omp single ! loop iteration variable i is private
      do i=1,10000000
!$omp task
         ! i is firstprivate, item is shared
         call process(item(i))
!$omp end task
      end do
!$omp end single
!$omp end parallel
      end
Fortran
Example tasking.6.f (omp_3.0)
      real*8 item(10000000)
!$omp parallel
!$omp single
!$omp task untied
      ! loop iteration variable i is private
      do i=1,10000000
!$omp task ! i is firstprivate, item is shared
         call process(item(i))
!$omp end task
      end do
!$omp end task
!$omp end single
!$omp end parallel
      end
Fortran
The following two examples demonstrate how the scheduling rules illustrated in Section 2.11.3 of the OpenMP 4.0 specification affect the usage of threadprivate variables in tasks. A threadprivate variable can be modified by another task that is executed by the same thread. Thus, the value of a threadprivate variable cannot be assumed to be unchanged across a task scheduling point. In untied tasks, task scheduling points may be added in any place by the implementation.

A task switch may occur at a task scheduling point. A single thread may execute both of the task regions that modify tp. The parts of these task regions in which tp is modified may be executed in any order, so the resulting value of var can be either 1 or 2.
C / C++
Example tasking.7.c (omp_3.0)
int tp;
#pragma omp threadprivate(tp)
int var;
void work()
{
   #pragma omp task
   {
      /* do work here */
      #pragma omp task
      {
         tp = 1;
         /* do work here */
         #pragma omp task
C / C++
Example tasking.8.c (omp_3.0)
int tp;
#pragma omp threadprivate(tp)
int var;
void work()
{
   #pragma omp parallel
   {
      /* do work here */
      #pragma omp task
      {
         tp++;
         /* do work here */
         #pragma omp task
         {
            /* do work here but don't modify tp */
         }
         var = tp; //Value does not change after write above
      }
   }
}
C / C++
Fortran
Example tasking.8.f (omp_3.0)
      module example
      integer tp
!$omp threadprivate(tp)
      integer var
      contains
      subroutine work
!$omp parallel
      ! do work here
!$omp task
      tp = tp + 1
      ! do work here
!$omp task
      ! do work here but don't modify tp
!$omp end task
      var = tp ! value does not change after write above
!$omp end task
!$omp end parallel
      end subroutine
      end module
Fortran
Example tasking.9.f (omp_3.0)
      module example
      contains
      subroutine work
!$omp task
      ! Task 1
!$omp task
      ! Task 2
!$omp critical
      ! Critical region 1
      ! do work here
!$omp end critical
!$omp end task
!$omp critical
      ! Critical region 2
      ! Capture data for the following task
!$omp task
      ! Task 3
      ! do work here
!$omp end task
!$omp end critical
!$omp end task
      end subroutine
      end module
Fortran
In the following example, lock is held across a task scheduling point. However, according to the scheduling restrictions, the executing thread can't begin executing one of the non-descendant tasks that also acquires lock before the task region is complete. Therefore, no deadlock is possible.
C / C++
Example tasking.10.c (omp_3.0)
#include <omp.h>
void work() {
   omp_lock_t lock;
   omp_init_lock(&lock);
   #pragma omp parallel
   {
      int i;
      #pragma omp for
      for (i = 0; i < 100; i++) {
         #pragma omp task
         {
            // lock is shared by default in the task
The following examples illustrate the use of the mergeable clause in the task construct. In this first example, the task construct has been annotated with the mergeable clause. The addition of this clause allows the implementation to reuse the data environment (including the ICVs) of the parent task for the task inside foo if the task is included or undeferred. Thus, the result of the execution may differ depending on whether the task is merged or not. Therefore the mergeable clause needs to be used with caution. In this example, the use of the mergeable clause is safe: as x is a shared variable, the outcome does not depend on whether or not the task is merged (that is, the task will always increment the same variable and will always compute the same value for x).
C / C++
Example tasking.11.c (omp_3.1)
#include <stdio.h>
void foo ( )
{
   int x = 2;
   #pragma omp task shared(x) mergeable
   {
      x++;
   }
   #pragma omp taskwait
   printf("%d\n",x); // prints 3
}
C / C++
Fortran
Example tasking.11.f90 (omp_3.1)
subroutine foo()
   integer :: x
   x = 2
!$omp task shared(x) mergeable
   x = x + 1
!$omp end task
!$omp taskwait
   print *, x ! prints 3
end subroutine
Fortran
This second example shows an incorrect use of the mergeable clause. In this example, the created task will access different instances of the variable x if the task is not merged, as x is firstprivate, but it will access the same variable x if the task is merged. As a result, the behavior of the program is unspecified and it can print two different values for x depending on the decisions taken by the implementation.
C / C++
Example tasking.13.c (omp_3.1)
#include <string.h>
#include <omp.h>
#define LIMIT 3 /* arbitrary limit on recursion depth */
void check_solution(char *);
void bin_search (int pos, int n, char *state)
{
   if ( pos == n ) {
      check_solution(state);
      return;
   }
   #pragma omp task final( pos > LIMIT ) mergeable
   {
      char new_state[n];
      if (!omp_in_final() ) {
         memcpy(new_state, state, pos );
         state = new_state;
      }
      state[pos] = 0;
      bin_search(pos+1, n, state );
   }
   #pragma omp task final( pos > LIMIT ) mergeable
   {
      char new_state[n];
      if (! omp_in_final() ) {
         memcpy(new_state, state, pos );
         state = new_state;
      }
      state[pos] = 1;
      bin_search(pos+1, n, state );
   }
   #pragma omp taskwait
}
C / C++
C / C++
Example tasking.14.c (omp_3.1)
void bar(void);

void foo ( )
{
   int i;
   #pragma omp task if(0) // This task is undeferred
   {
      #pragma omp task // This task is a regular task
      for (i = 0; i < 3; i++) {
         #pragma omp task // This task is a regular task
         bar();
      }
   }
   #pragma omp task final(1) // This task is a regular task
   {
      #pragma omp task // This task is included
      for (i = 0; i < 3; i++) {
         #pragma omp task // This task is also included
         bar();
      }
   }
}
C / C++
Fortran
Example tasking.14.f90 (omp_3.1)
subroutine foo()
   integer i
!$omp task if(.FALSE.) ! This task is undeferred
!$omp task ! This task is a regular task
   do i = 1, 3
!$omp task ! This task is a regular task
      call bar()
!$omp end task
   enddo
!$omp end task
!$omp end task
!$omp task final(.TRUE.) ! This task is a regular task
!$omp task ! This task is included
   do i = 1, 3
!$omp task ! This task is also included
      call bar()
!$omp end task
   enddo
!$omp end task
!$omp end task
end subroutine
3.2 Task Priority
In this example we compute arrays in a matrix through a compute_array routine. Each task has a priority value equal to the value of the loop variable i at the moment of its creation. A higher priority on a task means that the task is a candidate to run sooner.

The creation of tasks occurs in ascending order (according to the iteration space of the loop), but a hint, by means of the priority clause, is provided to reverse the execution order.
C / C++
Example task_priority.1.c (omp_4.5)
void compute_array (float *node, int M);

void compute_matrix (float *array, int N, int M)
{
   int i;
   #pragma omp parallel private(i)
   #pragma omp single
   {
      for (i=0;i<N; i++) {
         #pragma omp task priority(i)
         compute_array(&array[i*M], M);
      }
   }
}
C / C++
Fortran
Example task_priority.1.f90 (omp_4.5)
subroutine compute_matrix(matrix, M, N)
   implicit none
   integer :: M, N
   real :: matrix(M, N)
   integer :: i
   interface
      subroutine compute_array(node, M)
         implicit none
         integer :: M
         real :: node(M)
      end subroutine
   end interface
!$omp parallel private(i)
!$omp single
   do i=1,N
!$omp task priority(i)
      call compute_array(matrix(:, i), M)
!$omp end task
   end do
!$omp end single
!$omp end parallel
end subroutine
3.3 Task Dependences

3.3.1 Flow Dependence

This example shows a simple flow dependence using a depend clause on the task construct.
C / C++
Example task_dep.1.c (omp_4.0)
#include <stdio.h>
int main() {
   int x = 1;
   #pragma omp parallel
   #pragma omp single
   {
      #pragma omp task shared(x) depend(out: x)
      x = 2;
      #pragma omp task shared(x) depend(in: x)
      printf("x = %d\n", x);
   }
   return 0;
}
C / C++
Fortran
Example task_dep.1.f90 (omp_4.0)
program example
   integer :: x
   x = 1
!$omp parallel
!$omp single
!$omp task shared(x) depend(out: x)
   x = 2
!$omp end task
!$omp task shared(x) depend(in: x)
   print*, "x = ", x
!$omp end task
!$omp end single
!$omp end parallel
end program
Fortran
The program will always print "x = 2", because the depend clauses enforce the ordering of the tasks. If the depend clauses had been omitted, then the tasks could execute in any order and the program would have a race condition.
3.3.3 Output Dependence

This example shows an output dependence using the depend clause on the task construct.
C / C++
Example task_dep.3.c (omp_4.0)
#include <stdio.h>
int main() {
   int x;
   #pragma omp parallel
   #pragma omp single
   {
      #pragma omp task shared(x) depend(out: x)
      x = 1;
      #pragma omp task shared(x) depend(out: x)
      x = 2;
      #pragma omp taskwait
      printf("x = %d\n", x);
   }
   return 0;
}
C / C++
Fortran
Example task_dep.3.f90 (omp_4.0)
program example
   integer :: x
!$omp parallel
!$omp single
!$omp task shared(x) depend(out: x)
   x = 1
!$omp end task
!$omp task shared(x) depend(out: x)
   x = 2
!$omp end task
!$omp taskwait
   print*, "x = ", x
!$omp end single
!$omp end parallel
end program
Fortran
The program will always print "x = 2", because the depend clauses enforce the ordering of the tasks. If the depend clauses had been omitted, then the tasks could execute in any order and the program would have a race condition.
!$omp end task

!$omp end single
!$omp end parallel
end program
Fortran
The last two tasks are dependent on the first task. However, there is no dependence between the last two tasks, which may execute in any order (or concurrently if more than one thread is available). Thus, the possible outputs are "x + 1 = 3. x + 2 = 4. " and "x + 2 = 4. x + 1 = 3. ". If the depend clauses had been omitted, then all of the tasks could execute in any order and the program would have a race condition.
In the first example the generating task waits at the taskwait construct for the completion of the first child task, because a dependence on the first task is produced by x with an in dependence type within the depend clause of the taskwait construct. Immediately after the first taskwait construct it is safe to access the x variable by the generating task, as shown in the print statement. There is no completion restraint on the second child task. Hence, immediately after the first taskwait it is unsafe to access the y variable since the second child task may still be executing. The second taskwait ensures that the second child task has completed; hence it is safe to access the y variable in the following print statement.
C / C++
Example task_dep.6.c (omp_5.0)
#include<stdio.h>

void foo()
{
   int x = 0, y = 2;

   #pragma omp task depend(inout: x) shared(x)
   x++; // 1st child task

   #pragma omp task shared(y)
   y--; // 2nd child task

   #pragma omp taskwait depend(in: x) // 1st taskwait

   printf("x=%d\n",x);

   // Second task may not be finished.
   // Accessing y here will create a race condition.

   #pragma omp taskwait // 2nd taskwait

   printf("y=%d\n",y);
}

int main()
{
   #pragma omp parallel
   #pragma omp single
   foo();

   return 0;
}
C / C++
In this example, a dependence on only the first child task is produced by x with an in dependence type within the depend clause of the taskwait construct. The second taskwait (without a depend clause) is included to guarantee completion of the second task before y is accessed. (While unnecessary, the depend(inout: y) clause on the 2nd child task is included to illustrate how the child task dependences can be completely annotated in a data-flow model.)
C / C++
Example task_dep.7.c (omp_5.0)
#include<stdio.h>

void foo()
{
   int x = 0, y = 2;

   #pragma omp task depend(inout: x) shared(x)
   x++; // 1st child task

   #pragma omp task depend(in: x) depend(inout: y) shared(x, y)
   y -= x; // 2nd child task

   #pragma omp taskwait depend(in: x) // 1st taskwait

   printf("x=%d\n",x);

   // Second task may not be finished.
   // Accessing y here would create a race condition.

   #pragma omp taskwait // 2nd taskwait

   printf("y=%d\n",y);

}

int main()
{
   #pragma omp parallel
   #pragma omp single
   foo();

   return 0;
}
C / C++
The depend clause of the taskwait construct now includes an in dependence type for y. Hence the generating task must now wait on completion of any child task having y with an out (here inout) dependence type in its depend clause. So, the depend clause of the taskwait construct now constrains the second task to complete at the taskwait, too. (This change makes the second taskwait of the previous example unnecessary; it has been removed in this example.)

Note: While a taskwait construct ensures that all child tasks have completed, a depend clause on a taskwait construct only waits for the specific child tasks prescribed by the dependence type and list items in the taskwait's depend clause. This and the previous example illustrate the need to carefully determine the dependence type of variables in the taskwait depend clause when selecting the child tasks that the generating task must wait on, so that its execution after the taskwait does not produce race conditions on variables accessed by non-completed child tasks.
C / C++
Example task_dep.8.c (omp_5.0)
#include<stdio.h>

void foo()
{
   int x = 0, y = 2;

   #pragma omp task depend(inout: x) shared(x)
   x++; // 1st child task

   #pragma omp task depend(in: x) depend(inout: y) shared(x, y)
   y -= x; // 2nd child task

   #pragma omp taskwait depend(in: x,y)

   printf("x=%d\n",x);
   printf("y=%d\n",y);

}

int main()
{
   #pragma omp parallel
   #pragma omp single
   foo();

   return 0;
}
C / C++
CHAPTER 4
Devices

The target construct consists of a target directive and an execution region. The target region is executed on the default device or the device specified in the device clause.

In OpenMP version 4.0, by default, all variables within the lexical scope of the construct are copied to and from the device, unless the device is the host, or the data exists on the device from a previously executed data-type construct that has created space on the device and possibly copied host data to the device storage.

The constructs that explicitly create storage, transfer data, and free storage on the device are categorized as structured and unstructured. The target data construct is structured. It creates a data region around target constructs, and is convenient for providing persistent data throughout multiple target regions. The target enter data and target exit data constructs are unstructured, because they can occur anywhere and do not support a "structure" (a region) for enclosing target constructs, as the target data construct does.

The map clause is used on target constructs and the data-type constructs to map host data. It specifies device storage, data movement to and from the device, and controls on the storage duration.

There is an important change in the OpenMP 4.5 specification that alters the data model for scalar variables and C/C++ pointer variables. The default behavior for scalar variables and C/C++ pointer variables in a 4.5-compliant code is firstprivate. Example codes that have been updated to reflect this new behavior are annotated with a description of the changes required for correct execution. Often it is a simple matter of mapping the variable as tofrom to obtain the intended 4.0 behavior.
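As a sketch of that annotation (not one of the official examples in this document): without an explicit map, the scalar sum below is firstprivate in a 4.5-compliant compilation and its device-side update would be lost; mapping it tofrom restores the intended 4.0 behavior.

#include <stdio.h>

int main(void)
{
   int sum = 0;
   int a[100];
   for (int i = 0; i < 100; i++) a[i] = i;

   // Without map(tofrom: sum), sum would be firstprivate (4.5)
   // and the update would not be copied back to the host.
   #pragma omp target map(to: a) map(tofrom: sum)
   for (int i = 0; i < 100; i++)
      sum += a[i];

   printf("sum = %d\n", sum); // prints 4950
   return 0;
}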
In OpenMP version 4.5 the mechanism for target execution is specified as occurring through a target task. When the target construct is encountered, a new target task is generated. The target task completes after the target region has executed and all data transfers have finished.

This new specification does not affect the execution of pre-4.5 code; it is a necessary element for asynchronous execution of the target region when using the new nowait clause introduced in OpenMP 4.5.
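A minimal sketch (not one of the official examples in this document) of this asynchronous mechanism: the target task is deferred with nowait, and depend clauses order a host task that consumes the result after the offloaded computation.

#include <stdio.h>
#define N 1000

int main(void)
{
   double x[N];

   #pragma omp parallel
   #pragma omp single
   {
      // Deferred target task: the encountering thread does not wait.
      #pragma omp target map(from: x) depend(out: x) nowait
      for (int i = 0; i < N; i++) x[i] = 2.0 * i;

      // This host task runs only after the target task completes.
      #pragma omp task depend(in: x)
      printf("x[1] = %f\n", x[1]);
   }
   return 0;
}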
4.1 target Construct

4.1.1 target Construct on parallel Construct

The following example shows how the target construct offloads a code region to a target device. The variables p, v1, v2, and N are implicitly mapped to the target device.
C / C++
Example target.1.c (omp_4.0)
extern void init(float*, float*, int);
extern void output(float*, int);
void vec_mult(int N)
{
   int i;
   float p[N], v1[N], v2[N];
   init(v1, v2, N);
   #pragma omp target
   #pragma omp parallel for private(i)
   for (i=0; i<N; i++)
      p[i] = v1[i] * v2[i];
   output(p, N);
}
C / C++
Fortran
Example target.1.f90 (omp_4.0)
subroutine vec_mult(N)
   integer :: i,N
   real :: p(N), v1(N), v2(N)
   call init(v1, v2, N)
!$omp target
!$omp parallel do
   do i=1,N
      p(i) = v1(i) * v2(i)
   end do
!$omp end target
   call output(p, N)
end subroutine
Fortran
In the following example, the usual Fortran approach is used for dynamic memory. The p0, v1, and v2 arrays are allocated in the main program and passed as references from one routine to another. In vec_mult, p1, v3 and v4 are references to the p0, v1, and v2 arrays, respectively. The target construct's device data environment inherits the arrays p0, v1, and v2 from the enclosing target data construct.
CHAPTER 5
SIMD

Single instruction, multiple data (SIMD) is a form of parallel execution in which the same operation is performed on multiple data elements independently in hardware vector processing units (VPU), also called SIMD units. The addition of two vectors to form a third vector is a SIMD operation. Many processors have SIMD (vector) units that can simultaneously perform 2, 4, 8 or more executions of the same operation (by a single SIMD unit).

Loops without loop-carried backward dependences (or with dependences preserved using ordered simd) are candidates for vectorization by the compiler for execution with SIMD units. In addition, with state-of-the-art vectorization technology and the declare simd construct extensions for function vectorization in the OpenMP 4.5 specification, loops with function calls can be vectorized as well. The basic idea is that a scalar function call in a loop can be replaced by a vector version of the function, and the loop can be vectorized simultaneously by combining a loop vectorization (simd directive on the loop) and a function vectorization (declare simd directive on the function).

A simd construct states that SIMD operations will be performed on the data within the loop. A number of clauses are available to provide data-sharing attributes (private, linear, reduction and lastprivate). Other clauses provide vector length preference/restrictions (simdlen / safelen), loop collapsing (collapse), and data alignment (aligned).

The declare simd directive designates that a vector version of the function should also be constructed for execution within loops that contain the function and have a simd directive. Clauses provide argument specifications (linear, uniform, and aligned), a requested vector length (simdlen), and designate whether the function is always/never called conditionally in a loop (inbranch/notinbranch). The latter is for optimizing performance.

Also, the simd construct has been combined with the worksharing loop constructs (for simd and do simd) to enable simultaneous thread execution in different SIMD units.
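A minimal sketch (not one of the official examples in this document; the saxpy routine is hypothetical) of the combined form: iterations are distributed across the threads of the team, and each thread's chunk is executed with SIMD instructions.

void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
   // Distribute chunks across threads; vectorize within each chunk.
   #pragma omp parallel for simd
   for (int i = 0; i < n; i++)
      y[i] = a * x[i] + y[i];
}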
5.1 simd and declare simd Constructs

The following example illustrates the basic use of the simd construct to assure the compiler that the loop can be vectorized.
C / C++
Example SIMD.1.c (omp_4.0)
void star( double *a, double *b, double *c, int n, int *ioff )
{
   int i;
   #pragma omp simd
   for ( i = 0; i < n; i++ )
      a[i] *= b[i] * c[i+ *ioff];
}
C / C++
Fortran
Example SIMD.1.f90 (omp_4.0)
subroutine star(a,b,c,n,ioff_ptr)
   implicit none
   double precision :: a(*),b(*),c(*)
   integer :: n, i
   integer, pointer :: ioff_ptr

!$omp simd
   do i = 1,n
      a(i) = a(i) * b(i) * c(i+ioff_ptr)
   end do

end subroutine
Fortran
Fortran
Example linear_modifier.2.f90 (omp_4.5)
module m
   integer, parameter :: NN = 1023
   integer :: a(NN)

contains
   subroutine add_one2(p, i)
   !$omp declare simd(add_one2) linear(ref(p)) linear(uval(i))
      implicit none
      integer :: p
      integer, intent(in) :: i

      p = p + i
   end subroutine
end module

program main
   use m
   implicit none
   integer :: i, p

   do i = 1, NN
      a(i) = i
   end do

   p = 1
!$omp simd linear(p)
   do i = 1, NN
      call add_one2(a(p), i)
      p = p + 1
   end do

   do i = 1, NN
      if (a(i) /= i*2) then
         print *, "wrong result"
         stop
      endif
   end do
end program
CHAPTER 6
Synchronization

The barrier construct is a stand-alone directive that requires all threads of a team (within a contention group) to execute the barrier and complete execution of all tasks within the region, before continuing past the barrier.

The critical construct is a directive that contains a structured block. The construct allows only a single thread at a time to execute the structured block (region). Multiple critical regions may exist in a parallel region, and may act cooperatively (only one thread at a time in all critical regions), or separately (only one thread at a time in each critical region when a unique name is supplied on each critical construct). An optional (lock) hint clause may be specified on a named critical construct to provide the OpenMP runtime guidance in selecting a locking mechanism.

On a finer scale the atomic construct allows only a single thread at a time to have atomic access to a storage location involving a single read, write, update or capture statement, and a limited number of combinations when specifying the capture atomic-clause clause. The atomic-clause clause is required for some expression statements, but is not required for update statements. The memory-order clause can be used to specify the degree of memory ordering enforced by an atomic construct. From weakest to strongest, they are relaxed (the default), acquire and/or release clauses (specified with acquire, release, or acq_rel), and seq_cst. Please see the details in the atomic Construct subsection of the Directives chapter in the OpenMP Specifications document.
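A minimal sketch (not one of the official examples in this document) of an atomic capture: each thread atomically increments a shared counter and captures the pre-increment value, for example to claim a unique index.

#include <stdio.h>

int main(void)
{
   int counter = 0;

   #pragma omp parallel
   {
      int my_index;

      // Atomically read the old value and increment the counter.
      #pragma omp atomic capture
      my_index = counter++;

      printf("claimed index %d\n", my_index);
   }
   return 0;
}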
The ordered construct specifies a structured block in a loop, simd, or loop SIMD region that will be executed in the order of the loop iterations. The ordered construct sequentializes and orders the execution of ordered regions while allowing code outside the region to run in parallel.

Since OpenMP 4.5 the ordered construct can also be a stand-alone directive that specifies cross-iteration dependences in a doacross loop nest. The depend clause uses a sink dependence-type, along with an iteration vector argument (vec), to indicate the iteration that satisfies the dependence. The depend clause with a source dependence-type specifies dependence satisfaction.
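A minimal doacross sketch (not one of the official examples in this document; the routine name is hypothetical): each iteration posts its completion with a source dependence, and waits for iteration i-1 with a sink dependence before using its result.

void doacross(float *a, int n)
{
   int i;
   #pragma omp parallel for ordered(1)
   for (i = 1; i < n; i++)
   {
      // ... work independent of other iterations may run in parallel ...

      #pragma omp ordered depend(sink: i-1) // wait for iteration i-1
      a[i] = a[i-1] + 1.0f;                 // cross-iteration use
      #pragma omp ordered depend(source)    // satisfy dependences on i
   }
}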
The flush directive is a stand-alone construct for enforcing consistency between a thread's view of memory and the view of memory for other threads (see the Memory Model chapter of this document for more details). When the construct is used with an explicit variable list, a strong flush that forces a thread's temporary view of memory to be consistent with the actual memory is applied to all listed variables. When the construct is used without an explicit variable list and without a memory-order clause, a strong flush is applied to all locally thread-visible data as defined by the base language, and additionally the construct provides both acquire and release memory ordering semantics. When an explicit variable list is not present and a memory-order clause is present, the construct provides acquire and/or release memory ordering semantics according to the memory-order clause, but no strong flush is performed. A resulting strong flush that applies to a set of variables effectively ensures that no memory (load or store) operation for the affected variables may be reordered across the flush directive.

General-purpose routines provide mutual exclusion semantics through locks, represented by lock variables. The semantics allows a task to set, and hence own, a lock until it is unset by the task that set it. A nestable lock can be set multiple times by a task, and is used when code requires nested control of locks. A simple lock can only be set once by the owning task. There are specific calls for the two types of locks, and the variable of a specific lock type cannot be used by the other lock type.
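A minimal sketch (not one of the official examples in this document) of the simple lock routines: initialize the lock, set and unset it around a protected update, and destroy it when done.

#include <omp.h>

int main(void)
{
   omp_lock_t lck;
   int count = 0;

   omp_init_lock(&lck);      // initialize before first use

   #pragma omp parallel
   {
      omp_set_lock(&lck);    // only one thread owns the lock at a time
      count++;               // protected update
      omp_unset_lock(&lck);  // release ownership
   }

   omp_destroy_lock(&lck);   // uninitialize after last use
   return 0;
}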
Any explicit task will observe the synchronization prescribed in a barrier construct and an implied barrier. Also, additional synchronizations are available for tasks. A task encountering a taskwait will wait for its child tasks to complete. A taskgroup construct creates a region in which the current task is suspended at the end of the region until all tasks generated within the region, and their descendants, have completed. Scheduling constraints on task execution can be prescribed by the depend clause to enforce dependences on previously generated tasks. More details on controlling task executions can be found in the Tasking chapter of the OpenMP Specifications document.
Although the following example might work on some implementations, this is also non-conforming:

Example atomic_restrict.3.f (omp_3.1)
      SUBROUTINE ATOMIC_WRONG3
      INTEGER:: I
      REAL:: R
      EQUIVALENCE(I,R)

!$OMP PARALLEL
!$OMP ATOMIC UPDATE
      I = I + 1
      ! incorrect because I and R reference the same location
      ! but have different types
!$OMP END PARALLEL

!$OMP PARALLEL
!$OMP ATOMIC UPDATE
      R = R + 1.0
      ! incorrect because I and R reference the same location
      ! but have different types
!$OMP END PARALLEL

      END SUBROUTINE ATOMIC_WRONG3
Fortran
Fortran
Example init_lock.1.f
      FUNCTION NEW_LOCKS()
      USE OMP_LIB ! or INCLUDE "omp_lib.h"
      INTEGER(OMP_LOCK_KIND), DIMENSION(1000) :: NEW_LOCKS
      INTEGER I

!$OMP PARALLEL DO PRIVATE(I)
      DO I=1,1000
         CALL OMP_INIT_LOCK(NEW_LOCKS(I))
      END DO
!$OMP END PARALLEL DO

      END FUNCTION NEW_LOCKS
Fortran
Fortran
Example init_lock_with_hint.1.f (omp_4.5)
      FUNCTION NEW_LOCKS()
      USE OMP_LIB ! or INCLUDE "omp_lib.h"
      INTEGER(OMP_LOCK_KIND), DIMENSION(1000) :: NEW_LOCKS

      INTEGER I

!$OMP PARALLEL DO PRIVATE(I)
      DO I=1,1000
         CALL OMP_INIT_LOCK_WITH_HINT(NEW_LOCKS(I),
     &        OMP_LOCK_HINT_CONTENDED + OMP_LOCK_HINT_SPECULATIVE)
      END DO
!$OMP END PARALLEL DO

      END FUNCTION NEW_LOCKS
Fortran
2 Data Environment
3 The OpenMP data environment contains data attributes of variables and objects. Many constructs
4 (such as parallel, simd, task) accept clauses to control data-sharing attributes of referenced
5 variables in the construct, where data-sharing applies to whether the attribute of the variable is
6 shared, is private storage, or has special operational characteristics (as found in the
7 firstprivate, lastprivate, linear, or reduction clause).
8 The data environment for a device (distinguished as a device data environment) is controlled on the
9 host by data-mapping attributes, which determine the relationship of the data on the host, the
10 original data, and the data on the device, the corresponding data.
DATA-SHARING ATTRIBUTES
Data-sharing attributes of variables can be classified as predetermined, explicitly determined, or
implicitly determined.
Certain variables and objects have predetermined attributes. A commonly found case is the loop
iteration variable in the associated loops of a for or do construct, which has a private data-sharing
attribute. Variables with predetermined data-sharing attributes cannot be listed in a data-sharing
clause, with a few exceptions (mainly concerning loop iteration variables).
Variables with explicitly determined data-sharing attributes are those that are referenced in a given
construct and are listed in a data-sharing attribute clause on the construct. The common
data-sharing clauses are shared, private, firstprivate, lastprivate, linear, and
reduction.
Variables with implicitly determined data-sharing attributes are those that are referenced in a given
construct, do not have predetermined data-sharing attributes, and are not listed in a data-sharing
attribute clause of an enclosing construct. For a complete list of variables and objects with
predetermined and implicitly determined attributes, please refer to the Data-sharing Attribute Rules
for Variables Referenced in a Construct subsection of the OpenMP Specifications document.
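The three classifications can be seen together in a short C sketch (an illustration, not one of this document's numbered examples):

#include <stdio.h>

int main(void)
{
   int n = 10, sum = 0, tmp = 5;

   /* i:   predetermined private (loop iteration variable)
      sum: explicitly determined (listed in the reduction clause)
      tmp: explicitly determined (listed in the firstprivate clause)
      n:   implicitly determined (defaults to shared here)          */
   #pragma omp parallel for reduction(+:sum) firstprivate(tmp)
   for (int i = 0; i < n; i++)
      sum += i + tmp;

   printf("sum = %d\n", sum);   /* 45 + 10*5 = 95 */
   return 0;
}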
DATA-MAPPING ATTRIBUTES
The map clause on a device construct explicitly specifies how the list items in the clause are
mapped from the encountering task’s data environment (on the host) to the corresponding items in
the device data environment (on the device). Common list items are arrays, array sections, scalars,
pointers, and structure elements (members).
Procedures and global variables have predetermined data mapping if they appear within the list or
block of a declare target directive. Also, a C/C++ pointer is mapped as a zero-length array
section, as is a C++ variable that is a reference to a pointer.
Without explicit mapping, non-scalar and non-pointer variables within the scope of the target
construct are implicitly mapped with a map-type of tofrom. Without explicit mapping, scalar
variables within the scope of the target construct are not mapped, but have an implicit
firstprivate data-sharing attribute. (That is, the value of the original variable is given to a private
variable of the same name on the device.) This behavior can be changed with the defaultmap
clause.
The map clause can appear on target, target data and target enter/exit data
constructs. The operations of creation and removal of device storage, as well as assignment of the
original list item values to the corresponding list items, may be complicated when the list item
appears on multiple constructs or when the host and device storage is shared. In these cases the
item’s reference count, the number of times it has been referenced (+1 on entry and -1 on exit) in
nested (structured) map regions and/or accumulative (unstructured) mappings, determines the
operation. Details of the map clause and reference count operation are specified in the map Clause
subsection of the OpenMP Specifications document.
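A minimal C sketch of the implicit rules described above (an illustration, not a numbered example); on a system without an available device the target region simply executes on the host:

void map_demo(void)
{
   double a[100];
   double s = 0.0;
   int    n = 100;

   /* a: non-scalar, implicitly mapped with map(tofrom: a)
      n: scalar, implicitly firstprivate (value copied in, not back)
      s: scalar, mapped explicitly so the device result is copied back */
   #pragma omp target map(tofrom: s)
   {
      for (int i = 0; i < n; i++) a[i] = i;
      for (int i = 0; i < n; i++) s += a[i];
   }
}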
The following is an example of the use of threadprivate for module variables:
Example threadprivate.6.f

      MODULE INC_MODULE_GOOD3
        REAL, POINTER :: WORK(:)
        SAVE WORK
!$OMP THREADPRIVATE(WORK)
      END MODULE INC_MODULE_GOOD3

      SUBROUTINE SUB1(N)
        USE INC_MODULE_GOOD3
!$OMP PARALLEL PRIVATE(THE_SUM)
        ALLOCATE(WORK(N))
        CALL SUB2(THE_SUM)
        WRITE(*,*)THE_SUM
!$OMP END PARALLEL
      END SUBROUTINE SUB1

      SUBROUTINE SUB2(THE_SUM)
        USE INC_MODULE_GOOD3
        WORK(:) = 10
        THE_SUM=SUM(WORK)
      END SUBROUTINE SUB2

      PROGRAM INC_GOOD3
        N = 10
        CALL SUB1(N)
      END PROGRAM INC_GOOD3
The following example illustrates the use of threadprivate for static class members. The
threadprivate directive for a static class member must be placed inside the class definition.
C++
Example threadprivate.5.cpp

class T {
public:
   static int i;
#pragma omp threadprivate(i)
};

C++
Note, however, that the use of shared loop iteration variables can easily lead to race conditions.
Fortran
The following example is non-conforming because a common block may not be declared both
shared and private:
Example fort_sp_common.5.f

      SUBROUTINE COMMON_WRONG2()
      COMMON /C/ X,Y
! Incorrect: common block C cannot be declared both
! shared and private
!$OMP PARALLEL PRIVATE (/C/), SHARED(/C/)
! do work here
!$OMP END PARALLEL

      END SUBROUTINE COMMON_WRONG2
Fortran
Example fort_sa_private.2.f

      PROGRAM PRIV_RESTRICT2
      COMMON /BLOCK2/ X
      X = 1.0

!$OMP PARALLEL PRIVATE (X)
      X = 2.0
      CALL SUB()
!$OMP END PARALLEL

      CONTAINS

        SUBROUTINE SUB()
        COMMON /BLOCK2/ Y

        PRINT *,X ! X is undefined
        PRINT *,Y ! Y is undefined
        END SUBROUTINE SUB

      END PROGRAM PRIV_RESTRICT2
Example fort_sa_private.3.f
Example fort_sa_private.4.f

      PROGRAM PRIV_RESTRICT4
      INTEGER I, J
      INTEGER A(100), B(100)
      EQUIVALENCE (A(51), B(1))

!$OMP PARALLEL DO DEFAULT(PRIVATE) PRIVATE(I,J) LASTPRIVATE(A)
      DO I=1,100
        DO J=1,100
          B(J) = J - 1
        ENDDO

        DO J=1,100
          A(J) = J ! B becomes undefined at this point
        ENDDO

        DO J=1,50
          B(J) = B(J) + 1 ! B is undefined
                          ! A becomes undefined at this point
        ENDDO
      ENDDO
!$OMP END PARALLEL DO ! The LASTPRIVATE write for A has
                      ! undefined results

      PRINT *, B ! B is undefined since the LASTPRIVATE
                 ! write of A was not defined
      END PROGRAM PRIV_RESTRICT4
Example fort_sa_private.5.f

      SUBROUTINE SUB1(X)
      DIMENSION X(10)

! This use of X does not conform to the
The following program is non-conforming because the reduction is on the intrinsic procedure name
MAX but that name has been redefined to be the variable named MAX.
Example reduction.3.f90

PROGRAM REDUCTION_WRONG
  MAX = HUGE(0)
  M = 0

!$OMP PARALLEL DO REDUCTION(MAX: M)
! MAX is no longer the intrinsic so this is non-conforming
  DO I = 1, 100
    CALL SUB(M,I)
  END DO

END PROGRAM REDUCTION_WRONG

SUBROUTINE SUB(M,I)
  M = MAX(M,I)
END SUBROUTINE SUB
The following conforming program performs the reduction using the intrinsic procedure name MAX
even though the intrinsic MAX has been renamed to REN.
Example reduction.4.f90

MODULE M
  INTRINSIC MAX
END MODULE M

PROGRAM REDUCTION3
  USE M, REN => MAX
  N = 0
!$OMP PARALLEL DO REDUCTION(REN: N) ! still does MAX
  DO I = 1, 100
    N = MAX(N,I)
  END DO
END PROGRAM REDUCTION3
The following conforming program performs the reduction using the intrinsic procedure name MAX
even though the intrinsic MAX has been renamed to MIN.
Example reduction.5.f90
The following example shows how user-defined reductions can be defined for some STL
containers. The first declare reduction directive defines the plus (+) operation for
std::vector<int> by making use of the std::transform algorithm. The second and third
define the merge (or concatenation) operation for std::vector<int> and std::list<int>,
showing how a user-defined reduction operation can be applied to specific STL data types.
C++
Example udr.6.cpp (omp_4.0)

#include <algorithm>
#include <list>
#include <vector>

#pragma omp declare reduction( + : std::vector<int> : \
    std::transform (omp_out.begin(), omp_out.end(), \
    omp_in.begin(), omp_out.begin(), std::plus<int>()))

#pragma omp declare reduction( merge : std::vector<int> : \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

#pragma omp declare reduction( merge : std::list<int> : \
    omp_out.merge(omp_in))

C++
Note that the effect of the copyprivate clause on a variable with the allocatable attribute
differs from that on a variable with the pointer attribute. The value of A is copied (as if by
intrinsic assignment) and the pointer B is copied (as if by pointer assignment) to the corresponding
list items in the other implicit tasks belonging to the parallel region.
Example copyprivate.4.f

      SUBROUTINE S(N)
      INTEGER N

      REAL, DIMENSION(:), ALLOCATABLE :: A
      REAL, DIMENSION(:), POINTER :: B

      ALLOCATE (A(N))
!$OMP SINGLE
      ALLOCATE (B(N))
      READ (11) A,B
!$OMP END SINGLE COPYPRIVATE(A,B)
! Variable A is private and is
! assigned the same value in each thread
! Variable B is shared

!$OMP BARRIER
!$OMP SINGLE
      DEALLOCATE (B)
!$OMP END SINGLE NOWAIT
      END SUBROUTINE S
The following example illustrates the effect of specifying a selector name on a data-sharing
attribute clause. The associate name u is associated with v, and the variable v is specified on the
private clause of the parallel construct. The construct association is established prior to the
parallel region. The association between u and the original v is retained (see the Data Sharing
Attribute Rules section in the OpenMP 4.0 API Specifications). Inside the parallel region, v
has the value of -1 and u has the value of the original v.
Example associate.3.f90 (omp_4.0)
Memory Model
OpenMP provides a shared-memory model that allows all threads on a given device shared access
to memory. For a given OpenMP region that may be executed by more than one thread or SIMD
lane, variables in memory may be shared or private with respect to those threads or SIMD lanes. A
variable’s data-sharing attribute indicates whether it is shared (the shared attribute) or private (the
private, firstprivate, lastprivate, linear, and reduction attributes) in the data environment of an
OpenMP region. While private variables in an OpenMP region are new copies of the original
variable (with the same name) that may then be concurrently accessed or modified by their
respective threads or SIMD lanes, a shared variable in an OpenMP region is the same as the
variable of the same name in the enclosing region. Concurrent accesses or modifications to a
shared variable may therefore require synchronization to avoid data races.
OpenMP’s memory model also includes a temporary view of memory that is associated with each
thread. Two different threads may see different values for a given variable in their respective
temporary views. Threads may employ flush operations to make their temporary view of a
variable consistent with the value of the variable in memory. The effect of a given flush operation
is characterized by its flush properties – some combination of strong, release, and acquire – and,
for strong flushes, a flush-set.
A strong flush forces consistency between the temporary view and memory for all variables in its
flush-set. Furthermore, all strong flushes in a program that have intersecting flush-sets execute in
some total order, and within a thread strong flushes may not be reordered with respect to other
memory operations on variables in their flush-set. Release and acquire flushes operate in pairs. A
release flush may “synchronize” with an acquire flush, and when it does so the local memory
operations that precede the release flush will appear to have been completed before the local
memory operations on the same variables that follow the acquire flush.
Flush operations arise from explicit flush directives, implicit flush directives, and the
execution of atomic constructs. The flush directive forces a consistent view of local variables
of the thread executing the flush. When a list is supplied on the directive, only the items (variables)
in the list are guaranteed to be flushed. Implied flushes exist at prescribed locations of certain
constructs. For the complete list of these locations and associated constructs, please refer to the
flush Construct section of the OpenMP Specifications document.
In this chapter, examples illustrate how race conditions may arise for accesses to variables with a
shared data-sharing attribute when flush operations are not properly employed. A race condition
can exist when two or more threads are involved in accessing a variable and at least one of the
accesses modifies the variable. In particular, a data race arises when conflicting accesses do not
have a well-defined completion order. The existence of data races in OpenMP programs results in
undefined behavior, so they should generally be avoided for programs to be correct. The
completion order of accesses to a shared variable is guaranteed in OpenMP through a set of
memory consistency rules that are described in the OpenMP Memory Consistency section of the
OpenMP Specifications document.
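A minimal C sketch of release/acquire synchronization through atomic constructs (an illustration in the spirit of this chapter's examples; it assumes the team actually obtains two threads):

#include <omp.h>
#include <stdio.h>

int main(void)
{
   int data = 0, flag = 0;

   #pragma omp parallel num_threads(2)
   {
      if (omp_get_thread_num() == 0) {        /* producer */
         data = 42;
         #pragma omp atomic write release     /* release flush: publishes data */
         flag = 1;
      } else {                                /* consumer */
         int f = 0;
         while (!f) {
            #pragma omp atomic read acquire   /* acquire flush: pairs with the release */
            f = flag;
         }
         printf("data = %d\n", data);         /* guaranteed to print 42 */
      }
   }
   return 0;
}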
Program Control
Some specific and elementary concepts of controlling program execution are illustrated in the
examples of this chapter. Control can be managed directly with conditional compilation (#ifdef
blocks guarded by the _OPENMP macro in C/C++, and the Fortran conditional-compilation
sentinel !$). The if clause on some constructs can direct the runtime to ignore or alter the
behavior of the construct. Of course, the base-language if statements can be used to control the
"execution" of stand-alone directives (such as flush, barrier, taskwait, and
taskyield). However, the directives must appear in a block structure, and not as a substatement,
as shown in examples 1 and 2 of this chapter.
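A minimal C sketch of both mechanisms (an illustration, not a numbered example): the _OPENMP macro selects code at compile time, and the if clause makes the parallel region conditional at run time:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
   int n = 1000000;

   /* run in parallel only when the problem is large enough */
   #pragma omp parallel if(n > 10000)
   {
#ifdef _OPENMP
      printf("compiled with OpenMP, team of %d threads\n",
             omp_get_num_threads());
#else
      printf("compiled without OpenMP\n");
#endif
   }
   return 0;
}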
CANCELLATION
Cancellation (termination) of the normal sequence of execution for the threads in an OpenMP
region can be accomplished with the cancel construct. The construct uses a
construct-type-clause to set the region-type to activate for the cancellation. That is, inclusion of
one of the construct-type-clause names parallel, for, do, sections or taskgroup on the
directive line activates the corresponding region. The cancel construct is activated by the first
encountering thread, which continues execution at the end of the named region. The cancel
construct is also a cancellation point for any other thread of the team, which likewise continues
execution at the end of the named region.
Also, once the specified region has been activated for cancellation, any thread that encounters a
cancellation point construct with the same named region (construct-type-clause)
continues execution at the end of the region.
For an activated cancel taskgroup construct, the tasks that belong to the taskgroup set of the
innermost enclosing taskgroup region will be canceled.
A task that encounters the cancel taskgroup construct continues execution at the end of its
task region. Any task of the taskgroup that has already begun execution will run to completion,
unless it encounters a cancellation point; tasks that have not begun execution "may" be discarded
as completed tasks.
CONTROL VARIABLES
Internal control variables (ICVs) are used by implementations to hold values that control the
execution of OpenMP regions. Controls (and hence the ICVs) may be set as implementation
defaults, or set and adjusted through environment variables, clauses, and API functions. Many of
the ICV control values are accessible through API function calls. Also, initial ICV values are
reported by the runtime if the OMP_DISPLAY_ENV environment variable has been set to TRUE.
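For instance (a minimal C sketch, not a numbered example), the nthreads-var and dyn-var ICVs can be set through API calls and read back inside a region:

#include <omp.h>
#include <stdio.h>

int main(void)
{
   /* environment variables such as OMP_NUM_THREADS provide initial
      ICV values; the API can override them at run time             */
   omp_set_dynamic(0);          /* sets the dyn-var ICV             */
   omp_set_num_threads(4);      /* sets the nthreads-var ICV        */

   #pragma omp parallel
   #pragma omp single
   printf("team size from nthreads-var: %d\n", omp_get_num_threads());
   return 0;
}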
NESTED CONSTRUCTS
Certain combinations of nested constructs are permitted, giving rise to a combined construct
consisting of two or more constructs. These can be used when the two (or more) constructs would
otherwise be used immediately in succession (closely nested). A combined construct can use the
clauses of the component constructs without restrictions. A composite construct is a combined
construct in which one or more clauses have modified or restricted (often obvious) meanings
relative to when the constructs are uncombined.
Certain nestings are forbidden, and often the reasoning is obvious. Worksharing constructs cannot
be nested, and the barrier construct cannot be nested inside a worksharing construct or a
critical construct. Also, target constructs cannot be nested.
The parallel construct can be nested, as can the task construct. The parallel execution in
nested parallel construct(s) is controlled by the OMP_NESTED and
OMP_MAX_ACTIVE_LEVELS environment variables, and the omp_set_nested() and
omp_set_max_active_levels() functions.
More details on nesting can be found in the Nesting of Regions section of the Directives chapter in
the OpenMP Specifications document.
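The canonical combined construct is parallel for (parallel do in Fortran); a minimal C sketch:

void scale(double *a, int n)
{
   int i;

   /* combined construct: equivalent to a parallel construct with a
      for construct immediately (closely) nested inside it          */
   #pragma omp parallel for
   for (i = 0; i < n; i++)
      a[i] = 2.0 * a[i];
}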
The following example illustrates the use of the cancel construct in error handling. If there is an
error condition from the allocate statement, the cancellation is activated. The encountering
thread sets the shared variable err, and other threads of the binding thread set proceed to the end
of the worksharing construct after the cancellation has been activated.
Fortran
Example cancellation.1.f90 (omp_4.0)

subroutine example(n, dim)
   integer, intent(in) :: n, dim(n)
   integer :: i, s, err
   real, allocatable :: B(:)
   err = 0
!$omp parallel shared(err)
! ...
!$omp do private(s, B)
   do i=1, n
!$omp cancellation point do
      allocate(B(dim(i)), stat=s)
      if (s .gt. 0) then
!$omp atomic write
         err = s
!$omp cancel do
      endif
! ...
! deallocate private array B
      if (allocated(B)) then
         deallocate(B)
      endif
   enddo
!$omp end parallel
end subroutine
Fortran
Fortran
Example requires.1.f90 (omp_5.0)

module data
!$omp requires unified_shared_memory
   type,public :: mypoints
      double precision :: res
      double precision :: data(500)
   end type
end module

program main
   use data
   type(mypoints) :: p
   integer :: q=0

!$omp target !! no map clauses needed
   q = q + 1 !! q is firstprivate
   call do_something_with_p(p,q)
!$omp end target

   write(*,'(f5.0,i5)') p%res, q !! output 1. 0

end program

subroutine do_something_with_p(p,q)
   use data
   type(mypoints) :: p
   integer :: q

   p%res = q;
   do i=1,size(p%data)
      p%data(i)=q*i
   enddo

end subroutine
A.2 Changes from 4.5.0 to 5.0.0
• Added the following examples for the 5.0 features:
– Extended teams construct for host execution (Section 1.3 on page 8)
– loop and teams loop constructs specify loop iterations that can execute concurrently
(Section 1.15 on page 38)
– Task data affinity is indicated by the affinity clause of the task construct (Section 2.2
on page 52)
– Display thread affinity with the OMP_DISPLAY_AFFINITY environment variable or the
omp_display_affinity() API routine (Section 2.3 on page 53)
– taskwait with dependences (Section 3.3.6 on page 95)
– mutexinoutset task dependences (Section 3.3.7 on page 102)
– Multidependences using iterators (in depend clauses) (Section 3.3.8 on page 105)
– Combined constructs: parallel master taskloop and
parallel master taskloop simd (Section 3.7 on page 118)
– Reverse offload through the ancestor modifier of the device clause (Section 4.1.6 on
page 129)
– Pointer mapping – behavior of mapped pointers (Section 4.3 on page 137)
– Structure mapping – behavior of mapped structures (Section 4.4 on page 143)
– Array shaping with the shape-operator (Section 4.6 on page 150)
– The declare mapper construct (Section 4.7 on page 153)
– Acquire and release semantics synchronization: memory ordering clauses acquire,
release, and acq_rel were added to flush and atomic constructs (Section 6.7 on
page 258)
– depobj construct provides dependence objects for subsequent use in depend clauses
(Section 6.9 on page 270)
– reduction clause for the task construct (Section 7.9.2 on page 322)
– reduction clause for the taskloop construct (Section 7.9.5 on page 337)
– reduction clause for the taskloop simd construct (Section 7.9.5 on page 337)
– Memory allocators for making OpenMP memory requests with traits (Section 8.2 on
page 376)
– requires directive specifies required features of the implementation (Section 9.5 on
page 395)
– declare variant directive for function variants (Section 9.6 on page 397)
– metadirective directive for directive variants (Section 9.7 on page 403)