OpenMP Workshop Day 2


Tasking Motivation

Advanced OpenMP
8
Sudoku for Lazy Computer Scientists
◼ Let's solve Sudoku puzzles with brute multi-core force
◼ (1) Search an empty field

◼ (2) Try all numbers:


◼ (2 a) Check Sudoku
◼ If invalid: skip
◼ If valid: Go to next field

◼ Wait for completion

Advanced OpenMP
9
Parallel Brute-force Sudoku
◼ This parallel algorithm finds all valid solutions
◼ (1) Search an empty field
◼ (2) Try all numbers:
◼ (2 a) Check Sudoku
◼ If invalid: skip
◼ If valid: Go to next field
◼ Wait for completion

The first call is contained in a
#pragma omp parallel
#pragma omp single
region, so that one task starts the execution of the algorithm.

Each #pragma omp task needs to work on a new copy of the Sudoku board.

#pragma omp taskwait
waits for all child tasks.

Advanced OpenMP
10
Performance Evaluation
Sudoku on 2x Intel Xeon E5-2650 @2.0 GHz
[Figure: Runtime [sec] for 16x16 and speedup, Intel C++ 13.1, scatter binding, for 1-32 threads; annotation: "Is this the best we can do?"]
Advanced OpenMP
11
Tasking Overview

Advanced OpenMP
1
What is a task in OpenMP?
◼ Tasks are work units whose execution
→ may be deferred or…

→ … can be executed immediately


◼ Tasks are composed of
→ code to execute, a data environment (initialized at creation time), internal control variables (ICVs)
◼ Tasks are created…
… when reaching a parallel region → implicit tasks are created (per thread)

… when encountering a task construct → explicit task is created

… when encountering a taskloop construct → explicit tasks per chunk are created

… when encountering a target construct → target task is created
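
A minimal sketch of the first three creation points listed above (the array update is only a placeholder; the target case is omitted):

void example(int n, double *a) {
    #pragma omp parallel          // one implicit task per thread of the team
    {
        #pragma omp single
        {
            #pragma omp task      // one explicit task
            a[0] += 1.0;

            #pragma omp taskloop  // one explicit task per chunk of the loop
            for (int i = 1; i < n; i++)
                a[i] += 1.0;
        }
    }
}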

Advanced OpenMP
2
Tasking execution model
◼ Supports unstructured parallelism
→ unbounded loops

    while ( <expr> ) {
      ...
    }

→ recursive functions

    void myfunc( <args> )
    {
      ...; myfunc( <newargs> ); ...;
    }

◼ Example (unstructured parallelism)

    #pragma omp parallel
    #pragma omp master
    while (elem != NULL) {
      #pragma omp task
      compute(elem);
      elem = elem->next;
    }

[Figure: parallel team of threads pulling work from a task pool]

◼ Several scenarios are possible:
→ single creator, multiple creators, nested tasks (tasks & WS)


◼ All threads in the team are candidates to execute tasks

Advanced OpenMP
3
The task construct
◼ Deferring (or not) a unit of work (executable for any member of the team)
C/C++:
#pragma omp task [clause[[,] clause]...]
{structured-block}

Fortran:
!$omp task [clause[[,] clause]...]
…structured-block…
!$omp end task

◼ Where clause is one of:

Data Environment:
→ private(list)
→ firstprivate(list)
→ shared(list)
→ default(shared | none)
→ in_reduction(r-id: list)
→ allocate([allocator:] list)

Cutoff Strategies:
→ if(scalar-expression)
→ mergeable
→ final(scalar-expression)

Synchronization:
→ depend(dep-type: list)

Task Scheduling:
→ untied
→ priority(priority-value)

Miscellaneous:
→ detach(event-handler)
→ affinity(list)

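A small hedged sketch combining several of these clauses (the CUTOFF constant and the loop body are illustrative assumptions, not part of the original slide):

#define CUTOFF 1000

void process(int *data, int n) {
  // firstprivate captures the arguments at creation time, if() switches
  // deferral off for small work, priority() is only a hint to the runtime
  #pragma omp task firstprivate(data, n) if(n > CUTOFF) priority(10) untied
  {
    for (int i = 0; i < n; i++)
      data[i] *= 2;
  }
}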
Advanced OpenMP
4
Task scheduling: tied vs untied tasks
◼ Tasks are tied by default (when no untied clause present)
→ tied tasks are always executed by the same thread (not necessarily the creator)

→ tied tasks may run into performance problems


◼ Programmers may specify tasks to be untied (relax scheduling)
#pragma omp task untied
{structured-block}

→ can potentially switch to any thread (of the team)

→ bad mix with thread based features: thread-id, threadprivate, critical regions...

→ gives the runtime more flexibility to schedule tasks

→ but most OpenMP implementations don't “honor” untied

Advanced OpenMP
5
Task scheduling: taskyield directive
◼ Task scheduling points (and the taskyield directive)
→ tasks can be suspended/resumed at TSPs → some additional constraints to avoid deadlock problems

→ implicit scheduling points (creation, synchronization, ... )

→ explicit scheduling point: the taskyield directive


#pragma omp taskyield

◼ Scheduling [tied/untied] tasks: example

#pragma omp parallel
#pragma omp single
{
  #pragma omp task untied
  {
    foo();
    #pragma omp taskyield
    bar();
  }
}

tied (default): foo() and bar() are executed by the same thread
untied: after the taskyield, bar() may be resumed by a different thread than the one that ran foo()
Advanced OpenMP
6
Task scheduling: programmer’s hints
◼ Programmers may specify a priority value when creating a task
#pragma omp task priority(pvalue)
{structured-block}

→ pvalue: the higher the value, the earlier the task is likely to be scheduled

→ once a thread becomes idle, it picks one of the highest-priority tasks

#pragma omp parallel
#pragma omp single
{
  for ( i = 0; i < SIZE; i++) {
    #pragma omp task priority(1)
    { code_A; }
  }
  #pragma omp task priority(100)
  { code_B; }
  ...
}

[Figure: parallel team with a priority-aware task pool]

Advanced OpenMP
7
Task synchronization: taskwait directive
◼ The taskwait directive (shallow task synchronization)
→ It is a stand-alone directive
#pragma omp taskwait

→ wait on the completion of child tasks of the current task; just direct children, not all descendant tasks;
includes an implicit task scheduling point (TSP)

#pragma omp parallel
#pragma omp single
{
  #pragma omp task           // :A
  {
    #pragma omp task         // :B
    { … }
    #pragma omp task         // :C
    { … #C.1; #C.2; …}
    #pragma omp taskwait     // waits for B and C, but not for C.1 and C.2
  }
} // implicit barrier will wait for C.x

[Task tree: A → B, C; C → C.1, C.2]
Advanced OpenMP
8
Task synchronization: barrier semantics
◼ OpenMP barrier (implicit or explicit)
→ All tasks created by any thread of the current team are guaranteed to be completed at barrier exit
#pragma omp barrier

→ And all other implicit barriers at parallel, sections, for, single, etc…
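
A small hedged sketch (do_work_a/do_work_b are placeholder functions, not from the original slides): the explicit barrier guarantees that both tasks have completed before any thread continues past it:

#pragma omp parallel
{
  #pragma omp single nowait
  {
    #pragma omp task
    do_work_a();        // placeholder
    #pragma omp task
    do_work_b();        // placeholder
  }
  #pragma omp barrier   // all tasks created by the team are completed here
  // ... code that may rely on the results of both tasks ...
}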

Advanced OpenMP
9
Task synchronization: taskgroup construct
◼ The taskgroup construct (deep task synchronization)
→ attached to a structured block; completion of all descendants of the current task; TSP at the end
#pragma omp taskgroup [clause[[,] clause]...]
{structured-block}

→ where clause (could only be): reduction(reduction-identifier: list-items)

#pragma omp parallel
#pragma omp single          // task A
{
  #pragma omp taskgroup
  {
    #pragma omp task        // :B
    { … }
    #pragma omp task        // :C
    { … #C.1; #C.2; …}
  } // end of taskgroup: waits for B, C and all their descendants (C.1, C.2)
}

[Task tree: A → B, C; C → C.1, C.2]
Advanced OpenMP
10
Data Environment

Advanced OpenMP
11
Explicit data-sharing clauses
◼ Explicit data-sharing clauses (shared, private and firstprivate)
#pragma omp task shared(a) #pragma omp task private(b) #pragma omp task firstprivate(c)
{ { {
// Scope of a: shared // Scope of b: private // Scope of c: firstprivate
} } }

◼ If a default clause is present, the data-sharing attribute is what the clause says


→ shared: data which is not explicitly included in any other data sharing clause will be shared

→ none: compiler will issue an error if the attribute is not explicitly set by the programmer (very useful!!!)
#pragma omp task default(shared)
{
  // Scope of all the references not explicitly
  // included in any other data-sharing clause,
  // and with no pre-determined attribute: shared
}

#pragma omp task default(none)
{
  // Compiler will force you to specify the scope of
  // every single variable referenced in the context
}

Hint: Use default(none) to be forced to think about every variable if the data-sharing attributes are not immediately clear.

Advanced OpenMP
12
Pre-determined data-sharing attributes
◼ threadprivate variables are threadprivate (1)
◼ dynamic storage duration objects are shared (malloc, new, …) (2)
◼ static data members are shared (3)
◼ variables declared inside the construct
→ static storage duration variables are shared (4)
→ automatic storage duration variables are private (5)
◼ the loop iteration variable(s)…

// (5)
#pragma omp task
{
  int x = MN;
  // Scope of x: private
}

// (4)
#pragma omp task
{
  static int y;
  // Scope of y: shared
}

// (1)
int A[SIZE];
#pragma omp threadprivate(A)
// ...
#pragma omp task
{
  // A: threadprivate
}

// (2)
int *p;
p = malloc(sizeof(float)*SIZE);
#pragma omp task
{
  // *p: shared
}

// (3)
void foo(void){
  static int s = MN;
}
#pragma omp task
{
  foo(); // s@foo(): shared
}

Advanced OpenMP
13
Implicit data-sharing attributes (in-practice)
◼ Data-sharing attributes are determined in this order:
→ Pre-determined rules (cannot be changed)
→ Explicit data-sharing clauses (+ default)
→ Implicit data-sharing rules

◼ Implicit data-sharing rules for the task region:
→ the shared attribute is lexically inherited
→ in any other case the variable is firstprivate

int a = 1;
void foo() {
  int b = 2, c = 3;
  #pragma omp parallel private(b)
  {
    int d = 4;
    #pragma omp task
    {
      int e = 5;
      // Scope of a: shared
      // Scope of b: firstprivate
      // Scope of c: shared
      // Scope of d: firstprivate
      // Scope of e: private
    }
  }
}

◼ (in-practice) variable values within the task:
→ value of a: 1
→ value of b: undefined (b is private in the parallel region, so the value captured by firstprivate is undefined)
→ value of c: 3
→ value of d: 4
→ value of e: 5

Advanced OpenMP
14
Task reductions (using taskgroup)
◼ Reduction operation
→ perform some forms of recurrence calculations
→ associative and commutative operators

◼ The (taskgroup) scoping reduction clause

#pragma omp taskgroup task_reduction(op: list)
{structured-block}

→ Registers a new reduction at [1]
→ Computes the final result after [3]

◼ The (task) in_reduction clause [participating]

#pragma omp task in_reduction(op: list)
{structured-block}

→ Task participates in a reduction operation [2]

int res = 0;
node_t* node = NULL;
...
#pragma omp parallel
{
  #pragma omp single
  {
    #pragma omp taskgroup task_reduction(+: res)
    { // [1]
      while (node) {
        #pragma omp task in_reduction(+: res) \
                         firstprivate(node)
        { // [2]
          res += node->value;
        }
        node = node->next;
      }
    } // [3]
  }
}

Advanced OpenMP
15
Task reductions (+ modifiers)
◼ Reduction modifiers
→ Former reduction clauses have been extended
→ the task modifier allows to express task reductions
→ Registering a new task reduction [1]
→ Implicit tasks participate in the reduction [2]
→ Compute the final result after [4]

◼ The (task) in_reduction clause [participating]

#pragma omp task in_reduction(op: list)
{structured-block}

→ Task participates in a reduction operation [3]

int res = 0;
node_t* node = NULL;
...
#pragma omp parallel reduction(task,+: res)
{ // [1][2]
  #pragma omp single
  {
    #pragma omp taskgroup
    {
      while (node) {
        #pragma omp task in_reduction(+: res) \
                         firstprivate(node)
        { // [3]
          res += node->value;
        }
        node = node->next;
      }
    }
  }
} // [4]

Advanced OpenMP
16
Tasking illustrated

Advanced OpenMP
17
Fibonacci illustrated
int main(int argc, char* argv[])
{
  [...]
  #pragma omp parallel
  {
    #pragma omp single
    {
      fib(input);
    }
  }
  [...]
}

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  #pragma omp task shared(x)
  {
    x = fib(n - 1);
  }
  #pragma omp task shared(y)
  {
    y = fib(n - 2);
  }
  #pragma omp taskwait
  return x + y;
}

◼ Only one task/thread enters fib() from main(); it is responsible for creating the two initial work tasks
◼ The taskwait is required, as otherwise x and y would be lost before they are consumed
Advanced OpenMP
18
◼ T1 enters fib(4)
◼ T1 creates tasks for fib(3) and fib(2)
◼ T1 and T2 execute tasks from the queue
◼ T1 and T2 create 4 new tasks
◼ T1 - T4 execute tasks

[Figure: call tree fib(4) → fib(3), fib(2); fib(3) → fib(2), fib(1); fib(2) → fib(1), fib(0); task queue holding fib(3), fib(2), fib(2), fib(1), fib(1), fib(0)]

Advanced OpenMP
19
◼ T1 enters fib(4)
◼ T1 creates tasks for fib(3) and fib(2)
◼ T1 and T2 execute tasks from the queue
◼ T1 and T2 create 4 new tasks
◼ T1 - T4 execute tasks
◼ …

[Figure: the full call tree, expanded down to fib(1) and fib(0)]

Advanced OpenMP
20
The taskloop Construct

Advanced OpenMP
1
Tasking use case: saxpy (taskloop)
// Original loop
for ( i = 0; i<SIZE; i+=1) {
  A[i]=A[i]*B[i]*S;
}

// Manually blocked loop
for ( i = 0; i<SIZE; i+=TS) {
  UB = SIZE < (i+TS)?SIZE:i+TS;
  for ( ii=i; ii<UB; ii++) {
    A[ii]=A[ii]*B[ii]*S;
  }
}

// Manually tasked, blocked loop
#pragma omp parallel
#pragma omp single
for ( i = 0; i<SIZE; i+=TS) {
  UB = SIZE < (i+TS)?SIZE:i+TS;
  #pragma omp task private(ii) \
          firstprivate(i,UB) shared(S,A,B)
  for ( ii=i; ii<UB; ii++) {
    A[ii]=A[ii]*B[ii]*S;
  }
}

// taskloop version
#pragma omp taskloop grainsize(TS)
for ( i = 0; i<SIZE; i+=1) {
  A[i]=A[i]*B[i]*S;
}

◼ Difficult to determine the grain
→ 1 single iteration → too fine
→ whole loop → no parallelism
◼ Manually transform the code
→ blocking techniques
◼ Improving programmability
→ OpenMP taskloop
→ Hiding the internal details
→ Grain size ~ tile size (TS), but the implementation decides the exact grain size
Advanced OpenMP
2
The taskloop Construct
◼ Task generating construct: decompose a loop into chunks, create a task for each loop chunk
C/C++:
#pragma omp taskloop [clause[[,] clause]…]
{structured-for-loops}

Fortran:
!$omp taskloop [clause[[,] clause]…]
…structured-do-loops…
!$omp end taskloop

◼ Where clause is one of:

Data Environment:
→ shared(list)
→ private(list)
→ firstprivate(list)
→ lastprivate(list)
→ default(sh | pr | fp | none)
→ reduction(r-id: list)
→ in_reduction(r-id: list)

Cutoff Strategies:
→ if(scalar-expression)
→ final(scalar-expression)
→ mergeable

Scheduler (R/H):
→ untied
→ priority(priority-value)

Chunks/Grain:
→ grainsize(grain-size)
→ num_tasks(num-tasks)

Miscellaneous:
→ collapse(n)
→ nogroup
→ allocate([allocator:] list)
Advanced OpenMP
3
Worksharing vs. taskloop constructs (1/2)
subroutine worksharing
  integer :: x
  integer :: i
  integer, parameter :: T = 16
  integer, parameter :: N = 1024

  x = 0
  !$omp parallel shared(x) num_threads(T)
  !$omp do
  do i = 1,N
    !$omp atomic
    x = x + 1
    !$omp end atomic
  end do
  !$omp end do
  !$omp end parallel

  write (*,'(A,I0)') 'x = ', x
end subroutine

subroutine taskloop
  integer :: x
  integer :: i
  integer, parameter :: T = 16
  integer, parameter :: N = 1024

  x = 0
  !$omp parallel shared(x) num_threads(T)
  !$omp taskloop
  do i = 1,N
    !$omp atomic
    x = x + 1
    !$omp end atomic
  end do
  !$omp end taskloop
  !$omp end parallel

  write (*,'(A,I0)') 'x = ', x
end subroutine

Advanced OpenMP
4
Worksharing vs. taskloop constructs (2/2)
subroutine worksharing
  integer :: x
  integer :: i
  integer, parameter :: T = 16
  integer, parameter :: N = 1024

  x = 0
  !$omp parallel shared(x) num_threads(T)
  !$omp do
  do i = 1,N
    !$omp atomic
    x = x + 1
    !$omp end atomic
  end do
  !$omp end do
  !$omp end parallel
  write (*,'(A,I0)') 'x = ', x
end subroutine

subroutine taskloop
  integer :: x
  integer :: i
  integer, parameter :: T = 16
  integer, parameter :: N = 1024

  x = 0
  !$omp parallel shared(x) num_threads(T)
  !$omp single
  !$omp taskloop
  do i = 1,N
    !$omp atomic
    x = x + 1
    !$omp end atomic
  end do
  !$omp end taskloop
  !$omp end single
  !$omp end parallel
  write (*,'(A,I0)') 'x = ', x
end subroutine

Advanced OpenMP
5
Taskloop decomposition approaches
◼ Clause: grainsize(grain-size)
→ Chunks have at least grain-size iterations
→ Chunks have at most 2x grain-size iterations

int TS = 4 * 1024;
#pragma omp taskloop grainsize(TS)
for ( i = 0; i<SIZE; i+=1) {
  A[i]=A[i]*B[i]*S;
}

◼ Clause: num_tasks(num-tasks)
→ Create num-tasks chunks
→ Each chunk must have at least one iteration

int NT = 4 * omp_get_num_threads();
#pragma omp taskloop num_tasks(NT)
for ( i = 0; i<SIZE; i+=1) {
  A[i]=A[i]*B[i]*S;
}

◼ If none of the previous clauses is present, the number of chunks and the number of iterations per chunk are implementation defined
◼ Additional considerations:
→ The order of creation of the loop tasks is unspecified
→ Taskloop creates an implicit taskgroup region; nogroup → no implicit taskgroup region is created

Advanced OpenMP
6
Collapsing iteration spaces with taskloop
◼ The collapse clause in the taskloop construct

#pragma omp taskloop collapse(n)
{structured-for-loops}

→ Number of loops associated with the taskloop construct (n)
→ Loops are collapsed into one larger iteration space
→ Then divided according to the grainsize and num_tasks

#pragma omp taskloop collapse(2)
for ( i = 0; i<SX; i+=1) {
  for ( j = 0; j<SY; j+=1) {
    for ( k = 0; k<SZ; k+=1) {
      A[f(i,j,k)]=<expression>;
    }
  }
}

◼ Intervening code between any two associated loops is executed
→ at least once per iteration of the enclosing loop
→ at most once per iteration of the innermost loop

// roughly equivalent manually collapsed form
#pragma omp taskloop
for ( ij = 0; ij<SX*SY; ij+=1) {
  for ( k = 0; k<SZ; k+=1) {
    i = index_for_i(ij);
    j = index_for_j(ij);
    A[f(i,j,k)]=<expression>;
  }
}

Advanced OpenMP
7
Task reductions (using taskloop)
◼ Clause: reduction(r-id: list)
→ It defines the scope of a new reduction
→ All created tasks participate in the reduction
→ It cannot be used with the nogroup clause

double dotprod(int n, double *x, double *y) {
  double r = 0.0;
  #pragma omp taskloop reduction(+: r)
  for (i = 0; i < n; i++)
    r += x[i] * y[i];
  return r;
}

◼ Clause: in_reduction(r-id: list)
→ Reuses an already defined reduction scope
→ All created tasks participate in the reduction
→ It can be used with the nogroup clause, but then it is the user's responsibility to guarantee the result

double dotprod(int n, double *x, double *y) {
  double r = 0.0;
  #pragma omp taskgroup task_reduction(+: r)
  {
    #pragma omp taskloop in_reduction(+: r)
    for (i = 0; i < n; i++)
      r += x[i] * y[i];
  }
  return r;
}

Advanced OpenMP
8
Composite construct: taskloop simd
◼ Task generating construct: decompose a loop into chunks, create a task for each loop chunk
◼ Each generated task will apply (internally) SIMD to each loop chunk
→ C/C++ syntax:
#pragma omp taskloop simd [clause[[,] clause]…]
{structured-for-loops}

→ Fortran syntax:
!$omp taskloop simd [clause[[,] clause]…]
…structured-do-loops…
!$omp end taskloop simd

◼ Where clause is any of the clauses accepted by taskloop or simd directives

Advanced OpenMP
9
Improving Tasking Performance:
Task dependences

Advanced OpenMP
1
Motivation
◼ Task dependences as a way to define task-execution constraints
// OpenMP 3.1
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task
  std::cout << x << std::endl;

  #pragma omp taskwait

  #pragma omp task
  x++;
}

// OpenMP 4.0
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(in: x)
  std::cout << x << std::endl;

  #pragma omp task depend(inout: x)
  x++;
}

[Figure: creation/execution timelines of tasks t1 and t2; with OpenMP 3.1 the taskwait serializes the creator, with OpenMP 4.0 the dependence alone orders the tasks]
Advanced OpenMP
2
Motivation
◼ Task dependences as a way to define task-execution constraints (same example as on the previous slide)

Task dependences can help us to remove "strong" synchronizations, increasing the look-ahead and, frequently, the parallelism!
Advanced OpenMP
3
Motivation: Cholesky factorization
// OpenMP 3.1
void cholesky(int ts, int nt, double* a[nt][nt]) {
  for (int k = 0; k < nt; k++) {
    // Diagonal block factorization
    potrf(a[k][k], ts, ts);

    // Triangular systems
    for (int i = k + 1; i < nt; i++) {
      #pragma omp task
      trsm(a[k][k], a[k][i], ts, ts);
    }
    #pragma omp taskwait

    // Update trailing matrix
    for (int i = k + 1; i < nt; i++) {
      for (int j = k + 1; j < i; j++) {
        #pragma omp task
        dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
      }
      #pragma omp task
      syrk(a[k][i], a[i][i], ts, ts);
    }
    #pragma omp taskwait
  }
}

// OpenMP 4.0
void cholesky(int ts, int nt, double* a[nt][nt]) {
  for (int k = 0; k < nt; k++) {
    // Diagonal block factorization
    #pragma omp task depend(inout: a[k][k])
    potrf(a[k][k], ts, ts);

    // Triangular systems
    for (int i = k + 1; i < nt; i++) {
      #pragma omp task depend(in: a[k][k]) depend(inout: a[k][i])
      trsm(a[k][k], a[k][i], ts, ts);
    }

    // Update trailing matrix
    for (int i = k + 1; i < nt; i++) {
      for (int j = k + 1; j < i; j++) {
        #pragma omp task depend(inout: a[j][i]) depend(in: a[k][i], a[k][j])
        dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
      }
      #pragma omp task depend(inout: a[i][i]) depend(in: a[k][i])
      syrk(a[k][i], a[i][i], ts, ts);
    }
  }
}

[Figure: blocked matrix of nt x nt tiles, each of size ts x ts]

Advanced OpenMP
4
Motivation: Cholesky factorization
(Same code as on the previous slide; performance results measured using the 2017 Intel compiler.)

Advanced OpenMP
5
What’s in the spec

Advanced OpenMP
6
What’s in the spec: a bit of history
OpenMP 4.0
• The depend clause was added to the task construct

OpenMP 4.5
• The depend clause was added to the target constructs
• Support for doacross loops

OpenMP 5.0
• lvalue expressions in the depend clause
• New dependency type: mutexinoutset
• Iterators were added to the depend clause
• The depend clause was added to the taskwait construct
• Dependable objects

Advanced OpenMP
7
What’s in the spec: syntax depend clause

depend([depend-modifier,] dependency-type: list-items)

where:
→ depend-modifier is used to define iterators

→ dependency-type may be: in, out, inout, mutexinoutset and depobj

→ A list-item may be:


• C/C++: An lvalue expr or an array section depend(in: x, v[i], *p, w[10:10])

• Fortran: A variable or an array section depend(in: x, v(i), w(10:20))

Advanced OpenMP
8
What’s in the spec: sema depend clause (1)
◼ A task cannot be executed until all its predecessor tasks are completed

◼ If a task defines an in dependence over a list-item


→ the task will depend on all previously generated sibling tasks that reference that list-item in an out or
inout dependence

◼ If a task defines an out/inout dependence over list-item


→ the task will depend on all previously generated sibling tasks that reference that list-item in an in, out or
inout dependence

Advanced OpenMP
9
What’s in the spec: depend clause (1)
◼ A task cannot be executed until all its predecessor tasks are completed
◼ If a task defines an in dependence over a variable
→ the task will depend on all previously generated sibling tasks that reference at least one of the list items in an out or inout dependence
◼ If a task defines an out/inout dependence over a variable
→ the task will depend on all previously generated sibling tasks that reference at least one of the list items in an in, out or inout dependence

int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  { ... }

  #pragma omp task depend(in: x)    //T2
  { ... }

  #pragma omp task depend(in: x)    //T3
  { ... }

  #pragma omp task depend(inout: x) //T4
  { ... }
}

[Task graph: T1 → T2 and T3; T2 and T3 → T4]

Advanced OpenMP
10
What’s in the spec: depend clause (2)
◼ New dependency type: mutexinoutset

int x = 0, y = 0, res = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(out: res)                         //T0
  res = 0;

  #pragma omp task depend(out: x)                           //T1
  long_computation(x);

  #pragma omp task depend(out: y)                           //T2
  short_computation(y);

  #pragma omp task depend(in: x) depend(mutexinoutset: res) //T3
  res += x;

  #pragma omp task depend(in: y) depend(mutexinoutset: res) //T4
  res += y;

  #pragma omp task depend(in: res)                          //T5
  std::cout << res << std::endl;
}

1. inoutset property: tasks with a mutexinoutset dependence create a cloud of tasks (an inout set) that synchronizes with previous & posterior tasks that depend on the same list item
2. mutex property: tasks inside the inout set can be executed in any order but with mutual exclusion

[Task graph: T0, T1, T2 → T3, T4 (mutually exclusive) → T5; with plain depend(inout: res) on T3 and T4 they would instead be serialized in creation order]
Advanced OpenMP
11
What’s in the spec: depend clause (4)
◼ Task dependences are defined among sibling tasks
◼ List items used in the depend clauses […] must indicate identical or disjoint storage

//test1.cc
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  {
    #pragma omp task depend(inout: x) //T1.1
    x++;

    #pragma omp taskwait
  }
  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;
}
// T1 and T2 are ordered; T1.1 is a child of T1, not a sibling of T2

//test2.cc
int a[100] = {0};
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: a[50:99]) //T1
  compute(/* from */ &a[50], /*elems*/ 50);

  #pragma omp task depend(in: a) //T2
  print(/* from */ a, /* elem */ 100);
}
// ??? a[50:99] and a partially overlap (neither identical nor disjoint)

Advanced OpenMP
12
What’s in the spec: depend clause (5)
◼ Iterators + deps: a way to define a dynamic number of dependences

std::list<int> list = ...;
int n = list.size();

#pragma omp parallel
#pragma omp single
{
  for (int i = 0; i < n; ++i)
    #pragma omp task depend(out: list[i]) //Px
    compute_elem(list[i]);

  #pragma omp task depend(iterator(j=0:n), in: list[j]) //C
  print_elems(list);
}

It seems innocent but it's not: depend(out: list[i]) is really depend(out: list.operator[](i))

The iterator form is equivalent to:
depend(in: list[0], list[1], …, list[n-1])

[Task graph: P1, P2, …, Pn → C]

Advanced OpenMP
13
Philosophy

Advanced OpenMP
14
Philosophy: data-flow model
◼ Task dependences are orthogonal to data-sharings
→ Dependences as a way to define task-execution constraints

→ Data-sharings as how the data is captured to be used inside the task

// test1.cc
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) \
                   firstprivate(x) //T1
  x++;

  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;
}
// OK, but it always prints ‘0’ :(

// test2.cc
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  x++;

  #pragma omp task depend(in: x) \
                   firstprivate(x) //T2
  std::cout << x << std::endl;
}
// We have a data-race!!


Advanced OpenMP
15
Philosophy: data-flow model (2)
◼ Properly combining dependences and data-sharings allow us to define
a task data-flow model
→Data that is read in the task → input dependence

→Data that is written in the task → output dependence

◼ A task data-flow model


→Enhances the composability

→Eases the parallelization of new regions of your code

Advanced OpenMP
16
Philosophy: data-flow model (3)
//test1_v1.cc: T1 updates y without declaring a dependence on it
int x = 0, y = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  {
    x++;
    y++; // !!!
  }
  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;

  #pragma omp taskwait
  std::cout << y << std::endl;
}

(test1_v2.cc and test1_v3.cc step through intermediate versions that move the print of y into its own task and add y to T1's dependences.)

//test1_v4.cc: fully annotated version
int x = 0, y = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x, y) //T1
  {
    x++;
    y++;
  }
  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;

  #pragma omp task depend(in: y) //T3
  std::cout << y << std::endl;
}

If all tasks are properly annotated, we only have to worry about the dependences & data-sharings of the new task!!!
Advanced OpenMP
17
Use case

Advanced OpenMP
18
Use case: intro to Gauss-seidel

void serial_gauss_seidel(int tsteps, int size, int (*p)[size]) {
  for (int t = 0; t < tsteps; ++t) {
    for (int i = 1; i < size-1; ++i) {
      for (int j = 1; j < size-1; ++j) {
        p[i][j] = 0.25 * (p[i][j-1] + // left
                          p[i][j+1] + // right
                          p[i-1][j] + // top
                          p[i+1][j]); // bottom
      }
    }
  }
}

Access pattern analysis (for a specific t, i and j): each cell depends on
- two cells (north & west) that are computed in the current time step, and
- two cells (south & east) that were computed in the previous time step

Advanced OpenMP
19
Use case: Gauss-seidel (2)

1st parallelization strategy: parallelize within a specific time step t (same serial code as above).

We can exploit the wavefront to obtain parallelism!!

Advanced OpenMP
20
Use case : Gauss-seidel (3)
void gauss_seidel(int tsteps, int size, int TS, int (*p)[size]) {
  int NB = size / TS;
  #pragma omp parallel
  for (int t = 0; t < tsteps; ++t) {
    // First NB diagonals
    for (int diag = 0; diag < NB; ++diag) {
      #pragma omp for
      for (int d = 0; d <= diag; ++d) {
        int ii = d;
        int jj = diag - d;
        for (int i = 1+ii*TS; i < ((ii+1)*TS); ++i)
          for (int j = 1+jj*TS; j < ((jj+1)*TS); ++j)
            p[i][j] = 0.25 * (p[i][j-1] + p[i][j+1] +
                              p[i-1][j] + p[i+1][j]);
      }
    }
    // Last NB diagonals
    for (int diag = NB-1; diag >= 0; --diag) {
      // Similar code to the previous loop
    }
  }
}

Advanced OpenMP
21
Use case : Gauss-seidel (4)

2nd parallelization strategy: work on multiple time iterations at once (same serial code as above).

We can exploit the wavefront of multiple time steps (tn, tn+1, tn+2, tn+3) to obtain MORE parallelism!!
Advanced OpenMP
22
Use case : Gauss-seidel (5)
// inner matrix region
void gauss_seidel(int tsteps, int size, int TS, int (*p)[size]) {
  int NB = size / TS;

  #pragma omp parallel
  #pragma omp single
  for (int t = 0; t < tsteps; ++t)
    for (int ii=1; ii < size-1; ii+=TS)
      for (int jj=1; jj < size-1; jj+=TS) {
        #pragma omp task depend(inout: p[ii:TS][jj:TS]) \
                depend(in: p[ii-TS:TS][jj:TS], p[ii+TS:TS][jj:TS],
                           p[ii:TS][jj-TS:TS], p[ii:TS][jj+TS:TS])
        {
          for (int i=ii; i<(1+ii)*TS; ++i)
            for (int j=jj; j<(1+jj)*TS; ++j)
              p[i][j] = 0.25 * (p[i][j-1] + p[i][j+1] +
                                p[i-1][j] + p[i+1][j]);
        }
      }
}

Q: Why do the input dependences depend on the whole block rather than just a column/row?

Advanced OpenMP
23
OpenMP 5.0: (even) more advanced features

Advanced OpenMP
25
Advanced features: deps on taskwait
◼ Adding dependences to the taskwait construct
→Using a taskwait construct to explicitly wait for some predecessor tasks
→Syntactic sugar!
int x = 0, y = 0;
#pragma omp parallel
#pragma omp single
{
#pragma omp task depend(inout: x) //T1
x++;
#pragma omp task depend(in: y) //T2
std::cout << y << std::endl;
#pragma omp taskwait depend(in: x)
std::cout << x << std::endl;
}

Advanced OpenMP
26
Advanced features: dependable objects (1)
◼ Offer a way to manually handle dependences
→Useful for complex task dependences

→It allows a more efficient allocation of task dependences

→New omp_depend_t opaque type

→3 new constructs to manage dependable objects


→#pragma omp depobj(obj) depend(dep-type: list)

→#pragma omp depobj(obj) update(dep-type)

→#pragma omp depobj(obj) destroy

Advanced OpenMP
27
Advanced features: dependable objects (2)
◼ Offer a way to manually handle dependences

// without dependable objects
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  x++;

  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;
}

// with dependable objects
int x = 0;
omp_depend_t obj;
#pragma omp parallel
#pragma omp single
{
  #pragma omp depobj(obj) depend(inout: x)

  #pragma omp task depend(depobj: obj) //T1
  x++;

  #pragma omp depobj(obj) update(in)

  #pragma omp task depend(depobj: obj) //T2
  std::cout << x << std::endl;

  #pragma omp depobj(obj) destroy
}

[Task graph: T1 → T2 in both versions]

Advanced OpenMP
28
Cancellation

Advanced OpenMP
1
OpenMP 3.1 Parallel Abort
◼ Once started, parallel execution cannot be aborted in OpenMP 3.1
→ Code regions must always run to completion
→ (or not start at all)

◼ Cancellation in OpenMP 4.0 provides a best-effort approach to


terminate OpenMP regions
→ Best-effort: not guaranteed to trigger termination immediately
→ Triggered “as soon as” possible

Advanced OpenMP
2
Cancellation Constructs
◼ Two constructs:
→ Activate cancellation:
C/C++: #pragma omp cancel
Fortran: !$omp cancel
→ Check for cancellation:
C/C++: #pragma omp cancellation point
Fortran: !$omp cancellation point
◼ Check for cancellation only at certain points
→ Avoid unnecessary overheads
→ Programmers need to reason about cancellation
→ Cleanup code needs to be added manually

Advanced OpenMP
3
Cancellation Semantics

[Figure sequence over four slides: Threads A, B and C executing a parallel region, illustrating how a cancellation request issued by one thread is observed by the others at their next cancellation point]
Advanced OpenMP
7
cancel Construct
◼ Syntax:
#pragma omp cancel construct-type-clause [ [, ]if-clause]
!$omp cancel construct-type-clause [ [, ]if-clause]

◼ Clauses:
parallel
sections
for (C/C++)
do (Fortran)
taskgroup
if (scalar-expression)

◼ Semantics
→ Requests cancellation of the inner-most OpenMP region of the type specified in
construct-type-clause
→ Lets the encountering thread/task proceed to the end of the canceled region
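
A hedged sketch of cancel and cancellation point used together (an illustration, not from the original slides; it assumes cancellation is activated at run time, e.g. via OMP_CANCELLATION=true, otherwise cancel is a no-op): a parallel search that stops handing out iterations once a match is found:

int find_first(const int *data, int n, int value) {
  int found = -1;
  #pragma omp parallel shared(found)
  {
    #pragma omp for
    for (int i = 0; i < n; i++) {
      if (data[i] == value) {
        #pragma omp critical
        if (found < 0) found = i;
        #pragma omp cancel for              // request cancellation of the for region
      }
      #pragma omp cancellation point for    // other threads notice the cancellation here
    }
  }
  return found;
}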

Advanced OpenMP
8
cancellation point Construct
◼ Syntax:
#pragma omp cancellation point construct-type-clause
!$omp cancellation point construct-type-clause

◼ Clauses:
parallel
sections
for (C/C++)
do (Fortran)
taskgroup

◼ Semantics
→ Introduces a user-defined cancellation point
→ Pre-defined cancellation points:
→ implicit/explicit barriers regions
→ cancel regions

Advanced OpenMP
9
Cancellation of OpenMP Tasks
◼ Cancellation only acts on tasks grouped by the taskgroup construct
→ The encountering task jumps to the end of its task region
→ Any executing task will run to completion
(or until they reach a cancellation point region)
→ Any task that has not yet begun execution may be discarded
(and is considered completed)

◼ Task cancellation also occurs if a parallel region is canceled
→ But not if cancellation affects a worksharing construct

Advanced OpenMP
10
Task Cancellation Example
binary_tree_t* search_tree_parallel(binary_tree_t* tree, int value) {
binary_tree_t* found = NULL;
#pragma omp parallel shared(found,tree,value)
{
#pragma omp master
{
#pragma omp taskgroup
{
found = search_tree(tree, value);
}
}
}
return found;
}

Advanced OpenMP
11
Task Cancellation Example
binary_tree_t* search_tree(binary_tree_t* tree, int value) {
  binary_tree_t* found = NULL;
  if (tree) {
    if (tree->value == value) {
      found = tree;
    }
    else {
      #pragma omp task shared(found)
      {
        binary_tree_t* found_left;
        found_left = search_tree(tree->left, value);
        if (found_left) {
          #pragma omp atomic write
          found = found_left;
          #pragma omp cancel taskgroup
        }
      }
      #pragma omp task shared(found)
      {
        binary_tree_t* found_right;
        found_right = search_tree(tree->right, value);
        if (found_right) {
          #pragma omp atomic write
          found = found_right;
          #pragma omp cancel taskgroup
        }
      }
      #pragma omp taskwait
    }
  }
  return found;
}

Advanced OpenMP
12
Improving Tasking Performance:
Cutoff clauses and strategies

Advanced OpenMP
1
Example: Sudoku revisited

Advanced OpenMP
2
Parallel Brute-force Sudoku
◼ This parallel algorithm finds all valid solutions
◼ (1) Search an empty field
◼ (2) Try all numbers:
◼ (2 a) Check Sudoku
◼ If invalid: skip
◼ If valid: Go to next field
◼ Wait for completion

The first call is contained in a
#pragma omp parallel
#pragma omp single
region, so that one task starts the execution of the algorithm.

Each #pragma omp task needs to work on a new copy of the Sudoku board.

#pragma omp taskwait
waits for all child tasks.

Advanced OpenMP
3
Performance Evaluation
Sudoku on 2x Intel Xeon E5-2650 @2.0 GHz
[Figure: Runtime [sec] for 16x16 and speedup, Intel C++ 13.1, scatter binding, for 1-32 threads]

Advanced OpenMP
4
Performance Analysis
Event-based profiling provides a good overview:
→ Every thread is executing ~1.3m tasks…
→ … in ~5.7 seconds
→ => average duration of a task is ~4.4 μs

Tracing provides more details:
→ lvl 6:  duration 0.16 sec
→ lvl 12: duration 0.047 sec
→ lvl 48: duration 0.001 sec
→ lvl 82: duration 2.2 μs
→ Tasks get much smaller down the call-stack
Advanced OpenMP
5
Performance Analysis
(Same profile as on the previous slide.)

If you have enough parallelism, stop creating more tasks!!
• if-clause, final-clause, mergeable-clause
• natively in your program code
Advanced OpenMP
6
Performance Evaluation (with cutoff)
Sudoku on 2x Intel Xeon E5-2650 @2.0 GHz
[Figure: Runtime [sec] for 16x16 and speedup, Intel C++ 13.1, scatter binding, for 1-32 threads, with and without cutoff; with the cutoff the speedup axis extends to 18, compared to 4 without]

Advanced OpenMP
7
The if clause
◼ Rule of thumb: the if(expression) clause as a “switch off” mechanism
→ Allows lightweight implementations of task creation and execution, but it reduces the parallelism

◼ If the expression of the if clause evaluates to false
→ the encountering task is suspended
→ the new task is executed immediately (task dependences are respected!!)
→ the encountering task resumes its execution once the new task is completed
→ This is known as an undeferred task

int foo(int x) {
  printf("entering foo function\n");
  int res = 0;
  #pragma omp task shared(res) if(false)
  {
    res += x;
  }
  printf("leaving foo function\n");
}

Really useful to debug tasking applications!

◼ Even if the expression is false, data-sharing clauses are honored

Advanced OpenMP
8
The final clause
◼ The final(expression) clause
→ Nested tasks / recursive applications
→ allows to avoid future task creation → reduces overhead but also reduces parallelism

◼ If the expression of the final clause evaluates to true
→ The new task is created and executed normally, but in its context all tasks will be executed immediately by the same thread (included tasks)

#pragma omp task final(e)
{
  #pragma omp task
  { … }
  #pragma omp task
  { … #C.1; #C.2 … }
  #pragma omp taskwait
}

[If e == false: task A spawns child tasks B and C, and C spawns C.1 and C.2.
 If e == true: the same code runs as Code_B; Code_C; code_c1; code_c2; sequentially inside A.]

◼ Data-sharing clauses are honored too!

Advanced OpenMP
9
The mergeable clause
◼ The mergeable clause
→ Optimization: get rid of “data-sharing clauses are honored”

→ This optimization can only be applied in undeferred or included tasks

◼ A Task that is annotated with the mergeable clause is called a mergeable task
→ A task that may be a merged task if it is an undeferred task or an included task

◼ A merged task is:


→ A task for which the data environment (inclusive of ICVs) may be the same as that of
its generating task region

◼ A good implementation could execute a merged task without adding any OpenMP-related overhead

Unfortunately, there are no commercial OpenMP implementations taking advantage of final or mergeable =(
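
A hedged sketch of how final and mergeable are typically combined as a depth-based cutoff (the CUTOFF constant is an assumed tuning parameter, not from the original slides):

#define CUTOFF 8

long fib(long n, int depth) {
  if (n < 2) return n;
  long x, y;
  // below the cutoff, nested tasks become included (executed immediately) and
  // mergeable allows the runtime to reuse the generating task's data environment
  #pragma omp task shared(x) final(depth >= CUTOFF) mergeable
  x = fib(n - 1, depth + 1);
  #pragma omp task shared(y) final(depth >= CUTOFF) mergeable
  y = fib(n - 2, depth + 1);
  #pragma omp taskwait
  return x + y;
}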
Advanced OpenMP
10
Vectorization w/ OpenMP SIMD

1 Advanced OpenMP
Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED,
BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY
THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY
EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY
OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY,
OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY
RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and
reflect the approximate performance of Intel products as measured by those tests. Any difference in
system hardware or software design or configuration may affect actual performance. Buyers should
consult other sources of information to evaluate the performance of systems or components they are
considering purchasing. For more information on performance tests and on the performance of Intel
products, reference www.intel.com/software/products.
All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, VTune, and Cilk are trademarks of Intel
Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.

Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations
that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction
sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any
optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this
product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and
Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

2 Advanced OpenMP
Evolution of Intel Hardware

Images not intended to reflect actual die sizes

             64-bit Intel®   Intel® Xeon®   Intel® Xeon®   Intel® Xeon®   Intel® Xeon®      Intel® Xeon®
             Xeon®           processor      processor      processor      processor E5-     Scalable
             processor       5100 series    5500 series    5600 series    2600v3 series     Processor
Frequency    3.6 GHz         3.0 GHz        3.2 GHz        3.3 GHz        2.3 GHz           2.5 GHz
Core(s)      1               2              4              6              18                28
Thread(s)    2               2              8              12             36                56
SIMD width   128 (2 clock)   128 (1 clock)  128 (1 clock)  128 (1 clock)  256 (1 clock)     512 (1 clock)

3 Advanced OpenMP
Levels of Parallelism
◼ OpenMP already supports several levels of parallelism in today’s hardware

Cluster: group of computers communicating through fast interconnect

Coprocessors/Accelerators: special compute devices attached to the local node through special interconnect

Node: group of processors communicating through shared memory

Socket: group of cores communicating through shared cache

Core: group of functional units communicating through registers

Hyper-Threads: group of thread contexts sharing functional units

Superscalar: group of instructions sharing functional units

Pipeline: sequence of instructions sharing functional units

Vector: single instruction using multiple functional units

4 Advanced OpenMP
SIMD on Intel® Architecture
◼ Width of SIMD registers has been growing in the past:
SSE:     128 bit → 2 x DP, 4 x SP

AVX:     256 bit → 4 x DP, 8 x SP

AVX-512: 512 bit → 8 x DP, 16 x SP

5 Advanced OpenMP
More Powerful SIMD Units
◼ SIMD instructions become more powerful
◼ One example is the Intel® Xeon Phi™ Coprocessor
vaddpd dest, source1, source2

[512-bit vector add: for each of the 8 double-precision lanes, dest[i] = source1[i] + source2[i] (a7+b7 … a0+b0)]

6 Advanced OpenMP
More Powerful SIMD Units
◼ SIMD instructions become more powerful
◼ One example is the Intel® Xeon Phi™ Coprocessor
vfmadd213pd source1, source2, source3

[512-bit fused multiply-add: for each of the 8 double-precision lanes, dest[i] = source1[i] * source2[i] + source3[i] (a7*b7+c7 … a0*b0+c0)]

7 Advanced OpenMP
More Powerful SIMD Units
◼ SIMD instructions become more powerful
◼ One example is the Intel® Xeon Phi™ Coprocessor
vaddpd dest{k1}, source2, source3

[512-bit masked add: lanes whose mask bit is 1 receive a[i] + b[i]; lanes whose mask bit is 0 keep their previous dest value d[i]]

8 Advanced OpenMP
More Powerful SIMD Units
◼ SIMD instructions become more powerful
◼ One example is the Intel® Xeon Phi™ Coprocessor
vmovapd dest, source{dacb}

[512-bit move with swizzle: the lanes of source are permuted according to the {dacb} pattern before being written to dest]

9 Advanced OpenMP
Auto-vectorization
◼ Compilers offer auto-vectorization as an optimization pass
→Usually part of the general loop optimization passes
→Code analysis detects code properties that inhibit SIMD vectorization
→Heuristics determine if SIMD execution might be beneficial
?
→If all goes well, the compiler will generate SIMD instructions

◼ Example: Intel® Composer XE


→-vec (automatically enabled with -O2)
→-qopt-report

10 Advanced OpenMP
Why Auto-vectorizers Fail
◼ Data dependencies
◼ Other potential reasons
→Alignment
→Function calls in loop block
→Complex control flow / conditional branches
→Loop not “countable”
→e.g., upper bound not a runtime constant
→Mixed data types
→Non-unit stride between elements
→Loop body too complex (register pressure)
→Vectorization seems inefficient
◼ Many more … but less likely to occur

11 Advanced OpenMP
Data Dependencies
◼ Suppose two statements S1 and S2
◼ S2 depends on S1, iff S1 must execute before S2
→Control-flow dependence
→Data dependence
→Dependencies can be carried over between loop iterations
◼ Important flavors of data dependencies

FLOW dependence:
s1: a = 40
    b = 21
s2: c = a + 2

ANTI dependence:
    b = 40
s1: a = b + 1
s2: b = 21

12 Advanced OpenMP
Loop-Carried Dependencies
◼ Dependencies may occur across loop iterations
→Loop-carried dependency
◼ The following code contains such a dependency:
void lcd_ex(float* a, float* b, size_t n, float c1, float c2)
{
size_t i;
for (i = 0; i < n; i++) {
a[i] = c1 * a[i + 17] + c2 * b[i];
}
}
Loop-carried dependency between a[i] and a[i+17]; the distance is 17.

◼ Some iterations of the loop have to complete before the next iteration can run
→ Simple trick: Can you reverse the loop w/o getting wrong results?

13 Advanced OpenMP
Loop-carried Dependencies
◼ Can we parallelize or vectorize the loop?
void lcd_ex(float* a, float* b, size_t n, float c1, float c2) {
for (int i = 0; i < n; i++) {
a[i] = c1 * a[i + 17] + c2 * b[i];
} }

[Figure: iterations 0-3 assigned to Thread 1 and iterations 17-20 assigned to Thread 2 conflict through the a[i+17] accesses]

→ Parallelization: no
(except for very specific loop schedules)
→ Vectorization: yes
(iff vector length is shorter than any distance of any dependency)

14 Advanced OpenMP
Example: Loop not Countable
◼ “Loop not Countable” plus “Assumed Dependencies”

typedef struct {
float* data;
size_t size;
} vec_t;

void vec_eltwise_product(vec_t* a, vec_t* b, vec_t* c) {


size_t i;
for (i = 0; i < a->size; i++) {
c->data[i] = a->data[i] * b->data[i];
}
}

15 Advanced OpenMP
In a Time Before OpenMP 4.0
◼ Support required vendor-specific extensions
→Programming models (e.g., Intel® Cilk Plus)
→Compiler pragmas (e.g., #pragma vector)
→Low-level constructs (e.g., _mm_add_pd())

#pragma omp parallel for
#pragma vector always
#pragma ivdep
for (int i = 0; i < N; i++) {
  a[i] = b[i] + ...;
}

You need to trust your compiler to do the “right” thing.

16 Advanced OpenMP
SIMD Loop Construct
◼ Vectorize a loop nest
→Cut loop into chunks that fit a SIMD vector register
→No parallelization of the loop body

◼ Syntax (C/C++)
#pragma omp simd [clause[[,] clause],…]
for-loops

◼ Syntax (Fortran)
!$omp simd [clause[[,] clause],…]
do-loops
[!$omp end simd]
17 Advanced OpenMP
Example
float sprod(float *a, float *b, int n) {
float sum = 0.0f;
#pragma omp simd reduction(+:sum)
for (int k=0; k<n; k++)
sum += a[k] * b[k];
return sum;
}


18 Advanced OpenMP
Data Sharing Clauses
◼ private(var-list):
Uninitialized vectors for variables in var-list
x: 42 ? ? ? ?

◼ firstprivate(var-list):
Initialized vectors for variables in var-list
x: 42 42 42 42 42

◼ reduction(op:var-list):
Create private variables for var-list and apply reduction operator op at the end of the construct

12 5 8 17 x: 42

19 Advanced OpenMP
SIMD Loop Clauses
◼ safelen (length)
→Maximum number of iterations that can run concurrently without breaking a
dependence
→In practice, maximum vector length
◼ linear (list[:linear-step])
→The variable’s value is in relationship with the iteration number
→xi = xorig + i * linear-step
◼ aligned (list[:alignment])
→Specifies that the list items have a given alignment
→Default is alignment for the architecture
◼ collapse (n)

20 Advanced OpenMP
SIMD Worksharing Construct
◼ Parallelize and vectorize a loop nest
→Distribute a loop’s iteration space across a thread team
→Subdivide loop chunks to fit a SIMD vector register

◼ Syntax (C/C++)
#pragma omp for simd [clause[[,] clause],…]
for-loops

◼ Syntax (Fortran)
!$omp do simd [clause[[,] clause],…]
do-loops
[!$omp end do simd [nowait]]
21 Advanced OpenMP
Example
float sprod(float *a, float *b, int n) {
float sum = 0.0f;
#pragma omp for simd reduction(+:sum)
for (int k=0; k<n; k++)
sum += a[k] * b[k];
return sum;
}

[Figure: the iteration space is first parallelized across threads 0-2, then each chunk is vectorized, possibly leaving a peel loop and a remainder loop]

22 Advanced OpenMP
Be Careful What You Wish For…
float sprod(float *a, float *b, int n) {
float sum = 0.0f;
#pragma omp for simd reduction(+:sum) \
schedule(static, 5)
for (int k=0; k<n; k++)
sum += a[k] * b[k];
return sum;
}

◼ You should choose chunk sizes that are multiples of the SIMD length
→ Remainder loops are not triggered
→ Likely better performance
◼ In the above example …
→ with AVX2, the code will only execute the remainder loop!
→ with SSE, the code will have one iteration in the SIMD loop plus one in the remainder loop!

23 Advanced OpenMP
OpenMP 4.5 Simplifies SIMD Chunks
float sprod(float *a, float *b, int n) {
float sum = 0.0f;
#pragma omp for simd reduction(+:sum) \
schedule(simd: static, 5)
for (int k=0; k<n; k++)
sum += a[k] * b[k];
return sum;
}

◼ Chooses chunk sizes that are multiples of the SIMD length


→First and last chunk may be slightly different to fix alignment and to handle
loops that are not exact multiples of SIMD width
→Remainder loops are not triggered
→Likely better performance
24 Advanced OpenMP
SIMD Function Vectorization

float min(float a, float b) {


return a < b ? a : b;
}

float distsq(float x, float y) {


return (x - y) * (x - y);
}

void example() {
#pragma omp parallel for simd
for (i=0; i<N; i++) {
d[i] = min(distsq(a[i], b[i]), c[i]);
} }

25 Advanced OpenMP
SIMD Function Vectorization
◼ Declare one or more functions to be compiled for calls from a SIMD-
parallel loop

◼ Syntax (C/C++):
#pragma omp declare simd [clause[[,] clause],…]
[#pragma omp declare simd [clause[[,] clause],…]]
[…]
function-definition-or-declaration

◼ Syntax (Fortran):
!$omp declare simd (proc-name-list)

26 Advanced OpenMP
SIMD Function Vectorization
#pragma omp declare simd
float min(float a, float b) {
  return a < b ? a : b;
}
// generated vector variant:
// _ZGVZN16vv_min(%zmm0, %zmm1):
//   vminps %zmm1, %zmm0, %zmm0
//   ret

#pragma omp declare simd
float distsq(float x, float y) {
  return (x - y) * (x - y);
}
// generated vector variant:
// _ZGVZN16vv_distsq(%zmm0, %zmm1):
//   vsubps %zmm0, %zmm1, %zmm2
//   vmulps %zmm2, %zmm2, %zmm0
//   ret

void example() {
  #pragma omp parallel for simd
  for (i=0; i<N; i++) {
    d[i] = min(distsq(a[i], b[i]), c[i]);
  }
}
// generated loop body:
//   vmovups (%r14,%r12,4), %zmm0
//   vmovups (%r13,%r12,4), %zmm1
//   call _ZGVZN16vv_distsq
//   vmovups (%rbx,%r12,4), %zmm1
//   call _ZGVZN16vv_min

27 Advanced OpenMP
SIMD Function Vectorization
◼ simdlen (length)
→ generate function to support a given vector length
◼ uniform (argument-list)
→ argument has a constant value between the iterations of a given loop
◼ inbranch
→ function always called from inside an if statement
◼ notinbranch
→ function never called from inside an if statement
◼ linear (argument-list[:linear-step])
◼ aligned (argument-list[:alignment])
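
A hedged sketch of uniform, linear and simdlen on a declare simd function (the function names and the loop are illustrative assumptions):

#pragma omp declare simd uniform(a, scale) linear(i:1) simdlen(8) notinbranch
float scale_elem(const float *a, int i, float scale) {
  // 'a' and 'scale' are the same for all SIMD lanes (uniform),
  // 'i' increases by 1 from lane to lane (linear)
  return a[i] * scale;
}

void scale_all(const float *a, float *b, int n, float s) {
  #pragma omp simd
  for (int i = 0; i < n; i++)
    b[i] = scale_elem(a, i, s);
}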

28 Advanced OpenMP
inbranch & notinbranch
#pragma omp declare simd inbranch
float do_stuff(float x) {
  /* do something */
  return x * 2.0;
}

void example() {
  #pragma omp simd
  for (int i = 0; i < N; i++)
    if (a[i] < 0.0)
      b[i] = do_stuff(a[i]);
}

// conceptual masked vector variant generated by the compiler:
vec8 do_stuff_v(vec8 x, mask m) {
  /* do something */
  vmulpd x{m}, 2.0, tmp
  return tmp;
}

// conceptual vectorized loop:
for (int i = 0; i < N; i+=8) {
  vcmp_lt &a[i], 0.0, mask
  b[i] = do_stuff_v(&a[i], mask);
}

29 Advanced OpenMP
SIMD Constructs & Performance
[Figure: relative speed-up (higher is better) of ICC auto-vectorization vs. the ICC SIMD directive on Mandelbrot, Volume Rendering, BlackScholes, Fast Walsh, Perlin Noise and SGpp; the SIMD-directive versions reach between 1.47x and 4.34x]

M.Klemm, A.Duran, X.Tian, H.Saito, D.Caballero, and X.Martorell. Extending OpenMP with Vector Constructs for Modern
Multicore SIMD Architectures. In Proc. of the Intl. Workshop on OpenMP, pages 59-72, Rome, Italy, June 2012. LNCS 7312.

30 Advanced OpenMP
OpenMP: Memory Access

1 Advanced OpenMP
Example: Loop Parallelization
◼ Assume the following: you have learned that load imbalances can
severely impact performance and a dynamic loop schedule may
prevent this:
→What is the issue with the following code:
double* A;
A = (double*) malloc(N * sizeof(double));
/* assume some initialization of A */

#pragma omp parallel for schedule(dynamic, 1)


for (int i = 0; i < N; i++) {
A[i] += 1.0;
}
→How is A accessed? Does that affect performance?
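
One hedged mitigation (assuming 64-byte cache lines and 8-byte doubles) is to keep the dynamic schedule but hand out chunks that cover whole cache lines:

// 8 doubles = 64 bytes, i.e. each chunk occupies (at least) one full cache line
#pragma omp parallel for schedule(dynamic, 8)
for (int i = 0; i < N; i++) {
  A[i] += 1.0;
}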

2 Advanced OpenMP
False Sharing

◼ False Sharing: parallel accesses to the same cache line may have a significant performance impact!

[Figure: four cores concurrently updating A[0], A[1], A[2] and A[3], which all live in the same cache line of A[0-7]]

Caches are organized in lines of typically 64 bytes: the integer array a[0-4] fits into one cache line.
Whenever one element of a cache line is updated, the whole cache line is invalidated.
Local copies of a cache line have to be re-loaded from main memory and the computation may have to be repeated.

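A hedged sketch of one classic fix: pad per-thread data so that each thread writes to its own cache line (64-byte lines and at most MAX_THREADS threads are assumptions; requires <omp.h>):

#define CACHE_LINE 64
#define MAX_THREADS 64

typedef struct {
  double value;
  char pad[CACHE_LINE - sizeof(double)];   // pad each counter to a full cache line
} padded_double;

double sum_array(const double *A, int N) {
  padded_double partial[MAX_THREADS];
  double sum = 0.0;
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    partial[tid].value = 0.0;
    #pragma omp for
    for (int i = 0; i < N; i++)
      partial[tid].value += A[i];          // each thread updates only its own cache line
  }
  for (int t = 0; t < omp_get_max_threads(); t++)
    sum += partial[t].value;
  return sum;
}

In practice a reduction(+:sum) clause achieves the same effect and lets the runtime take care of the padding.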
3 Advanced OpenMP
Non-uniform Memory
How To Distribute The Data ?

double* A;
A = (double*) malloc(N * sizeof(double));

for (int i = 0; i < N; i++) {
  A[i] = 0.0;
}

[Figure: two NUMA nodes, each with four cores, on-chip caches and local memory, connected by an interconnect]

4 Advanced OpenMP
Non-uniform Memory
◼ Serial code: all array elements are allocated in the memory of the NUMA node closest to the
core executing the initializer thread (first touch)

double* A;
A = (double*) malloc(N * sizeof(double));

for (int i = 0; i < N; i++) {
  A[i] = 0.0;
}

[Figure: the whole array A[0] … A[N] ends up in the memory of the NUMA node running the initializing thread]
5 Advanced OpenMP
About Data Distribution
◼ Important aspect on cc-NUMA systems
→If not optimal, longer memory access times and hotspots

◼ Placement comes from the Operating System


→This is therefore Operating System dependent

◼ Windows, Linux and Solaris all use the “First Touch” placement policy
by default
→May be possible to override default (check the docs)

6 Advanced OpenMP
Non-uniform Memory
◼ Serial code: all array elements are allocated in the memory of the NUMA node closest to the core executing the initializer thread (first touch)

(Same code and figure as on the previous "Non-uniform Memory" slide.)
7 Advanced OpenMP
First Touch Memory Placement
◼ First touch w/ parallel code: all array elements are allocated in the memory of the NUMA node that contains the core that executes the thread that initializes the respective partition

double* A;
A = (double*) malloc(N * sizeof(double));

omp_set_num_threads(2);

#pragma omp parallel for
for (int i = 0; i < N; i++) {
  A[i] = 0.0;
}

[Figure: A[0] … A[N/2] ends up in the memory of the first NUMA node, A[N/2] … A[N] in the memory of the second]


8 Advanced OpenMP
Serial vs. Parallel Initialization
◼ Stream example on a 2-socket system with Xeon X5675 processors, 12 OpenMP threads:

            copy        scale       add         triad
ser_init    18.8 GB/s   18.5 GB/s   18.1 GB/s   18.2 GB/s
par_init    41.3 GB/s   39.3 GB/s   40.3 GB/s   40.4 GB/s

ser_init: a[0,N-1], b[0,N-1] and c[0,N-1] are all placed in the memory attached to CPU 0, while threads T1-T12 run on both CPUs.
par_init: a, b and c are split; the first halves are placed in the memory attached to CPU 0 and the second halves in the memory attached to CPU 1, next to the threads that use them.

9 Advanced OpenMP
Get Info on the System Topology
◼ Before you design a strategy for thread binding, you should have a basic
understanding of the system topology. Please use one of the following
options on a target machine:
→Intel MPI's cpuinfo tool
→ cpuinfo

→Delivers information about the number of sockets (= packages) and the mapping of processor
ids to cpu cores that the OS uses.

→hwloc's hwloc-ls tool


→ hwloc-ls

→Displays a graphical representation of the system topology, separated into NUMA nodes, along
with the mapping of processor ids to cpu cores that the OS uses and additional info on caches.

10 Advanced OpenMP
Decide for Binding Strategy
◼ Selecting the "right" binding strategy depends not only on the topology,
but also on application characteristics.
→Putting threads far apart, i.e., on different sockets
→May improve aggregated memory bandwidth available to application

→May improve the combined cache size available to your application

→May decrease performance of synchronization constructs

→Putting threads close together, i.e., on two adjacent cores that possibly share
some caches
→May improve performance of synchronization constructs

→May decrease the available memory bandwidth and cache size


11 Advanced OpenMP
Places + Binding Policies (1/2)
◼ Define OpenMP Places
→ a place is a set of processors onto which OpenMP threads can be bound
→ can be defined by the user, e.g., OMP_PLACES=cores

◼ Define a set of OpenMP Thread Affinity Policies


→ SPREAD: spread OpenMP threads evenly among the places,
partition the place list
→ CLOSE: pack OpenMP threads near master thread
→ MASTER: collocate OpenMP thread with master thread

◼ Goals
→ user has a way to specify where to execute OpenMP threads
→ locality between OpenMP threads / less false sharing / memory bandwidth

12 Advanced OpenMP
Places
◼ Assume the following machine:
p0 p1 p2 p3 p4 p5 p6 p7

→ 2 sockets, 4 cores per socket, 4 hyper-threads per core

◼ Abstract names for OMP_PLACES:


→ threads: Each place corresponds to a single hardware thread on the target machine.
→ cores: Each place corresponds to a single core (having one or more hardware threads) on the
target machine.
→ sockets: Each place corresponds to a single socket (consisting of one or more cores) on the
target machine.

13 Advanced OpenMP
Places + Binding Policies (2/2)
◼ Example's Objective:
→separate cores for outer loop and near cores for inner loop
◼ Outer Parallel Region: proc_bind(spread) num_threads(4)
Inner Parallel Region: proc_bind(close) num_threads(4)
→spread creates the partition, close binds the threads within the respective partition
OMP_PLACES="{0,1,2,3},{4,5,6,7},..." = "{0:4}:8:4" = cores
#pragma omp parallel proc_bind(spread) num_threads(4)
#pragma omp parallel proc_bind(close) num_threads(4)
◼ Example
→initial: all 8 places p0 … p7 form one partition
→spread 4: the 4 outer threads are spread onto p0, p2, p4, p6; the place list is split into the
sub-partitions {p0,p1}, {p2,p3}, {p4,p5}, {p6,p7}
→close 4: within each sub-partition, the 4 inner threads are packed onto the adjacent places
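→A hedged sketch of this nested setup (run with, e.g., OMP_PLACES=cores; the printf is only
for illustration): each thread reports the place it was bound to.

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_max_active_levels(2);   /* enable the nested parallel region */

    #pragma omp parallel num_threads(4) proc_bind(spread)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(4) proc_bind(close)
        {
            printf("outer %d, inner %d: place %d of %d\n",
                   outer, omp_get_thread_num(),
                   omp_get_place_num(), omp_get_num_places());
        }
    }
    return 0;
}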

14 Advanced OpenMP
More Examples (1/3)
◼ Assume the following machine:
p0 p1 p2 p3 p4 p5 p6 p7

→2 sockets, 4 cores per socket, 4 hyper-threads per core

◼ Parallel Region with two threads, one per socket


→OMP_PLACES=sockets

→#pragma omp parallel num_threads(2) proc_bind(spread)

15 Advanced OpenMP
More Examples (2/3)
◼ Assume the following machine:
p0 p1 p2 p3 p4 p5 p6 p7

◼ Parallel Region with four threads, one per core, but only on the first
socket
→OMP_PLACES=cores

→#pragma omp parallel num_threads(4) proc_bind(close)

16 Advanced OpenMP
More Examples (3/3)
◼ Spread a nested loop first across two sockets, then among the cores
within each socket, only one thread per core
→OMP_PLACES=cores

→#pragma omp parallel num_threads(2) proc_bind(spread)

→#pragma omp parallel num_threads(4) proc_bind(close)

17 Advanced OpenMP
Places API (1/2)
◼ 1: Query information about binding and a single place of
all places with ids 0 … omp_get_num_places():

◼ omp_proc_bind_t omp_get_proc_bind(): returns the thread affinity policy
(omp_proc_bind_false, omp_proc_bind_true, omp_proc_bind_master, …)

◼ int omp_get_num_places(): returns the number of places

◼ int omp_get_place_num_procs(int place_num): returns the number of


processors in the given place

◼ void omp_get_place_proc_ids(int place_num, int* ids): returns the


ids of the processors in the given place

18 Advanced OpenMP
Places API (2/2)
◼ 2: Query information about the place partition:

◼ int omp_get_place_num(): returns the place number of the place to which the
current thread is bound

◼ int omp_get_partition_num_places(): returns the number of places in the


current partition

◼ void omp_get_partition_place_nums(int* pns): returns the list of place


numbers corresponding to the places in the current partition

19 Advanced OpenMP
Places API: Example
◼ Simple routine printing the processor ids of the place the calling thread
is bound to:
void print_binding_info() {
    int my_place = omp_get_place_num();
    int place_num_procs = omp_get_place_num_procs(my_place);

    printf("Place consists of %d processors: ", place_num_procs);

    int *place_processors = malloc(sizeof(int) * place_num_procs);
    omp_get_place_proc_ids(my_place, place_processors);

    for (int i = 0; i < place_num_procs; i++) {
        printf("%d ", place_processors[i]);
    }
    printf("\n");

    free(place_processors);
}
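→A possible usage sketch (assumptions: OMP_PLACES and OMP_PROC_BIND have been set, e.g. to
cores and close; the critical section only serializes the output lines):

/* print_binding_info() as defined above */
int main(void) {
    #pragma omp parallel
    {
        #pragma omp critical
        print_binding_info();
    }
    return 0;
}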

20 Advanced OpenMP
OpenMP 5.0 way to do this
◼ Set OMP_DISPLAY_AFFINITY=TRUE
→Instructs the runtime to display formatted affinity information

→Example output for two threads on two physical cores:


nesting_level= 1, thread_num= 0, thread_affinity= 0,1
nesting_level= 1, thread_num= 1, thread_affinity= 2,3

→Output can be formatted with OMP_AFFINITY_FORMAT env var or


corresponding routine

→Formatted affinity information can be printed with


omp_display_affinity(const char* format)

21 Advanced OpenMP
Affinity format specification
t omp_get_team_num() a omp_get_ancestor_thread_num() at level-1
T omp_get_num_teams() H hostname
L omp_get_level() P process identifier
n omp_get_thread_num() i native thread identifier
N omp_get_num_threads() A thread affinity: list of processors (cores)

◼ Example:
OMP_AFFINITY_FORMAT="Affinity: %0.3L %.8n %.15{A} %.12H"

→Possible output:
Affinity: 001 0 0-1,16-17 host003
Affinity: 001 1 2-3,18-19 host003
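→A hedged sketch of the corresponding runtime call (the format string is an illustrative
choice; passing NULL would fall back to the format set via OMP_AFFINITY_FORMAT):

#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* Each thread prints one affinity record using an explicit format:
           %n = thread number, %{thread_affinity} = list of processors, %H = host. */
        omp_display_affinity("thread %0.2n bound to %{thread_affinity} on %H");
    }
    return 0;
}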

22 Advanced OpenMP
A first summary
◼ Everything under control?
◼ In principle Yes, but only if
→threads can be bound explicitly,
→data can be placed well by first-touch, or can be migrated,
→you focus on a specific platform (= OS + arch) → no portability

◼ What if the data access pattern changes over time?

◼ What if you use more than one level of parallelism?

23 Advanced OpenMP
NUMA Strategies: Overview
◼ First Touch: Modern operating systems (e.g., Linux >= 2.4) decide on the
physical location of a memory page during the first page fault, when
the page is first "touched", and put it close to the CPU causing the
page fault.

◼ Explicit Migration: Selected regions of memory (pages) are moved
from one NUMA node to another via an explicit OS syscall.

◼ Next Touch: Binding of pages to NUMA nodes is removed and pages
are migrated to the location of the next "touch". Well supported in
Solaris, expensive to implement in Linux.

◼ Automatic Migration: No support for this in current operating systems.

24 Advanced OpenMP
User Control of Memory Affinity
◼ Explicit NUMA-aware memory allocation:
→By carefully touching data by the thread which later uses it
→By changing the default memory allocation strategy
→Linux: numactl command
→Windows: VirtualAllocExNuma() (limited functionality)

→By explicit migration of memory pages


→Linux: move_pages()
→Windows: no option

◼ Example: using numactl to distribute pages round-robin:


→ numactl --interleave=all ./a.out
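→A hedged Linux-only sketch of explicit page migration via move_pages() (link with -lnuma;
buffer size, page arithmetic and the target node 1 are illustrative, not from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <numaif.h>

int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);
    size_t bytes = 64 * 1024 * 1024;
    size_t npages = (bytes + page_size - 1) / page_size;

    char *buf = aligned_alloc(page_size, npages * page_size);
    for (size_t i = 0; i < bytes; i++) buf[i] = 0;      /* first touch happens here */

    void **pages  = malloc(npages * sizeof(void *));
    int   *nodes  = malloc(npages * sizeof(int));
    int   *status = malloc(npages * sizeof(int));
    for (size_t i = 0; i < npages; i++) {
        pages[i] = buf + i * page_size;
        nodes[i] = 1;                  /* target NUMA node (illustrative) */
    }

    /* pid 0 = calling process; MPOL_MF_MOVE moves pages owned by this process. */
    if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");

    free(status); free(nodes); free(pages); free(buf);
    return 0;
}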

25 Advanced OpenMP
Improving Tasking Performance:
Task Affinity

26 Advanced OpenMP
Motivation
◼ Techniques for process binding & thread pinning available
→OpenMP thread level: OMP_PLACES & OMP_PROC_BIND

→OS functionality: taskset -c

OpenMP Tasking:
◼ In general: Tasks may be executed by any thread in the team
→Missing task-to-data affinity may have detrimental effect on performance

OpenMP 5.0:
◼ affinity clause to express affinity to data

27 Advanced OpenMP
affinity clause
◼ New clause: #pragma omp task affinity (list)
→Hint to the runtime to execute task closely to physical data location

→Clear separation between dependencies and affinity

◼ Expectations:
→Improve data locality / reduce remote memory accesses

→Decrease runtime variability

◼ Still expect task stealing


→In particular, if a thread is under-utilized
28 Advanced OpenMP
Code Example
◼ Excerpt from task-parallel STREAM
#pragma omp task \
    shared(a, b, c, scalar) \
    firstprivate(tmp_idx_start, tmp_idx_end) \
    affinity( a[tmp_idx_start] )
{
    int i;
    for (i = tmp_idx_start; i <= tmp_idx_end; i++)
        a[i] = b[i] + scalar * c[i];
}

→Loops have been blocked manually (see tmp_idx_start/end)

→Assumption: initialization and computation have same blocking and same affinity
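→A hedged sketch of the matching initialization tasks (same variable names as the excerpt
above; the blocking into tmp_idx_start/tmp_idx_end is assumed to be computed identically in
both phases):

#pragma omp task \
    shared(a, b, c) \
    firstprivate(tmp_idx_start, tmp_idx_end) \
    affinity( a[tmp_idx_start] )
{
    /* First touch and later computation use the same blocking and the same
       affinity hint, so the runtime can place both on the same NUMA domain. */
    for (int i = tmp_idx_start; i <= tmp_idx_end; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
        c[i] = 0.5;
    }
}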

29 Advanced OpenMP
Selected LLVM implementation details
[Flowchart: when a task is encountered, tasks without a data affinity hint are pushed to the
encountering thread's local queue. For tasks with affinity, the runtime checks whether the data
reference is already in a map; if not, it identifies the NUMA domain where the data is stored
and saves {reference, location} in the map. The task is then pushed into the queue of a thread
pinned to that NUMA domain.]

→A map is introduced to store location information of data that was previously used.

Jannis Klinkenberg, Philipp Samfass, Christian Terboven, Alejandro Duran, Michael Klemm, Xavier
Teruel, Sergi Mateo, Stephen L. Olivier, and Matthias S. Müller. Assessing Task-to-Data
Affinity in the LLVM OpenMP Runtime. In Proc. of the 14th International Workshop on OpenMP,
IWOMP 2018, Barcelona, Spain, September 26-28, 2018.

30 Advanced OpenMP
Evaluation
[Charts: program runtime (median of 10 runs) and the distribution of single-task execution
times, with and without the affinity clause; a speedup of 4.3x is observed.]

LIKWID: reduction of remote data volume from 69% to 13%


31 Advanced OpenMP
Summary
◼ Requirement for this feature: thread affinity enabled

◼ The affinity clause helps, if


→tasks access data heavily

→there is a single task creator, or tasks are not created close to their data

→high load imbalance among the tasks

◼ Different from thread binding: task stealing is absolutely allowed

32 Advanced OpenMP
Managing Memory Spaces

33 Advanced OpenMP
Different kinds of memory
◼ Traditional DDR-based memory
◼ High-bandwidth memory
◼ Non-volatile memory
◼…

34 Advanced OpenMP
Memory Management
◼ Allocator := an OpenMP object that fulfills requests to allocate and
deallocate storage for program variables

◼ OpenMP allocators are of type omp_allocator_handle_t

◼ Default allocator for Host


→via OMP_ALLOCATOR env. var. or corresponding API

◼ OpenMP 5.0 supports a set of memory allocators

35 Advanced OpenMP
OpenMP Allocators
◼ Selection of a certain kind of memory
Allocator name Storage selection intent
omp_default_mem_alloc use default storage
omp_large_cap_mem_alloc use storage with large capacity
omp_const_mem_alloc use storage optimized for read-only variables
omp_high_bw_mem_alloc use storage with high bandwidth
omp_low_lat_mem_alloc use storage with low latency
omp_cgroup_mem_alloc use storage close to all threads in the contention group
of the thread requesting the allocation
omp_pteam_mem_alloc use storage that is close to all threads in the same
parallel region of the thread requesting the allocation
omp_thread_local_mem_alloc use storage that is close to the thread requesting the
allocation

36 Advanced OpenMP
Using OpenMP Allocators
◼ New clause on all constructs with data sharing clauses:
→ allocate( [allocator:] list )
◼ Allocation:
→ omp_alloc(size_t size, omp_allocator_handle_t allocator)

◼ Deallocation:
→ omp_free(void *ptr, const omp_allocator_handle_t allocator)

→ allocator argument is optional


◼ allocate directive: standalone directive that specifies how the storage for the listed
variables (e.g., of an associated declaration) is allocated
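→A hedged sketch combining both mechanisms (the choice of omp_high_bw_mem_alloc and the array
size are illustrative; with the default fallback trait the allocation silently falls back to
default memory if no high-bandwidth memory is available):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 1024;

    /* API allocation from a pre-defined allocator */
    double *tmp = (double *) omp_alloc(n * sizeof(double), omp_high_bw_mem_alloc);

    double sum = 0.0;
    double x = 0.0;
    /* allocate clause: the private copies of x are allocated via the given allocator */
    #pragma omp parallel for firstprivate(x) allocate(omp_high_bw_mem_alloc: x) \
                             reduction(+:sum)
    for (int i = 0; i < n; i++) {
        x = (double) i;
        tmp[i] = x;
        sum += x;
    }

    printf("sum = %f\n", sum);
    omp_free(tmp, omp_high_bw_mem_alloc);
    return 0;
}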

37 Advanced OpenMP
OpenMP Allocator Traits / 1
◼ Allocator traits control the behavior of the allocator
sync_hint   contended, uncontended, serialized, private        (default: contended)
alignment   positive integer value that is a power of two      (default: 1 byte)
access      all, cgroup, pteam, thread                         (default: all)
pool_size   positive integer value
fallback    default_mem_fb, null_fb, abort_fb, allocator_fb    (default: default_mem_fb)
fb_data     an allocator handle
pinned      true, false                                        (default: false)
partition   environment, nearest, blocked, interleaved         (default: environment)
38 Advanced OpenMP
OpenMP Allocator Traits / 2

◼ fallback: describes the behavior if the allocation cannot be fulfilled


→default_mem_fb: return system’s default memory

→Other options: null, abort, or use different allocator

◼ pinned: request pinned memory, i.e. for GPUs

39 Advanced OpenMP
OpenMP Allocator Traits / 3
◼ partition: partitioning of allocated memory of physical storage
resources (think of NUMA)
→environment: use system’s default behavior

→nearest: the closest memory

→blocked: partitioning into blocks of approximately the same size, with at most one block per
storage resource

→interleaved: partitioning in a round-robin fashion across the storage resources

40 Advanced OpenMP
OpenMP Allocator Traits / 4
◼ Construction of allocators with traits via
→omp_allocator_handle_t omp_init_allocator(
omp_memspace_handle_t memspace,
int ntraits, const omp_alloctrait_t traits[]);

→Selection of memory space mandatory

→Empty traits set: use defaults


◼ Allocators have to be destroyed with omp_destroy_allocator()
◼ Custom allocator can be made default with
omp_set_default_allocator(omp_allocator_handle_t allocator)
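→A hedged sketch (memory space and trait values are illustrative): building a custom allocator
on the high-bandwidth memory space, using it, and destroying it again.

#include <omp.h>

int main(void) {
    omp_alloctrait_t traits[] = {
        { omp_atk_fallback,  omp_atv_default_mem_fb },  /* fall back to default memory */
        { omp_atk_partition, omp_atv_nearest        },  /* place data in the nearest storage */
        { omp_atk_alignment, 64                     }   /* align to a cache line */
    };

    omp_allocator_handle_t hbw_alloc =
        omp_init_allocator(omp_high_bw_mem_space, 3, traits);

    double *v = (double *) omp_alloc(1024 * sizeof(double), hbw_alloc);
    /* ... use v ... */
    omp_free(v, hbw_alloc);

    omp_destroy_allocator(hbw_alloc);
    return 0;
}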

41 Advanced OpenMP
OpenMP Memory Spaces
◼ Storage resources with explicit support in OpenMP:
omp_default_mem_space System’s default memory resource
omp_large_cap_mem_space Storage with large(r) capacity
omp_const_mem_space Storage optimized for variables with constant value
omp_high_bw_mem_space Storage with high bandwidth
omp_low_lat_mem_space Storage with low latency

→Exact selection of the memory space is implementation-defined

→Pre-defined allocators available to work with these

42 Advanced OpenMP
