OpenMP Workshop Day 2
Advanced OpenMP
Sudoku for Lazy Computer Scientists
◼ Let's solve Sudoku puzzles with brute multi-core force
◼ (1) Search an empty field
Parallel Brute-force Sudoku
◼ This parallel algorithm finds all valid solutions
◼ (1) Search an empty field
◼ (2) Try all numbers:
◼ (2a) Check Sudoku
◼ If invalid: skip
◼ If valid: Go to next field
→ The first call is contained in a #pragma omp parallel / #pragma omp single, such that one task starts the execution of the algorithm
→ Each #pragma omp task needs to work on a new copy of the Sudoku board
Performance Evaluation
Sudoku on 2x Intel Xeon E5-2650 @2.0 GHz
[Chart: runtime in seconds (16x16 puzzle) and speedup for 1-32 threads; Intel C++ 13.1, scatter binding.]
Tasking Overview
What is a task in OpenMP?
◼ Tasks are work units whose execution
→ may be deferred, or
→ may be executed immediately
◼ Tasks are created ...
→ when encountering a taskloop construct: explicit tasks are created per chunk
Tasking execution model
◼ Supports unstructured parallelism
→ unbounded loops
    while ( <expr> ) {
      ...
    }
→ recursive functions
    void myfunc( <args> )
    {
      ...; myfunc( <newargs> ); ...;
    }

◼ Example (unstructured parallelism)
#pragma omp parallel
#pragma omp master
while (elem != NULL) {
  #pragma omp task
  compute(elem);
  elem = elem->next;
}
The task construct
◼ Deferring (or not) a unit of work (executable for any member of the team)

C/C++:
#pragma omp task [clause[[,] clause]...]
  {structured-block}

Fortran:
!$omp task [clause[[,] clause]...]
  ...structured-block...
!$omp end task
Task scheduling: tied vs. untied tasks
◼ Tasks are tied by default (when no untied clause is present)
→ tied tasks are always executed by the same thread (not necessarily their creator)
→ tied tasks mix badly with thread-based features: thread-id, threadprivate, critical regions, ...
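A minimal sketch of the untied clause (long_running_work() is a placeholder, not from the original deck): after a task scheduling point, an untied task may be resumed by a different thread, so it must not rely on omp_get_thread_num() or threadprivate state.

#pragma omp task untied
{
    long_running_work();   // may be suspended at a TSP ...
    #pragma omp taskyield
    long_running_work();   // ... and resumed by a different thread
}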
Task scheduling: the taskyield directive
◼ Task scheduling points (and the taskyield directive)
→ tasks can be suspended/resumed at task scheduling points (TSPs)
→ some additional constraints apply to avoid deadlock problems
→ when a thread becomes idle, it picks up one of the highest-priority tasks
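A classic use of taskyield, sketched here with the standard OpenMP lock API (something_useful() and something_critical() are placeholders): instead of blocking on a busy lock, the task yields so that its thread can execute other tasks in the meantime.

#include <omp.h>

void foo(omp_lock_t *lock, int n)
{
    for (int i = 0; i < n; i++)
        #pragma omp task
        {
            something_useful();
            while (!omp_test_lock(lock)) {
                #pragma omp taskyield   // let the thread run other tasks
            }
            something_critical();       // the lock is held here
            omp_unset_lock(lock);
        }
}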
Task synchronization: the taskwait directive
◼ The taskwait directive (shallow task synchronization)
→ It is a stand-alone directive
#pragma omp taskwait
→ waits for the completion of the child tasks of the current task; just direct children, not all descendant tasks;
includes an implicit task scheduling point (TSP)
→ child tasks are also completed at all other implicit barriers: parallel, sections, for, single, etc.
Task synchronization: the taskgroup construct
◼ The taskgroup construct (deep task synchronization)
→ attached to a structured block; waits for the completion of all descendant tasks of the current task; TSP at the end
#pragma omp taskgroup [clause[[,] clause]...]
{structured-block}
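A minimal sketch of the shallow/deep difference (work_a() and work_b() are placeholders): taskgroup also waits for the grandchild task, while a taskwait in the same position would only wait for the direct child.

#pragma omp taskgroup
{
    #pragma omp task          // child
    {
        #pragma omp task      // grandchild: taskgroup waits for it,
        work_b();             // a plain taskwait would not
        work_a();
    }
}   // all descendant tasks have completed here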
Explicit data-sharing clauses
◼ Explicit data-sharing clauses (shared, private and firstprivate)

#pragma omp task shared(a)
{
  // Scope of a: shared
}
#pragma omp task private(b)
{
  // Scope of b: private
}
#pragma omp task firstprivate(c)
{
  // Scope of c: firstprivate
}

◼ default(none): the compiler will issue an error if an attribute is not explicitly set by the programmer (very useful!!!)

#pragma omp task default(shared)
{
  // Scope of all references not explicitly included in any
  // other data-sharing clause, and with no pre-determined
  // attribute: shared
}
#pragma omp task default(none)
{
  // Compiler forces you to specify the scope for every
  // single variable referenced in the context
}

Hint: Use default(none) to be forced to think about every variable if you do not see it clearly.
Pre-determined data-sharing attributes
◼ threadprivate variables are threadprivate (1)
◼ dynamic storage duration objects are shared (malloc, new, ...) (2)
◼ static data members are shared (3)
◼ variables declared inside the construct
→ static storage duration variables are shared (4)
→ automatic storage duration variables are private (5)
◼ the loop iteration variable(s) ...

#pragma omp task
{
  int x = MN;
  // Scope of x: private (5)
}
#pragma omp task
{
  static int y;
  // Scope of y: shared (4)
}
Implicit data-sharing attributes (in practice)
◼ Precedence of data-sharing rules for the task region:
→ pre-determined rules (cannot be changed)
→ explicit data-sharing clauses (+ default)
→ implicit data-sharing rules
◼ Implicit rules:
→ the shared attribute is lexically inherited
→ in any other case the variable is firstprivate
Task reductions (using taskgroup)
◼ Reduction operation
→ performs some forms of recurrence calculations
→ associative and commutative operators

◼ The (taskgroup) scoping reduction clause
#pragma omp taskgroup task_reduction(op: list)
{structured-block}
→ Registers a new reduction at [1]
→ Computes the final result after [3]

◼ The (task) in_reduction clause [participating]
#pragma omp task in_reduction(op: list)
{structured-block}
→ Task participates in a reduction operation [2]

int res = 0;
node_t* node = NULL;
...
#pragma omp parallel
{
  #pragma omp single
  {
    #pragma omp taskgroup task_reduction(+: res)
    { // [1]
      while (node) {
        #pragma omp task in_reduction(+: res) \
                         firstprivate(node)
        { // [2]
          res += node->value;
        }
        node = node->next;
      }
    } // [3]
  }
}
Task reductions (+ modifiers)
◼ Reduction modifiers
→ Former reduction clauses have been extended
→ the task modifier allows expressing task reductions
→ Registers a new task reduction [1]
→ Implicit tasks participate in the reduction [2]
→ Computes the final result after [4]

◼ The (task) in_reduction clause [participating]
#pragma omp task in_reduction(op: list)
{structured-block}
→ Task participates in a reduction operation [3]

int res = 0;
node_t* node = NULL;
...
#pragma omp parallel reduction(task,+: res)
{ // [1][2]
  #pragma omp single
  {
    #pragma omp taskgroup
    {
      while (node) {
        #pragma omp task in_reduction(+: res) \
                         firstprivate(node)
        { // [3]
          res += node->value;
        }
        node = node->next;
      }
    }
  }
} // [4]
Tasking illustrated
Fibonacci illustrated
int main(int argc, char* argv[])
{
  [...]
  #pragma omp parallel
  {
    #pragma omp single
    {
      fib(input);
    }
  }
  [...]
}

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  #pragma omp task shared(x)
  {
    x = fib(n - 1);
  }
  #pragma omp task shared(y)
  {
    y = fib(n - 2);
  }
  #pragma omp taskwait
  return x + y;
}

◼ Only one task/thread enters fib() from main(); it is responsible for creating the two initial work tasks
◼ The taskwait is required, as otherwise x and y would get lost
◼ T1 enters fib(4)
◼ T1 creates tasks for fib(3) and fib(2)
◼ T1 and T2 execute tasks from the queue
◼ T1 and T2 create 4 new tasks
◼ T1 - T4 execute tasks
◼ ...

[Figure: task tree for fib(4) — fib(4) spawns fib(3) and fib(2); these in turn spawn fib(2), fib(1) and fib(1), fib(0); the task queue holds the not-yet-executed tasks.]
The taskloop Construct
Tasking use case: saxpy (taskloop)
for (i = 0; i < SIZE; i += 1) {
  A[i] = A[i]*B[i]*S;
}
◼ Difficult to determine the grain
→ one single iteration → too fine
→ the whole loop → no parallelism

◼ Manually transform the code
→ blocking techniques
for (i = 0; i < SIZE; i += TS) {
  UB = SIZE < (i+TS) ? SIZE : i+TS;
  for (ii = i; ii < UB; ii++) {
    A[ii] = A[ii]*B[ii]*S;
  }
}

Manually taskified version:
#pragma omp parallel
#pragma omp single
for (i = 0; i < SIZE; i += TS) {
  UB = SIZE < (i+TS) ? SIZE : i+TS;
  #pragma omp task private(ii) \
                   firstprivate(i,UB) shared(S,A,B)
  for (ii = i; ii < UB; ii++) {
    A[ii] = A[ii]*B[ii]*S;
  }
}

◼ Improving programmability: OpenMP taskloop
#pragma omp taskloop grainsize(TS)
for (i = 0; i < SIZE; i += 1) {
  A[i] = A[i]*B[i]*S;
}
→ Hides the internal details
→ Grain size ~ tile size (TS), but the implementation decides the exact grain size
The taskloop Construct
◼ Task-generating construct: decomposes a loop into chunks and creates a task for each loop chunk

C/C++:
#pragma omp taskloop [clause[[,] clause]…]
  {structured-for-loops}

Fortran:
!$omp taskloop [clause[[,] clause]…]
  ...structured-do-loops...
!$omp end taskloop
Worksharing vs. taskloop constructs (2/2)

subroutine worksharing
  integer :: x
  integer :: i
  integer, parameter :: T = 16
  integer, parameter :: N = 1024
  x = 0
  !$omp parallel shared(x) num_threads(T)
  !$omp do
  do i = 1,N
    !$omp atomic
    x = x + 1
    !$omp end atomic
  end do
  !$omp end do
  !$omp end parallel
  write (*,'(A,I0)') 'x = ', x
end subroutine

subroutine taskloop
  integer :: x
  integer :: i
  integer, parameter :: T = 16
  integer, parameter :: N = 1024
  x = 0
  !$omp parallel shared(x) num_threads(T)
  !$omp single
  !$omp taskloop
  do i = 1,N
    !$omp atomic
    x = x + 1
    !$omp end atomic
  end do
  !$omp end taskloop
  !$omp end single
  !$omp end parallel
  write (*,'(A,I0)') 'x = ', x
end subroutine
Taskloop decomposition approaches
◼ Clause: grainsize(grain-size)
→ Chunks have at least grain-size iterations
→ Chunks have at most 2x grain-size iterations

int TS = 4 * 1024;
#pragma omp taskloop grainsize(TS)
for (i = 0; i < SIZE; i += 1) {
  A[i] = A[i]*B[i]*S;
}

◼ Clause: num_tasks(num-tasks)
→ Creates num-tasks chunks
→ Each chunk must have at least one iteration

int NT = 4 * omp_get_num_threads();
#pragma omp taskloop num_tasks(NT)
for (i = 0; i < SIZE; i += 1) {
  A[i] = A[i]*B[i]*S;
}

◼ If neither clause is present, the number of chunks and the number of iterations per chunk are implementation defined
◼ Additional considerations:
→ The order of creation of the loop tasks is unspecified
→ taskloop creates an implicit taskgroup region; with nogroup, no implicit taskgroup region is created
Collapsing iteration spaces with taskloop
◼ The collapse clause on the taskloop construct

#pragma omp taskloop collapse(n)
{structured-for-loops}

→ n: the number of loops associated with the taskloop construct
→ The associated loops are collapsed into one larger iteration space

#pragma omp taskloop collapse(2)
for (i = 0; i < SX; i += 1) {
  for (j = 0; j < SY; j += 1) {
    for (k = 0; k < SZ; k += 1) {
      A[f(i,j,k)] = <expression>;
    }
  }
}
Task reductions (using taskloop)
◼ Clause: reduction(r-id: list)
→ Defines the scope of a new reduction
→ All created tasks participate in the reduction
→ Cannot be used with the nogroup clause

double dotprod(int n, double *x, double *y) {
  double r = 0.0;
  #pragma omp taskloop reduction(+: r)
  for (i = 0; i < n; i++)
    r += x[i] * y[i];
  return r;
}
Composite construct: taskloop simd
◼ Task-generating construct: decomposes a loop into chunks, creates a task for each loop chunk
◼ Each generated task applies SIMD (internally) to its loop chunk
→ C/C++ syntax:
#pragma omp taskloop simd [clause[[,] clause]…]
  {structured-for-loops}
→ Fortran syntax:
!$omp taskloop simd [clause[[,] clause]…]
  ...structured-do-loops...
!$omp end taskloop simd
Improving Tasking Performance:
Task dependences
Motivation
◼ Task dependences as a way to define task-execution constraints

OpenMP 3.1:
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task
  std::cout << x << std::endl;
  #pragma omp taskwait
  #pragma omp task
  x++;
}

OpenMP 4.0:
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(in: x)
  std::cout << x << std::endl;
  #pragma omp task depend(inout: x)
  x++;
}

Task dependences can help us remove "strong" synchronizations, increasing the look-ahead and, frequently, the parallelism!

[Timeline figure: with OpenMP 3.1, t2 only starts after the taskwait; with OpenMP 4.0 dependences, t2 can start as soon as t1 completes.]
Motivation: Cholesky factorization

OpenMP 3.1:
void cholesky(int ts, int nt, double* a[nt][nt]) {
  for (int k = 0; k < nt; k++) {
    // Diagonal block factorization
    potrf(a[k][k], ts, ts);

    // Triangular systems
    for (int i = k + 1; i < nt; i++) {
      #pragma omp task
      trsm(a[k][k], a[k][i], ts, ts);
    }
    #pragma omp taskwait

    // Update trailing matrix
    for (int i = k + 1; i < nt; i++) {
      for (int j = k + 1; j < i; j++) {
        #pragma omp task
        dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
      }
      #pragma omp task
      syrk(a[k][i], a[i][i], ts, ts);
    }
    #pragma omp taskwait
  }
}

OpenMP 4.0:
void cholesky(int ts, int nt, double* a[nt][nt]) {
  for (int k = 0; k < nt; k++) {
    // Diagonal block factorization
    #pragma omp task depend(inout: a[k][k])
    potrf(a[k][k], ts, ts);

    // Triangular systems
    for (int i = k + 1; i < nt; i++) {
      #pragma omp task depend(in: a[k][k]) depend(inout: a[k][i])
      trsm(a[k][k], a[k][i], ts, ts);
    }

    // Update trailing matrix
    for (int i = k + 1; i < nt; i++) {
      for (int j = k + 1; j < i; j++) {
        #pragma omp task depend(inout: a[j][i]) depend(in: a[k][i], a[k][j])
        dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
      }
      #pragma omp task depend(inout: a[i][i]) depend(in: a[k][i])
      syrk(a[k][i], a[i][i], ts, ts);
    }
  }
}
[The same comparison with measured performance of both versions, using the 2017 Intel compiler.]
What’s in the spec
What's in the spec: a bit of history
OpenMP 4.0
• The depend clause was added to the task construct

OpenMP 4.5
• The depend clause was added to the target constructs
• Support for doacross loops

OpenMP 5.0
• New features covered below: the mutexinoutset dependence type, iterator modifiers, depend on the taskwait construct, and dependable objects (depobj)
What's in the spec: syntax of the depend clause
depend([depend-modifier,] dependence-type : locator-list)
where:
→ depend-modifier is used to define iterators
→ dependence-type is one of: in, out, inout, mutexinoutset, depobj
→ locator-list is a list of variables or array sections
What's in the spec: depend clause (1)
◼ A task cannot be executed until all its predecessor tasks have completed
◼ If a task defines an in dependence over a variable
→ the task will depend on all previously generated sibling tasks that reference at least one of the list items in an out or inout dependence

int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  { ... }
}
What's in the spec: depend clause (2)
◼ New dependence type: mutexinoutset

int x = 0, y = 0, res = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(out: res) //T0
  res = 0;
  #pragma omp task depend(out: x) //T1
  long_computation(x);
  #pragma omp task depend(out: y) //T2
  long_computation(y);
  #pragma omp task depend(in: x) depend(mutexinoutset: res) //T3
  res += x;
  #pragma omp task depend(in: y) depend(mutexinoutset: res) //T4
  res += y;
  #pragma omp task depend(in: res) //T5
  std::cout << res << std::endl;
}

[Task graph: T0, T1, T2 at the top; T3 and T4 form the inout set; T5 at the bottom.]
1. inoutset property: tasks in the inout set (T3, T4) depend on the previous out/inout tasks and precede the later in task (T5)
2. mutex property: tasks inside the inout set can be executed in any order, but with mutual exclusion
What's in the spec: depend clause (4)
◼ Task dependences are defined among sibling tasks
◼ List items used in the depend clauses [...] must indicate identical or disjoint storage

//test1.cc
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  {
    #pragma omp task depend(inout: x) //T1.1
    x++;
    #pragma omp taskwait
  }
  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;
}
// T1.1 is a child of T1, not a sibling of T2: only T1 and T2 are ordered

//test2.cc
int a[100] = {0};
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: a[50:99]) //T1
  compute(/* from */ &a[50], /* elems */ 50);

  #pragma omp task depend(in: a) //T2
  print(/* from */ a, /* elems */ 100);
} // ??? a and a[50:99] partially overlap: neither identical nor disjoint
What’s in the spec: depend clause (5)
◼ Iterators + deps: a way to define a dynamic number of dependences
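The slide's example did not survive extraction; a minimal sketch of the OpenMP 5.0 iterator modifier, assuming an array v whose n elements were produced by earlier tasks (process() is a placeholder):

// One task with n "in" dependences, one per element of v
#pragma omp task depend(iterator(it = 0:n), in: v[it])
process(v, n);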
Philosophy
Philosophy: data-flow model
◼ Task dependences are orthogonal to data-sharing attributes
→ Dependences are a way to define task-execution constraints

// test1.cc
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) \
                   firstprivate(x) //T1
  x++;
  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;
}

// test2.cc
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  x++;
  #pragma omp task depend(in: x) \
                   firstprivate(x) //T2
  std::cout << x << std::endl;
}
Philosophy: data-flow model (3)

//test1_v1.cc
int x = 0, y = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  {
    x++;
    y++; // !!!
  }
  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;
  #pragma omp taskwait
  std::cout << y << std::endl;
}

//test1_v2.cc
int x = 0, y = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  {
    x++;
    y++; // !!!
  }
  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;
  #pragma omp task depend(in: y) //T3
  std::cout << y << std::endl;
}

//test1_v3.cc
int x = 0, y = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  {
    x++;
    y++;
  }
  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;
  #pragma omp task depend(in: x) //T3
  std::cout << y << std::endl;
}

//test1_v4.cc
int x = 0, y = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x, y) //T1
  {
    x++;
    y++;
  }
  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;
  #pragma omp task depend(in: y) //T3
  std::cout << y << std::endl;
}
Use case: intro to Gauss-Seidel
void serial_gauss_seidel(int tsteps, int size, int (*p)[size]) {
  for (int t = 0; t < tsteps; ++t) {
    for (int i = 1; i < size-1; ++i) {
      for (int j = 1; j < size-1; ++j) {
        p[i][j] = 0.25 * (p[i][j-1] + // left
                          p[i][j+1] + // right
                          p[i-1][j] + // top
                          p[i+1][j]); // bottom
      }
    }
  }
}

Access pattern analysis (for a specific t, i and j):
Each cell depends on:
- two cells (north & west) that are computed in the current time step, and
- two cells (south & east) that were computed in the previous time step
Use case: Gauss-Seidel (2)
◼ 1st parallelization strategy: parallelize within a specific time step tn
→ the kernel is the serial code shown above; cells on the same anti-diagonal of the block grid are independent and can be computed in parallel (see the blocked implementation on the next slide)
Use case: Gauss-Seidel (3)
void gauss_seidel(int tsteps, int size, int TS, int (*p)[size]) {
  int NB = size / TS;
  #pragma omp parallel
  for (int t = 0; t < tsteps; ++t) {
    // First NB diagonals
    for (int diag = 0; diag < NB; ++diag) {
      #pragma omp for
      for (int d = 0; d <= diag; ++d) {
        int ii = d;
        int jj = diag - d;
        for (int i = 1+ii*TS; i < ((ii+1)*TS); ++i)
          for (int j = 1+jj*TS; j < ((jj+1)*TS); ++j)
            p[i][j] = 0.25 * (p[i][j-1] + p[i][j+1] +
                              p[i-1][j] + p[i+1][j]);
      }
    }
    // Last NB diagonals
    for (int diag = NB-1; diag >= 0; --diag) {
      // Similar code to the previous loop
    }
  }
}
Use case: Gauss-Seidel (4)
◼ 2nd parallelization strategy: overlap multiple time iterations (tn, tn+1, tn+2, tn+3)
→ the kernel is still the serial code shown above; with task dependences, blocks of a later time step can start as soon as the blocks they depend on from the previous step have completed
Use case : Gauss-seidel (5)
void gauss_seidel(int tsteps, int size, int TS, int (*p)[size]) { // inner matrix region
  int NB = size / TS;
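The slide's code breaks off here; a hedged sketch of how the inner region might be taskified with dependences (the block loop structure and the gs_block() helper are assumptions, boundary blocks elided):

for (int t = 0; t < tsteps; ++t) {
  for (int ii = 1; ii < NB-1; ++ii) {
    for (int jj = 1; jj < NB-1; ++jj) {
      // run after the neighbor blocks this block reads from
      #pragma omp task depend(inout: p[ii*TS][jj*TS])              \
              depend(in: p[(ii-1)*TS][jj*TS], p[(ii+1)*TS][jj*TS], \
                         p[ii*TS][(jj-1)*TS], p[ii*TS][(jj+1)*TS])
      gs_block(p, TS, ii, jj);  // assumed: Gauss-Seidel update of one block
    }
  }
}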
OpenMP 5.0: (even) more advanced features
Advanced features: deps on taskwait
◼ Adding dependences to the taskwait construct
→ Using a taskwait construct to explicitly wait for some predecessor tasks
→ Syntactic sugar!

int x = 0, y = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  x++;
  #pragma omp task depend(in: y) //T2
  std::cout << y << std::endl;
  #pragma omp taskwait depend(in: x)
  // waits only for T1 (which writes x), not for T2
  std::cout << x << std::endl;
}
Advanced features: dependable objects (1)
◼ Offer a way to manually handle dependences
→Useful for complex task dependences
Advanced features: dependable objects (2)
◼ Offer a way to manually handle dependences

Without depobj:
int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(inout: x) //T1
  x++;
  #pragma omp task depend(in: x) //T2
  std::cout << x << std::endl;
}

With depobj:
int x = 0;
#pragma omp parallel
#pragma omp single
{
  omp_depend_t obj;
  #pragma omp depobj(obj) depend(inout: x)

  #pragma omp task depend(depobj: obj) //T1
  x++;

  #pragma omp depobj(obj) update(in)
  #pragma omp task depend(depobj: obj) //T2
  std::cout << x << std::endl;
}
Cancellation
OpenMP 3.1 Parallel Abort
◼ Once started, parallel execution cannot be aborted in OpenMP 3.1
→ Code regions must always run to completion
→ (or not start at all)
Cancellation Constructs
◼ Two constructs:
→ Activate cancellation:
C/C++: #pragma omp cancel
Fortran: !$omp cancel
→ Check for cancellation:
C/C++: #pragma omp cancellation point
Fortran: !$omp cancellation point
◼ Cancellation is checked only at certain points
→ Avoids unnecessary overheads
→ Programmers need to reason about cancellation
→ Cleanup code needs to be added manually
Cancellation Semantics
[Figure sequence: threads A, B and C execute a parallel region; one thread requests cancellation with a cancel construct; the other threads observe the request at their next cancellation point and all threads proceed to the end of the region.]
cancel Construct
◼ Syntax:
#pragma omp cancel construct-type-clause [ [, ]if-clause]
!$omp cancel construct-type-clause [ [, ]if-clause]
◼ Clauses:
parallel
sections
for (C/C++)
do (Fortran)
taskgroup
if (scalar-expression)
◼ Semantics
→ Requests cancellation of the innermost OpenMP region of the type specified in construct-type-clause
→ Lets the encountering thread/task proceed to the end of the canceled region
cancellation point Construct
◼ Syntax:
#pragma omp cancellation point construct-type-clause
!$omp cancellation point construct-type-clause
◼ Clauses:
parallel
sections
for (C/C++)
do (Fortran)
taskgroup
◼ Semantics
→ Introduces a user-defined cancellation point
→ Pre-defined cancellation points:
→ implicit/explicit barrier regions
→ cancel regions
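A hedged sketch combining both constructs (process_chunk(), found_error() and NCHUNKS are assumptions):

#pragma omp parallel
{
    for (int c = 0; c < NCHUNKS; c++) {
        process_chunk(c);
        if (found_error(c)) {
            #pragma omp cancel parallel          // request cancellation
        }
        #pragma omp cancellation point parallel  // others notice it here
    }
}   // canceled threads resume at the end of the region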
Cancellation of OpenMP Tasks
◼ Cancellation only acts on tasks grouped by the taskgroup construct
→ The encountering task jumps to the end of its task region
→ Any executing task will run to completion
(or until it reaches a cancellation point region)
→ Any task that has not yet begun execution may be discarded
(and is considered completed)
Task Cancellation Example
binary_tree_t* search_tree_parallel(binary_tree_t* tree, int value) {
binary_tree_t* found = NULL;
#pragma omp parallel shared(found,tree,value)
{
#pragma omp master
{
#pragma omp taskgroup
{
found = search_tree(tree, value);
}
}
}
return found;
}
Task Cancellation Example
binary_tree_t* search_tree(binary_tree_t* tree, int value) {
  binary_tree_t* found = NULL;
  if (tree) {
    if (tree->value == value) {
      found = tree;
    }
    else {
      #pragma omp task shared(found)
      {
        binary_tree_t* found_left;
        found_left = search_tree(tree->left, value);
        if (found_left) {
          #pragma omp atomic write
          found = found_left;
          #pragma omp cancel taskgroup
        }
      }
      #pragma omp task shared(found)
      {
        binary_tree_t* found_right;
        found_right = search_tree(tree->right, value);
        if (found_right) {
          #pragma omp atomic write
          found = found_right;
          #pragma omp cancel taskgroup
        }
      }
      #pragma omp taskwait
    }
  }
  return found;
}
Improving Tasking Performance:
Cutoff clauses and strategies
Example: Sudoku revisited
Parallel Brute-force Sudoku (recap)
◼ This parallel algorithm finds all valid solutions: (1) search an empty field; (2) try all numbers; (2a) check the Sudoku; if invalid, skip; if valid, go to the next field
◼ The first call is contained in a #pragma omp parallel / #pragma omp single, such that one task starts the execution of the algorithm
◼ Each #pragma omp task needs to work on a new copy of the Sudoku board
Performance Evaluation
Sudoku on 2x Intel Xeon E5-2650 @2.0 GHz
[Chart: runtime in seconds (16x16 puzzle) and speedup for 1-32 threads; Intel C++ 13.1, scatter binding.]
Performance Analysis
◼ Event-based profiling provides a good overview:
→ every thread executes ~1.3 million tasks ...
→ ... in ~5.7 seconds => the average duration of a task is ~4.4 μs
◼ Tracing provides more details:
→ near the top of the call stack (levels 6-12): task duration 0.047 sec
→ level 48: task duration 0.001 sec
→ level 82: task duration 2.2 μs
→ tasks get much smaller down the call stack
Performance Evaluation (with cutoff)
Sudoku on 2x Intel Xeon E5-2650 @2.0 GHz
[Chart: runtime (16x16 puzzle) and speedup for 1-32 threads, comparing Intel C++ 13.1 scatter binding with and without cutoff; the cutoff version scales markedly better (speedup axis up to 18 vs. 4 without cutoff).]
The if clause
◼ Rule of thumb: use the if(expression) clause as a "switch off" mechanism
→ Allows lightweight implementations of task creation and execution, but reduces parallelism
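A hedged sketch in the spirit of the Fibonacci example (the depth argument and CUTOFF are assumptions): below the cutoff the task is undeferred and executed immediately by the encountering thread, avoiding most task-management overhead.

#pragma omp task shared(x) if(depth < CUTOFF)
x = fib(n - 1, depth + 1);   // deep calls run immediately, not deferred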
The final clause
◼ The final(expression) clause
→ For nested tasks / recursive applications
→ Avoids future task creation once the expression is true → reduces overhead but also reduces parallelism
→ In a final task, all descendant tasks are also final and included
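A hedged sketch of the same cutoff with final (again assuming a depth argument): unlike if, a final task makes its whole subtree final and included, so no deferred tasks are created below the cutoff.

#pragma omp task shared(y) final(depth >= CUTOFF)
y = fib(n - 2, depth + 1);   // subtree below CUTOFF creates no more deferred tasks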
The mergeable clause
◼ The mergeable clause
→ Optimization: allows the implementation to drop the task's separate data environment (data-sharing clauses are still honored)
◼ A task annotated with the mergeable clause is called a mergeable task
→ It may become a merged task if it is an undeferred task or an included task
◼ A good implementation could execute a merged task without adding any OpenMP-related overhead
→ Unfortunately, no commercial OpenMP implementation takes advantage of final or mergeable =(
Vectorization w/ OpenMP SIMD
Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED,
BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY
THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY
EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY
OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY,
OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY
RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and
reflect the approximate performance of Intel products as measured by those tests. Any difference in
system hardware or software design or configuration may affect actual performance. Buyers should
consult other sources of information to evaluate the performance of systems or components they are
considering purchasing. For more information on performance tests and on the performance of Intel
products, reference www.intel.com/software/products.
All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, VTune, and Cilk are trademarks of Intel
Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations
that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction
sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any
optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this
product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and
Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Evolution of Intel Hardware
              64-bit Intel®  Xeon® 5100  Xeon® 5500  Xeon® 5600  Xeon® E5-      Xeon® Scalable
              Xeon®          series      series      series      2600v3 series  Processor
Frequency     3.6 GHz        3.0 GHz     3.2 GHz     3.3 GHz     2.3 GHz        2.5 GHz
Core(s)       1              2           4           6           18             28
Thread(s)     2              2           8           12          36             56
SIMD width    128 (2 clock)  128 (1 clk) 128 (1 clk) 128 (1 clk) 256 (1 clk)    512 (1 clk)
Levels of Parallelism
◼ OpenMP already supports several levels of parallelism in today’s hardware
SIMD on Intel® Architecture
◼ Width of SIMD registers has been growing:
→ SSE: 128 bit (2 x DP, 4 x SP)
→ AVX: 256 bit (4 x DP, 8 x SP)
→ AVX-512: 512 bit (8 x DP, 16 x SP)
More Powerful SIMD Units
◼ SIMD instructions become more powerful
◼ One example is the Intel® Xeon Phi™ Coprocessor
vaddpd dest, source1, source2
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source1
+
b7 b6 b5 b4 b3 b2 b1 b0 source2
=
a7+b7 a6+b6 a5+b5 a4+b4 a3+b3 a2+b2 a1+b1 a0+b0
dest
More Powerful SIMD Units
◼ SIMD instructions become more powerful
◼ One example is the Intel® Xeon Phi™ Coprocessor
vfmadd213pd source1, source2, source3
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source1
*
b7 b6 b5 b4 b3 b2 b1 b0 source2
+
c7 c6 c5 c4 c3 c2 c1 c0 source3
=
a7*b7 a6*b6 a5*b5 a4 *b4 a3*b3 a2*b2 a1*b1 a0*b0 dest
+c7 +c6 +c5 +c4 +c3 +c2 +c1 +c0
More Powerful SIMD Units
◼ SIMD instructions become more powerful
◼ One example is the Intel® Xeon Phi™ Coprocessor
vaddpd dest{k1}, source1, source2
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source1
+
b7 b6 b5 b4 b3 b2 b1 b0 source2
1 0 1 0 0 1 0 1 mask
=
a7+b7 d6 a5+b5 d4 d3 a2+b2 d1 a0+b0 dest
More Powerful SIMD Units
◼ SIMD instructions become more powerful
◼ One example is the Intel® Xeon Phi™ Coprocessor
vmovapd dest, source{dacb}
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source
swizzle
a7 a4 a6 a5 a3 a0 a2 a1 “tmp”
move
a7 a4 a6 a5 a3 a0 a2 a1 dest
Auto-vectorization
◼ Compilers offer auto-vectorization as an optimization pass
→Usually part of the general loop optimization passes
→Code analysis detects code properties that inhibit SIMD vectorization
→Heuristics determine if SIMD execution might be beneficial
→If all goes well, the compiler will generate SIMD instructions
Why Auto-vectorizers Fail
◼ Data dependencies
◼ Other potential reasons
→Alignment
→Function calls in loop block
→Complex control flow / conditional branches
→Loop not “countable”
→e.g., upper bound not a runtime constant
→Mixed data types
→Non-unit stride between elements
→Loop body too complex (register pressure)
→Vectorization seems inefficient
◼ Many more … but less likely to occur
Data Dependencies
◼ Suppose two statements S1 and S2
◼ S2 depends on S1, iff S1 must execute before S2
→ Control-flow dependence
→ Data dependence
→ Dependencies can be carried over between loop iterations
◼ Important flavors of data dependencies:

FLOW (read-after-write):
s1: a = 40
    b = 21
s2: c = a + 2

ANTI (write-after-read):
    b = 40
s1: a = b + 1
s2: b = 21
Loop-Carried Dependencies
◼ Dependencies may occur across loop iterations
→ Loop-carried dependency
◼ The following code contains such a dependency:

void lcd_ex(float* a, float* b, size_t n, float c1, float c2)
{
  size_t i;
  for (i = 0; i < n; i++) {
    a[i] = c1 * a[i + 17] + c2 * b[i];
  }
}

→ Loop-carried dependency between a[i] and a[i+17]; the distance is 17.
◼ Some iterations of the loop have to complete before the next iteration can run
→ Simple trick: Can you reverse the loop w/o getting wrong results?
Loop-carried Dependencies
◼ Can we parallelize or vectorize the loop?

void lcd_ex(float* a, float* b, size_t n, float c1, float c2) {
  for (int i = 0; i < n; i++) {
    a[i] = c1 * a[i + 17] + c2 * b[i];
  }
}

[Figure: thread 1 works on iterations 0-3 while thread 2 works on iterations 17-20.]

→ Parallelization: no
(except for very specific loop schedules)
→ Vectorization: yes
(iff the vector length is shorter than any distance of any dependency)
Example: Loop not Countable
◼ “Loop not Countable” plus “Assumed Dependencies”
typedef struct {
float* data;
size_t size;
} vec_t;
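The slide's loop did not survive extraction; a hedged sketch of the kind of loop meant here (vec_eltwise_product() is an assumption): the trip count is loaded through a pointer, so the compiler must assume it may change during the loop ("not countable"), and the vectors may alias ("assumed dependencies").

void vec_eltwise_product(vec_t* u, vec_t* v, vec_t* w) {
  for (size_t i = 0; i < u->size; i++)   // u->size: not a runtime constant
    w->data[i] = u->data[i] * v->data[i]; // u, v, w may alias
}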
In a Time Before OpenMP 4.0
◼ Support required vendor-specific extensions
→Programming models (e.g., Intel® Cilk Plus)
→Compiler pragmas (e.g., #pragma vector)
→Low-level constructs (e.g., _mm_add_pd())
SIMD Loop Construct
◼ Vectorize a loop nest
→Cut loop into chunks that fit a SIMD vector register
→No parallelization of the loop body
◼ Syntax (C/C++)
#pragma omp simd [clause[[,] clause],…]
for-loops
◼ Syntax (Fortran)
!$omp simd [clause[[,] clause],…]
do-loops
[!$omp end simd]
Example
float sprod(float *a, float *b, int n) {
float sum = 0.0f;
#pragma omp simd reduction(+:sum)
for (int k=0; k<n; k++)
sum += a[k] * b[k];
return sum;
}
[Figure: the scalar loop is cut into SIMD-width chunks and vectorized.]
Data Sharing Clauses
◼ private(var-list):
Uninitialized vectors for variables in var-list
x: 42 ? ? ? ?
◼ firstprivate(var-list):
Initialized vectors for variables in var-list
x: 42 42 42 42 42
◼ reduction(op:var-list):
Create private variables for var-list and apply reduction operator op at the end of the construct
12 5 8 17 x: 42
SIMD Loop Clauses
◼ safelen(length)
→ Maximum number of iterations that can run concurrently without breaking a dependence
→ In practice, the maximum vector length
◼ linear(list[:linear-step])
→ The variable's value is in a linear relationship with the iteration number
→ x_i = x_orig + i * linear-step
◼ aligned(list[:alignment])
→ Specifies that the list items have a given alignment
→ Default is the alignment for the architecture
◼ collapse(n)
→ Associates the n nested loops with the construct, collapsing them into one iteration space
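A minimal sketch combining two of these clauses, assuming float arrays a and b allocated with 64-byte alignment: safelen(16) promises no dependence with a distance shorter than 16 iterations, and aligned() lets the compiler emit aligned loads/stores.

#pragma omp simd safelen(16) aligned(a,b:64)
for (int i = 0; i < n; i++)
  a[i] = 0.5f * (a[i] + b[i]);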
SIMD Worksharing Construct
◼ Parallelize and vectorize a loop nest
→Distribute a loop’s iteration space across a thread team
→Subdivide loop chunks to fit a SIMD vector register
◼ Syntax (C/C++)
#pragma omp for simd [clause[[,] clause],…]
for-loops
◼ Syntax (Fortran)
!$omp do simd [clause[[,] clause],…]
do-loops
[!$omp end do simd [nowait]]
Example
float sprod(float *a, float *b, int n) {
float sum = 0.0f;
#pragma omp for simd reduction(+:sum)
for (int k=0; k<n; k++)
sum += a[k] * b[k];
return sum;
}
[Figure: the iteration space is first distributed across the threads (Thread 0, Thread 1, Thread 2), then each thread's chunk is vectorized.]
Be Careful What You Wish For…
float sprod(float *a, float *b, int n) {
  float sum = 0.0f;
  #pragma omp for simd reduction(+:sum) \
                       schedule(static, 5)
  for (int k=0; k<n; k++)
    sum += a[k] * b[k];
  return sum;
}
◼ You should choose chunk sizes that are multiples of the SIMD length
→ Remainder loops are not triggered
→ Likely better performance
◼ In the above example ...
→ with AVX2 (8 floats per SIMD vector), the code will only execute the remainder loop!
→ with SSE (4 floats per SIMD vector), the code will have one iteration in the SIMD loop plus one in the remainder loop!
OpenMP 4.5 Simplifies SIMD Chunks
float sprod(float *a, float *b, int n) {
float sum = 0.0f;
#pragma omp for simd reduction(+:sum) \
schedule(simd: static, 5)
for (int k=0; k<n; k++)
sum += a[k] * b[k];
return sum;
}
void example() {
#pragma omp parallel for simd
for (i=0; i<N; i++) {
d[i] = min(distsq(a[i], b[i]), c[i]);
} }
SIMD Function Vectorization
◼ Declare one or more functions to be compiled for calls from a SIMD-
parallel loop
◼ Syntax (C/C++):
#pragma omp declare simd [clause[[,] clause],…]
[#pragma omp declare simd [clause[[,] clause],…]]
[…]
function-definition-or-declaration
◼ Syntax (Fortran):
!$omp declare simd (proc-name-list)
SIMD Function Vectorization
#pragma omp declare simd
float min(float a, float b) {
  return a < b ? a : b;
}

Generated SIMD variant (AVX-512):
_ZGVZN16vv_min(%zmm0, %zmm1):
  vminps %zmm1, %zmm0, %zmm0
  ret
inbranch & notinbranch
#pragma omp declare simd inbranch
float do_stuff(float x) {
  /* do something */
  return x * 2.0;
}

void example() {
  #pragma omp simd
  for (int i = 0; i < N; i++)
    if (a[i] < 0.0)
      b[i] = do_stuff(a[i]);
}

Generated (pseudo-)code:
vec8 do_stuff_v(vec8 x, mask m) {
  /* do something */
  vmulpd x{m}, 2.0, tmp
  return tmp;
}

for (int i = 0; i < N; i += 8) {
  vcmp_lt &a[i], 0.0, mask
  b[i] = do_stuff_v(&a[i], mask);
}
SIMD Constructs & Performance
[Chart: relative speed-up (higher is better) of ICC auto-vectorization vs. the ICC SIMD directive on Mandelbrot, Volume Rendering, BlackScholes, Fast Walsh, Perlin Noise and SGpp; the SIMD directive reaches up to 4.34x.]
M.Klemm, A.Duran, X.Tian, H.Saito, D.Caballero, and X.Martorell. Extending OpenMP with Vector Constructs for Modern
Multicore SIMD Architectures. In Proc. of the Intl. Workshop on OpenMP, pages 59-72, Rome, Italy, June 2012. LNCS 7312.
OpenMP: Memory Access
Example: Loop Parallelization
◼ Assume the following: you have learned that load imbalances can
severely impact performance and a dynamic loop schedule may
prevent this:
→What is the issue with the following code:
double* A;
A = (double*) malloc(N * sizeof(double));
/* assume some initialization of A */
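A hedged completion of the slide's example (f() is an assumed per-element function): with schedule(dynamic), the thread that first touches A[i] during initialization is rarely the thread that computes on it later, so on a NUMA system most accesses become remote — and neighboring elements worked on by different threads invite false sharing.

#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < N; i++) {
  A[i] = f(A[i]);
}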
False Sharing
◼ False sharing: parallel accesses to the same cache line may have a significant performance impact!

[Figure: two dual-core chips with on-chip caches sharing a bus to memory holding A[0-7]; core 1 updates A[0], core 2 updates A[1], core 3 updates A[2], core 4 updates A[3].]

→ Caches are organized in lines of typically 64 bytes: the integer array a[0-4] fits into one cache line.
→ Whenever one element of a cache line is updated, the whole cache line is invalidated.
→ Local copies of a cache line have to be re-loaded from main memory and the computation may have to be repeated.
Non-uniform Memory
How to distribute the data?

double* A;
A = (double*) malloc(N * sizeof(double));

for (int i = 0; i < N; i++) {
  A[i] = 0.0;
}

[Figure: four cores with on-chip caches on two sockets, connected by an interconnect; each socket has its own memory.]
Non-uniform Memory
◼ Serial code: all array elements are allocated in the memory of the NUMA node closest to the core executing the initializing thread (first touch)

double* A;
A = (double*) malloc(N * sizeof(double));

[Figure: A[0] ... A[N] all placed in the memory of the first NUMA node.]
About Data Distribution
◼ Important aspect on cc-NUMA systems
→If not optimal, longer memory access times and hotspots
◼ Windows, Linux and Solaris all use the “First Touch” placement policy
by default
→May be possible to override default (check the docs)
First Touch Memory Placement
◼ First touch w/ parallel code: all array elements are allocated in the memory of the NUMA node that contains the core that executes the thread that initializes the respective partition

double* A;
A = (double*) malloc(N * sizeof(double));

omp_set_num_threads(2);
#pragma omp parallel for
for (int i = 0; i < N; i++) {
  A[i] = 0.0;
}

[Figure: with serial initialization (ser_init), arrays a[0,N-1], b[0,N-1] and c[0,N-1] all end up in the memory attached to CPU 0, even though threads T1-T12 run on both CPU 0 and CPU 1.]
Get Info on the System Topology
◼ Before you design a strategy for thread binding, you should have a basic understanding of the system topology. Use one of the following options on the target machine:
→ Intel MPI's cpuinfo tool
→ cpuinfo
→ Delivers information about the number of sockets (= packages) and the mapping of processor ids to cpu cores used by the OS.
→ A topology viewer such as hwloc's lstopo
→ Displays a graphical representation of the system topology, separated into NUMA nodes, along with the mapping of processor ids to cpu cores used by the OS and additional info on caches.
Decide for Binding Strategy
◼ Selecting the „right“ binding strategy depends not only on the topology,
but also on application characteristics.
→Putting threads far apart, i.e., on different sockets
→May improve aggregated memory bandwidth available to application
→Putting threads close together, i.e., on two adjacent cores that possibly share
some caches
→May improve performance of synchronization constructs
◼ Goals
→ user has a way to specify where to execute OpenMP threads
→ locality between OpenMP threads / less false sharing / memory bandwidth
Places
◼ Assume the following machine:
p0 p1 p2 p3 p4 p5 p6 p7
Places + Binding Policies (2/2)
◼ Example's objective:
→ separate cores for the outer loop and near cores for the inner loop
◼ Outer parallel region: proc_bind(spread) num_threads(4)
Inner parallel region: proc_bind(close) num_threads(4)
→ spread creates the partition, close binds threads within the respective partition

OMP_PLACES=(0,1,2,3), (4,5,6,7), ... = (0-3):8:4 = cores

#pragma omp parallel proc_bind(spread) num_threads(4)
#pragma omp parallel proc_bind(close) num_threads(4)

◼ Example (places p0 ... p7):
→ initial: all places available
→ spread 4: the four outer threads land on every second place, each owning a two-place partition
→ close 4: the inner threads are bound within their partition, next to the spreading thread
More Examples (1/3)
◼ Assume the following machine:
p0 p1 p2 p3 p4 p5 p6 p7
More Examples (2/3)
◼ Assume the following machine:
p0 p1 p2 p3 p4 p5 p6 p7
◼ Parallel Region with four threads, one per core, but only on the first
socket
→OMP_PLACES=cores
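The remaining settings did not survive extraction; a hedged completion for this objective:

OMP_PLACES=cores
OMP_NUM_THREADS=4
OMP_PROC_BIND=close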
More Examples (3/3)
◼ Spread a nested loop first across two sockets, then among the cores
within each socket, only one thread per core
→OMP_PLACES=cores
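A hedged completion for the nested case, matching the spread/close pattern shown earlier:

OMP_PLACES=cores
#pragma omp parallel proc_bind(spread) num_threads(2)
#pragma omp parallel proc_bind(close) num_threads(4)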
Places API (1/2)
◼ 1: Query information about binding and a single place of
all places with ids 0 … omp_get_num_places():
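The routine list itself did not survive extraction; the standard OpenMP routines in this group are:

int omp_get_num_places();                              // number of places available
int omp_get_place_num_procs(int place_num);            // processors in a place
void omp_get_place_proc_ids(int place_num, int *ids);  // their processor ids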
Places API (2/2)
◼ 2: Query information about the place partition:
◼ int omp_get_place_num(): returns the place number of the place to which the
current thread is bound
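The remaining partition queries (also standard OpenMP routines):

int omp_get_partition_num_places();                  // places in the current partition
void omp_get_partition_place_nums(int *place_nums);  // their place numbers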
Places API: Example
◼ Simple routine printing the processor ids of the place the calling thread is bound to:

void print_binding_info() {
  int my_place = omp_get_place_num();
  int place_num_procs = omp_get_place_num_procs(my_place);

  int *place_processors = malloc(place_num_procs * sizeof(int));
  omp_get_place_proc_ids(my_place, place_processors);

  for (int i = 0; i < place_num_procs; i++)
    printf("%d ", place_processors[i]);

  free(place_processors);
}
OpenMP 5.0 way to do this
◼ Set OMP_DISPLAY_AFFINITY=TRUE
→Instructs the runtime to display formatted affinity information
Affinity format specification
t  omp_get_team_num()         a  omp_get_ancestor_thread_num() at level-1
T  omp_get_num_teams()        H  hostname
L  omp_get_level()            P  process identifier
n  omp_get_thread_num()       i  native thread identifier
N  omp_get_num_threads()      A  thread affinity: list of processors (cores)

◼ Example:
OMP_AFFINITY_FORMAT="Affinity: %0.3L %.8n %.15{A} %.12H"
→ Possible output:
Affinity: 001 0 0-1,16-17 host003
Affinity: 001 1 2-3,18-19 host003
A first summary
◼ Everything under control?
◼ In principle Yes, but only if
→threads can be bound explicitly,
→data can be placed well by first-touch, or can be migrated,
→you focus on a specific platform (= OS + arch) → no portability
NUMA Strategies: Overview
◼ First Touch: modern operating systems (e.g., Linux >= 2.4) decide the physical location of a memory page during the first page fault, when the page is first "touched", and put it close to the CPU causing the page fault.
User Control of Memory Affinity
◼ Explicit NUMA-aware memory allocation:
→By carefully touching data by the thread which later uses it
→By changing the default memory allocation strategy
→Linux: numactl command
→Windows: VirtualAllocExNuma() (limited functionality)
Improving Tasking Performance:
Task Affinity
Motivation
◼ Techniques for process binding & thread pinning are available
→ OpenMP thread level: OMP_PLACES & OMP_PROC_BIND
◼ OpenMP tasking: in general, tasks may be executed by any thread in the team
→ Missing task-to-data affinity may have a detrimental effect on performance
◼ OpenMP 5.0: the affinity clause to express affinity to data
The affinity clause
◼ New clause: #pragma omp task affinity(list)
→ Hint to the runtime to execute the task close to the physical location of the data
◼ Expectations:
→ Improved data locality / reduced remote memory accesses
→ Assumption: initialization and computation have the same blocking and the same affinity
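A hedged sketch (the block size BS and compute_block() are assumptions): the data is distributed by parallel first touch, and each task hints at the block it will work on.

#pragma omp parallel for
for (int i = 0; i < N; i++)
    A[i] = 0.0;                       // first touch distributes A across NUMA nodes

#pragma omp parallel
#pragma omp single
for (int b = 0; b < N; b += BS) {
    #pragma omp task affinity(A[b])   // hint: run near the node holding A[b]
    compute_block(&A[b], BS);
}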
Selected LLVM implementation details
[Flowchart: when a task region is encountered, a task without an affinity clause is pushed to the local queue. For a task with affinity, the runtime looks the data reference up in a map of previously used locations; on a miss it identifies the NUMA domain where the data is stored and saves the {reference, location} pair in the map. It then selects a thread pinned to that NUMA domain and pushes the task into that thread's queue.]

Jannis Klinkenberg, Philipp Samfass, Christian Terboven, Alejandro Duran, Michael Klemm, Xavier Teruel, Sergi Mateo, Stephen L. Olivier, and Matthias S. Müller. Assessing Task-to-Data Affinity in the LLVM OpenMP Runtime. Proceedings of the 14th International Workshop on OpenMP, IWOMP 2018. September 26-28, 2018, Barcelona, Spain.
Evaluation
◼ Program runtime (median of 10 runs) and the distribution of single-task execution times
◼ Speedup of 4.3x over the baseline
→ single task creator scenario, or tasks not created with data affinity
Managing Memory Spaces
Different kinds of memory
◼ Traditional DDR-based memory
◼ High-bandwidth memory
◼ Non-volatile memory
◼ ...
Memory Management
◼ Allocator := an OpenMP object that fulfills requests to allocate and
deallocate storage for program variables
OpenMP Allocators
◼ Selection of a certain kind of memory
Allocator name Storage selection intent
omp_default_mem_alloc use default storage
omp_large_cap_mem_alloc use storage with large capacity
omp_const_mem_alloc use storage optimized for read-only variables
omp_high_bw_mem_alloc use storage with high bandwidth
omp_low_lat_mem_alloc use storage with low latency
omp_cgroup_mem_alloc use storage close to all threads in the contention group
of the thread requesting the allocation
omp_pteam_mem_alloc use storage that is close to all threads in the same
parallel region of the thread requesting the allocation
omp_thread_local_mem_alloc use storage that is close to the thread requesting the
allocation
Using OpenMP Allocators
◼ New clause on all constructs with data sharing clauses:
→ allocate( [allocator:] list )
◼ Allocation:
→ omp_alloc(size_t size, omp_allocator_handle_t allocator)
◼ Deallocation:
→ omp_free(void *ptr, const omp_allocator_handle_t allocator)
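A minimal sketch of both mechanisms (the size N and the choice of allocators are assumptions):

// API allocation/deallocation with a predefined allocator
double *A = (double *) omp_alloc(N * sizeof(double), omp_high_bw_mem_alloc);
/* ... use A ... */
omp_free(A, omp_high_bw_mem_alloc);

// allocate clause: each thread's private copy comes from low-latency storage
double tmp[128];
#pragma omp parallel private(tmp) allocate(omp_low_lat_mem_alloc: tmp)
{
    /* tmp is allocated per thread via omp_low_lat_mem_alloc */
}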
OpenMP Allocator Traits / 1
◼ Allocator traits control the behavior of the allocator
sync_hint contended, uncontended, serialized, private
default: contended
alignment positive integer value that is a power of two
default: 1 byte
access all, cgroup, pteam, thread
default: all
pool_size positive integer value
OpenMP Allocator Traits / 3
◼ partition: partitioning of allocated memory of physical storage
resources (think of NUMA)
→environment: use system’s default behavior
→blocked: partitioning into approx. same size with at most one block per
storage resource
OpenMP Allocator Traits / 4
◼ Construction of allocators with traits via
→omp_allocator_handle_t omp_init_allocator(
omp_memspace_handle_t memspace,
int ntraits, const omp_alloctrait_t traits[]);
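A minimal sketch (the trait values and memory space are assumptions): build an allocator with 64-byte alignment on the high-bandwidth memory space, use it, and destroy it.

omp_alloctrait_t traits[] = {
    { omp_atk_alignment, 64 }      // trait key/value pairs
};
omp_allocator_handle_t al =
    omp_init_allocator(omp_high_bw_mem_space, 1, traits);

double *A = (double *) omp_alloc(N * sizeof(double), al);
/* ... use A ... */
omp_free(A, al);
omp_destroy_allocator(al);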
OpenMP Memory Spaces
◼ Storage resources with explicit support in OpenMP:
omp_default_mem_space System’s default memory resource
omp_large_cap_mem_space Storage with large(r) capacity
omp_const_mem_space Storage optimized for variables with constant value
omp_high_bw_mem_space Storage with high bandwidth
omp_low_lat_mem_space Storage with low latency