UK OpenMP Users 2018 Advanced OpenMP Tutorial
Michael Klemm
Jim Cownie
http://bit.ly/omp_uk_ug_tut
Multiple choice:
1. All the things that you may have heard of, but have never used…
2. Things which have appeared in OpenMP since you took that
undergraduate course
3. Anything beyond !$omp parallel for
4. All of the above
Major:
• Tasking (coming up soon)
• Vectorization (Michael, after coffee)
• Offload to accelerator devices (covered by Simon this afternoon; we won't cover it here at all, since he can go deeper than we could)
Minor (up next: small, simple topics, to give you time to wake up)
• Lock/critical/atomic (5.0) hints
• New dynamic schedule
Lock/critical/atomic hints
What?
A way of giving the implementation more information about the way
you’d like a lock or critical section to be implemented
A new lock initialization function omp_init_lock_with_hint(…)
A hint clause on omp critical (and, in 5.0 omp atomic)
A set of synchronization hints
omp_sync_hint_{none, contended/uncontended,
speculative/nonspeculative}
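As an illustration, a minimal sketch using both forms (OpenMP 5.0 spellings; OpenMP 4.5 used the omp_lock_hint_* names; the counter is just an example):

#include <omp.h>

omp_lock_t lock;

void init_hinted_lock() {
    // Request a speculative lock (e.g., backed by transactional memory);
    // the implementation is free to ignore the hint.
    omp_init_lock_with_hint(&lock, omp_sync_hint_speculative);
}

void increment(int *counter) {
    // A hint other than "none" requires a named critical section.
    #pragma omp critical (cnt) hint(omp_sync_hint_uncontended)
    (*counter)++;
}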
The tasks created inside compute_in_parallel(A) and compute_in_parallel_too(B) must complete before the BLAS call consumes A and B:

compute_in_parallel(A);
compute_in_parallel_too(B);
#pragma omp taskwait   // C/C++ (Fortran: !$omp taskwait)
cblas_dgemm(…, A, B, …);
Task durations by recursion level:
• lvl 6: 0.16 sec
• lvl 12: 0.047 sec
Every thread is executing ~1.3m tasks in ~5.7 seconds, so the average duration of a task is ~4.4 μs. Tasks get much smaller down the call stack.
Performance Evaluation
Sudoku on 2x Intel Xeon E5-2650 @ 2.0 GHz
[Chart: runtime [sec] for the 16x16 board and speedup over 1-32 threads, Intel C++ 13.1, scatter binding.]
Is this the best we can do?
Performance and Scalability Tuning
Idea: when you have created enough tasks to keep your cores busy, stop creating more tasks!
• if-clause
• final-clause, mergeable-clause
• natively in your program code
Task durations by recursion level: lvl 6: 0.16 sec, lvl 12: 0.047 sec, lvl 48: 0.001 sec. Every thread executes ~1.3m tasks in ~5.7 seconds, so the average task lasts only ~4.4 μs; tasks get much smaller down the call stack.
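For illustration, a minimal recursive sketch (Fibonacci, not the Sudoku code from the slides) that cuts off task creation with the final clause:

#define CUTOFF 12   /* assumed cutoff depth; tune for your machine */

long fib(int n, int depth) {
    long x, y;
    if (n < 2) return n;
    // Once depth reaches CUTOFF the tasks become final: all descendant
    // tasks are final and included, i.e. executed immediately.
    #pragma omp task shared(x) final(depth >= CUTOFF) mergeable
    x = fib(n - 1, depth + 1);
    #pragma omp task shared(y) final(depth >= CUTOFF) mergeable
    y = fib(n - 2, depth + 1);
    #pragma omp taskwait
    return x + y;
}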
Performance Evaluation
Sudoku on 2x Intel Xeon E5-2650 @ 2.0 GHz
[Chart: runtime and speedup over 1-32 threads; with the cutoff, the speedup is markedly higher than before.]
priority Clause
• The priority is a hint to the runtime system for task execution order
• Among all tasks ready to be executed, higher-priority tasks are recommended to execute before lower-priority ones
• priority is a non-negative numerical scalar (default: 0)
• priority <= max-task-priority ICV
• environment variable OMP_MAX_TASK_PRIORITY
• You cannot rely on task execution order being determined by this clause; it's only a hint and can be ignored!
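A short sketch (process() and cost_estimate() are hypothetical helpers) showing how priority can nudge the runtime toward expensive work items first:

void process_all(int n) {
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i++) {
        // Larger value = higher priority; values above the
        // max-task-priority ICV are treated as equal to it.
        #pragma omp task priority(cost_estimate(i))
        process(i);
    }
}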
final Clause
• Syntax (C/C++)
#pragma omp task final(expr)
• Syntax (Fortran)
!$omp task final(expr)
• When expr evaluates to true, the generated task is final: all of its descendant tasks are also final and included, i.e., executed immediately by the encountering thread.
depend Clause
• The task dependence is fulfilled when the predecessor task has completed
• in dependency-type: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause.
• out and inout dependency-types: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause.
• mutexinoutset: only one task in the set may execute at any time (OpenMP 5.0!)
• The list items in a depend clause may include array sections.
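A minimal sketch of a flow (out to in) dependence between sibling tasks, using array sections (init() and use() are assumed helpers):

void producer_consumer(int *x, int n) {
    #pragma omp task depend(out: x[0:n])   // producer writes x
    init(x, n);
    #pragma omp task depend(in: x[0:n])    // consumer reads x and therefore
    use(x, n);                             // runs only after the producer
    #pragma omp taskwait
}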
Full article here: http://www.hpcwire.com/2015/10/19/numerical-algorithms-and-libraries-at-exascale/
taskloop Construct
• Syntax (C/C++)
#pragma omp taskloop [simd] [clause[[,] clause],…]
for-loops
• Syntax (Fortran)
!$omp taskloop [simd] [clause[[,] clause],…]
do-loops
[!$omp end taskloop [simd]]
• grainsize(grain-size)
Each chunk has at least grain-size and at most 2*grain-size loop iterations
• num_tasks(num-tasks)
Create num-tasks tasks for the iterations of the loop
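A small usage sketch (array names assumed):

void saxpy_tasks(float a, float *x, float *y, int n) {
    #pragma omp parallel
    #pragma omp single
    // Each task gets between 512 and 1024 iterations.
    #pragma omp taskloop grainsize(512)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}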
double* A;
A = (double*) malloc(N * sizeof(double));

for (int i = 0; i < N; i++) {
    A[i] = 0.0;
}

[Diagram: two NUMA nodes, each with four cores and on-chip caches, connected through an interconnect to their local memories.]

Non-uniform Memory
Serial code: all array elements are allocated in the memory of the NUMA node closest to the core executing the initializer thread (first touch).
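The extracted slide only shows the allocation for the parallel case; a sketch of the missing parallel first-touch initialization:

double* A;
A = (double*) malloc(N * sizeof(double));

#pragma omp parallel for
for (int i = 0; i < N; i++) {
    A[i] = 0.0;   // each thread touches, and thereby places, its part of A
}

Now each page of A lands in the memory of the NUMA node whose thread touched it first, matching a later parallel access pattern with the same schedule.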
• Goals
• give the user a way to specify where OpenMP threads execute, for locality between OpenMP threads / less false sharing / memory bandwidth
OMP_PLACES Environment Variable
• Assume the following machine:
[Diagram: eight processors p0-p7, spread over two sockets.]
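Typical values for such a machine (a sketch; the abstract names are the portable choice):

OMP_PLACES=threads          # one place per hardware thread
OMP_PLACES=cores            # one place per core
OMP_PLACES=sockets          # one place per socket
OMP_PLACES="{0:4},{4:4}"    # explicit interval syntax: two places of four processors each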
More Examples (1/3)
• Assume the following machine:
[Diagram: the same machine, processors p0-p7.]
• Parallel Region with four threads, one per core, but only on the first
socket
• OMP_PLACES=cores
• #pragma omp parallel num_threads(4) \
proc_bind(close)
More Examples (3/3)
• Spread a nested loop first across two sockets, then among the cores
within each socket, only one thread per core
• OMP_PLACES=cores
• #pragma omp parallel num_threads(2) \
proc_bind(spread)
#pragma omp parallel num_threads(4) \
proc_bind(close)
Places API: Example
• Simple routine printing the processor ids of the place the calling
thread is bound to:
void print_binding_info() {
    int my_place = omp_get_place_num();
    int place_num_procs = omp_get_place_num_procs(my_place);
    int *place_processors = malloc(sizeof(int) * place_num_procs);
    omp_get_place_proc_ids(my_place, place_processors);
    for (int i = 0; i < place_num_procs; i++)
        printf("%d ", place_processors[i]);   // processor ids of this place
    free(place_processors);
}
A First Summary
• Everything is under control now?
• In principle yes, but only if
• threads can be bound explicitly,
• data can be placed well by first-touch, or can be migrated,
• you focus on a specific platform (= OS + arch) → no portability
• Next Touch: the binding of pages to NUMA nodes is removed and pages are placed at the location of the next "touch"; well supported in Solaris, expensive to implement in Linux
• Allocation:
• omp_alloc(size_t size, omp_allocator_t *allocator)
• Deallocation:
• omp_free(void *ptr, const omp_allocator_t *allocator)
• allocator argument is optional
• allocate directive
• Standalone directive for allocation, or declaration of an allocation statement
Example: Using Memory Allocators (v5.0)
void allocator_example(omp_allocator_t *my_allocator) {
int a[M], b[N], c;
#pragma omp allocate(a) allocator(omp_high_bw_mem_alloc)
#pragma omp allocate(b) // controlled by OMP_ALLOCATOR and/or omp_set_default_allocator
double *p = (double *) omp_alloc(N*M*sizeof(*p), my_allocator);
omp_free(p);  // allocator argument is optional
}
OpenMP Task Affinity (v5.0)
• OpenMP version 5.0 will support task affinity
#pragma omp task affinity(<var-reference>)
• Task-to-data affinity
• Hint to execute task as close as possible to the location of the data
OpenMP Task Affinity

void task_affinity() {
    double* B;
    #pragma omp task shared(B)
    {
        B = init_B_and_important_computation(A);
    }
    #pragma omp task firstprivate(B)
    {
        important_computation_too(B);
    }
    #pragma omp taskwait
}

[Diagram: two NUMA nodes; without affinity hints, the two tasks may run on cores of different nodes, far from the data they share.]
OpenMP Task Affinity

void task_affinity() {
    double* B;
    #pragma omp task shared(B) affinity(A[0:N])
    {
        B = init_B_and_important_computation(A);
    }
    #pragma omp task firstprivate(B) affinity(B[0:N])
    {
        important_computation_too(B);
    }
    #pragma omp taskwait
}

[Diagram: two NUMA nodes; the affinity clauses hint that both tasks should run close to the node holding A[0]…A[N] and B[0]…B[N].]
Partitioning Memory w/ OpenMP version 5.0
void allocator_example() {
double *array;
omp_allocator_t *allocator;
omp_alloctrait_t traits[] = {
{OMP_ATK_PARTITION, OMP_ATV_BLOCKED}
};
int ntraits = sizeof(traits) / sizeof(*traits);
allocator = omp_init_allocator(omp_default_mem_space, ntraits, traits);
array = (double *) omp_alloc(sizeof(double) * N, allocator);  // assumed allocation call; data is blocked across the partitions
omp_free(array);
}
Summary
• (Correct) memory placement is crucial for performance for most
applications
• OpenMP version 5.0 will bring additional features for more portable
memory optimizations
OpenMP SIMD Programming
Evolution of SIMD on Intel® Architectures
• 128 bit (SSE): 2 x DP / 4 x SP
• 256 bit (AVX): 4 x DP / 8 x SP
• 512 bit (AVX-512): 8 x DP / 16 x SP
SIMD Instructions – Arithmetic Instructions
Operations work on each individual SIMD element
vaddpd dest, source1, source2
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source1
+
b7 b6 b5 b4 b3 b2 b1 b0 source2
=
a7+b7 a6+b6 a5+b5 a4+b4 a3+b3 a2+b2 a1+b1 a0+b0   dest
SIMD Instructions – Fused Instructions
Two operations (e.g., multiply & add) fused into one SIMD instruction
vfmadd213pd source1, source2, source3
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source1
*
b7 b6 b5 b4 b3 b2 b1 b0 source2
+
c7 c6 c5 c4 c3 c2 c1 c0 source3
=
a7*b7+c7 a6*b6+c6 a5*b5+c5 a4*b4+c4 a3*b3+c3 a2*b2+c2 a1*b1+c1 a0*b0+c0   dest (= source1)
SIMD Instructions – Conditional Evaluation
Mask registers limit the effect of instructions to a subset of the SIMD elements
vaddpd dest{k1}, source1, source2
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source1
+
b7 b6 b5 b4 b3 b2 b1 b0 source2
SIMD Instructions – Broadcast
Assign a scalar value to all SIMD elements
vbroadcast dest, scalar
512 bit
s scalar
s s s s s s s s dest
SIMD Instructions – Shuffles, Swizzles, Blends
Instructions to modify the data layout in a SIMD register
vmovapd dest, source{dacb}
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source
swizzle
a7 a4 a6 a5 a3 a0 a2 a1 “tmp”
“move”
a7 a4 a6 a5 a3 a0 a2 a1 dest
Auto-vectorization
• Compilers offer auto-vectorization as an optimization pass
• Usually part of the general loop optimization passes
• Code analysis detects code properties that inhibit SIMD vectorization
• Heuristics determine if SIMD execution might be beneficial
• If all goes well, the compiler will generate SIMD instructions
• Example: Intel® Composer XE
• -vec (automatically enabled with -O2)
• -qopt-report
Interlude: Data Dependencies
• Suppose two statements S1 and S2
• S2 depends on S1, iff S1 must execute before S2
• Control-flow dependence
• Data dependence
• Dependencies can be carried over between loop iterations
• Important flavors of data dependencies
FLOW dependence (read after write):
s1: a = 40;
    b = 21;
s2: c = a + 2;

ANTI dependence (write after read):
    b = 40;
s1: a = b + 1;
s2: b = 21;
Interlude: Loop-carried Dependencies
• Dependencies may occur across loop iterations
• Then they are called “loop-carried dependencies”
• “Distance” of a dependency: number of loop iterations the dependency spans
• The following code contains such a dependency:
void lcd_ex(float* a, float* b, size_t n, float c1, float c2) {
    for (int i = 0; i < n; i++) {
        a[i] = c1 * a[i + 17] + c2 * b[i];
    }
}

Loop-carried dependency between a[i] and a[i+17]; the distance is 17.
• Parallelization: no
(except for very specific loop schedules)
• Vectorization: yes
(iff vector length is shorter than any distance of any dependency)
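A sketch of asserting this to the compiler: since the dependence distance is 17, any vector length up to 16 is safe, which the safelen clause can express.

void lcd_ex_simd(float* a, float* b, size_t n, float c1, float c2) {
    // safelen(16): at most 16 iterations may execute concurrently,
    // staying below the dependence distance of 17.
    #pragma omp simd safelen(16)
    for (int i = 0; i < n; i++) {
        a[i] = c1 * a[i + 17] + c2 * b[i];
    }
}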
Why Auto-vectorizers Fail
• Data dependencies
• Other potential reasons
• Alignment
• Function calls in loop block
• Complex control flow / conditional branches
• Loop not “countable”
• E.g. upper bound not a runtime constant
• Mixed data types
• Non-unit stride between elements
• Loop body too complex (register pressure)
• Vectorization seems inefficient
• Many more … but less likely to occur
Example: Loop not Countable
• “Loop not Countable” plus “Assumed Dependencies”
typedef struct {
float* data;
int size;
} vec_t;
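The loop itself is missing from the extracted slide; a representative sketch (function name assumed) of a loop the compiler treats as uncountable with assumed dependencies:

void vec_eltwise_product(vec_t* a, vec_t* b, vec_t* c) {
    // The trip count a->size sits behind a pointer, and a, b, c may
    // alias, so the auto-vectorizer must assume dependencies.
    for (int i = 0; i < a->size; i++) {
        c->data[i] = a->data[i] * b->data[i];
    }
}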
OpenMP SIMD Loop Construct
• Vectorize a loop nest
• Cut loop into chunks that fit a SIMD vector register
• No parallelization of the loop body
• Syntax (C/C++)
#pragma omp simd [clause[[,] clause],…]
for-loops
• Syntax (Fortran)
!$omp simd [clause[[,] clause],…]
do-loops
Example
float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (int k=0; k<n; k++)
        sum += a[k] * b[k];
    return sum;
}

(The simd construct vectorizes the k loop.)
Data Sharing Clauses
• private(var-list):
Uninitialized vectors for variables in var-list (x: 42 → ? ? ? ?)
• firstprivate(var-list):
Initialized vectors for variables in var-list (x: 42 → 42 42 42 42)
• reduction(op:var-list):
Create private variables for var-list and apply the reduction operator op at the end of the construct (private partials 12, 5, 8, 17 → x: 42)
SIMD Loop Clauses
• safelen (length)
• Maximum number of iterations that can run concurrently without breaking a
dependence
• In practice, maximum vector length
• linear (list[:linear-step])
• The variable’s value is in relationship with the iteration number
• xi = xorig + i * linear-step
• aligned (list[:alignment])
• Specifies that the list items have a given alignment
• Default is alignment for the architecture
• collapse (n)
• Combine the iteration space of the n associated nested loops into a single one before vectorization
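A small sketch combining these clauses (the 32-byte alignment is an assumption about how the arrays were allocated):

void scale(float *a, float *b, int n) {
    // At most 8 lanes needed; a and b promised to be 32-byte aligned.
    #pragma omp simd safelen(8) aligned(a,b:32)
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];
}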
SIMD Worksharing Construct
• Parallelize and vectorize a loop nest
• Distribute a loop’s iteration space across a thread team
• Subdivide loop chunks to fit a SIMD vector register
• Syntax (C/C++)
#pragma omp for simd [clause[[,] clause],…]
for-loops
• Syntax (Fortran)
!$omp do simd [clause[[,] clause],…]
do-loops
[!$omp end do simd [nowait]]
Example
float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp for simd reduction(+:sum)
    for (int k=0; k<n; k++)
        sum += a[k] * b[k];
    return sum;
}

(The loop is first parallelized across the thread team (Thread 0, Thread 1, Thread 2, ...), then each thread's chunk is vectorized.)
Be Careful What You Wish For…
float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp for simd reduction(+:sum) \
            schedule(static, 5)
    for (int k=0; k<n; k++)
        sum += a[k] * b[k];
    return sum;
}
• You should choose chunk sizes that are multiples of the SIMD length
• Remainder loops are not triggered
• Likely better performance
• In the above example, with chunk size 5…
• …and AVX2 (= 8-wide), the code will only execute the remainder loop!
• …and SSE (= 4-wide), the code will have one iteration in the SIMD loop plus one in the remainder loop!
OpenMP 4.5 SIMD Chunks
float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp for simd reduction(+:sum) \
            schedule(simd: static, 5)
    for (int k=0; k<n; k++)
        sum += a[k] * b[k];
    return sum;
}

• The simd schedule modifier rounds each chunk size up to a multiple of the SIMD width, so full SIMD chunks are formed and remainder loops are avoided.
SIMD Function Vectorization
float min(float a, float b) {
    return a < b ? a : b;
}

void example() {
    #pragma omp parallel for simd
    for (int i=0; i<N; i++) {
        d[i] = min(distsq(a[i], b[i]), c[i]);
    }
}
SIMD Function Vectorization
• Declare one or more functions to be compiled for calls from a SIMD-
parallel loop
• Syntax (C/C++):
#pragma omp declare simd [clause[[,] clause],…]
[#pragma omp declare simd [clause[[,] clause],…]]
[…]
function-definition-or-declaration
• Syntax (Fortran):
!$omp declare simd (proc-name-list)
SIMD Function Vectorization
#pragma omp declare simd
float min(float a, float b) {
    return a < b ? a : b;
}

Generated SIMD variant (16-wide, AVX-512 zmm registers):
_ZGVZN16vv_min(%zmm0, %zmm1):
    vminps %zmm1, %zmm0, %zmm0
    ret
SIMD Constructs & Performance
[Bar chart: relative speed-up (higher is better) of ICC auto-vectorization vs. the ICC SIMD directive for Mandelbrot, Volume Rendering, BlackScholes, Fast Walsh, Perlin Noise, and SGpp; the speed-ups shown range from 1.47x to 4.34x.]
M. Klemm, A. Duran, X. Tian, H. Saito, D. Caballero, and X. Martorell. Extending OpenMP with Vector Constructs for Modern Multicore SIMD Architectures. In Proc. of the Intl. Workshop on OpenMP, pages 59-72, Rome, Italy, June 2012. LNCS 7312.
OpenMPCon & IWOMP 2018
• Tentative dates:
• OpenMPCon: Sep 24-25
• Tutorials: Sep 26
• IWOMP: Sep 27-28
Summary
• OpenMP provides a powerful, expressive tasking model