UK OpenMP Users 2018 Advanced OpenMP Tutorial
Michael Klemm
Jim Cownie
http://bit.ly/omp_uk_ug_tut
Multiple choice:
1. All the things that you may have heard of, but have never used…
2. Things which have appeared in OpenMP since you took that
undergraduate course
3. Anything beyond !$omp parallel for
4. All of the above
Major:
• Tasking (coming up soon)
• Vectorization (Michael, after coffee)
• Offload to accelerator devices (covered by Simon this afternoon; we won't cover it here at all, since he can go deeper than we could)
Minor (up next: small, simple topics, to give you time to wake up)
• Lock/critical/atomic (5.0) hints
• New dynamic schedule
Lock/critical/atomic hints
What?
A way of giving the implementation more information about the way
you’d like a lock or critical section to be implemented
A new lock initialization function omp_init_lock_with_hint(…)
A hint clause on omp critical (and, in 5.0 omp atomic)
A set of synchronization hints
omp_sync_hint_{none, contended/uncontended,
speculative/nonspeculative}
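As an illustration, a minimal sketch using both forms (OpenMP 5.0 spellings; OpenMP 4.5 used the omp_lock_hint_* names; the counter is just an example):

#include <omp.h>

omp_lock_t lock;

void init_hinted_lock() {
    // Request a speculative lock (e.g., backed by transactional memory);
    // the implementation is free to ignore the hint.
    omp_init_lock_with_hint(&lock, omp_sync_hint_speculative);
}

void increment(int *counter) {
    // A hint other than "none" requires a named critical section.
    #pragma omp critical (cnt) hint(omp_sync_hint_uncontended)
    (*counter)++;
}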
The tasks created inside compute_in_parallel(A) and compute_in_parallel_too(B) must complete before the BLAS call consumes A and B:

compute_in_parallel(A);
compute_in_parallel_too(B);
#pragma omp taskwait   // C/C++ (Fortran: !$omp taskwait)
cblas_dgemm(…, A, B, …);
Task durations by recursion level:
• lvl 6: 0.16 sec
• lvl 12: 0.047 sec
Every thread is executing ~1.3m tasks in ~5.7 seconds, so the average duration of a task is ~4.4 μs. Tasks get much smaller down the call stack.
Performance Evaluation
Sudoku on 2x Intel Xeon E5-2650 @ 2.0 GHz
[Chart: runtime [sec] for the 16x16 board and speedup over 1-32 threads, Intel C++ 13.1, scatter binding.]
Is this the best we can do?
Performance and Scalability Tuning
Idea: when you have created enough tasks to keep your cores busy, stop creating more tasks!
• if-clause
• final-clause, mergeable-clause
• natively in your program code
Task durations by recursion level: lvl 6: 0.16 sec, lvl 12: 0.047 sec, lvl 48: 0.001 sec. Every thread executes ~1.3m tasks in ~5.7 seconds, so the average task lasts only ~4.4 μs; tasks get much smaller down the call stack.
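For illustration, a minimal recursive sketch (Fibonacci, not the Sudoku code from the slides) that cuts off task creation with the final clause:

#define CUTOFF 12   /* assumed cutoff depth; tune for your machine */

long fib(int n, int depth) {
    long x, y;
    if (n < 2) return n;
    // Once depth reaches CUTOFF the tasks become final: all descendant
    // tasks are final and included, i.e. executed immediately.
    #pragma omp task shared(x) final(depth >= CUTOFF) mergeable
    x = fib(n - 1, depth + 1);
    #pragma omp task shared(y) final(depth >= CUTOFF) mergeable
    y = fib(n - 2, depth + 1);
    #pragma omp taskwait
    return x + y;
}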
Performance Evaluation
Sudoku on 2x Intel Xeon E5-2650 @ 2.0 GHz
[Chart: runtime and speedup over 1-32 threads; with the cutoff, the speedup is markedly higher than before.]
priority Clause
• The priority is a hint to the runtime system for task execution order
• Among all tasks ready to be executed, higher-priority tasks are recommended to execute before lower-priority ones
• priority is a non-negative numerical scalar (default: 0)
• priority <= max-task-priority ICV
• environment variable OMP_MAX_TASK_PRIORITY
• You cannot rely on task execution order being determined by this clause; it's only a hint and can be ignored!
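A short sketch (process() and cost_estimate() are hypothetical helpers) showing how priority can nudge the runtime toward expensive work items first:

void process_all(int n) {
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i++) {
        // Larger value = higher priority; values above the
        // max-task-priority ICV are treated as equal to it.
        #pragma omp task priority(cost_estimate(i))
        process(i);
    }
}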
final Clause
• Syntax (C/C++)
#pragma omp task final(expr)
• Syntax (Fortran)
!$omp task final(expr)
• When expr evaluates to true, the generated task is final: all of its descendant tasks are also final and included, i.e., executed immediately by the encountering thread.
depend Clause
• The task dependence is fulfilled when the predecessor task has completed
• in dependency-type: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause.
• out and inout dependency-types: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause.
• mutexinoutset: only one task in the set may execute at any time (OpenMP 5.0!)
• The list items in a depend clause may include array sections.
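A minimal sketch of a flow (out to in) dependence between sibling tasks, using array sections (init() and use() are assumed helpers):

void producer_consumer(int *x, int n) {
    #pragma omp task depend(out: x[0:n])   // producer writes x
    init(x, n);
    #pragma omp task depend(in: x[0:n])    // consumer reads x and therefore
    use(x, n);                             // runs only after the producer
    #pragma omp taskwait
}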
Full article here: http://www.hpcwire.com/2015/10/19/numerical-algorithms-and-libraries-at-exascale/
taskloop Construct
• Syntax (C/C++)
#pragma omp taskloop [simd] [clause[[,] clause],…]
for-loops
• Syntax (Fortran)
!$omp taskloop [simd] [clause[[,] clause],…]
do-loops
[!$omp end taskloop [simd]]
• grainsize(grain-size)
Each chunk has at least grain-size and at most 2*grain-size loop iterations
• num_tasks(num-tasks)
Create num-tasks tasks for the iterations of the loop
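A small usage sketch (array names assumed):

void saxpy_tasks(float a, float *x, float *y, int n) {
    #pragma omp parallel
    #pragma omp single
    // Each task gets between 512 and 1024 iterations.
    #pragma omp taskloop grainsize(512)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}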
double* A;
A = (double*) malloc(N * sizeof(double));

for (int i = 0; i < N; i++) {
    A[i] = 0.0;
}

[Diagram: two NUMA nodes, each with four cores and on-chip caches, connected through an interconnect to their local memories.]

Non-uniform Memory
Serial code: all array elements are allocated in the memory of the NUMA node closest to the core executing the initializer thread (first touch).
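The extracted slide only shows the allocation for the parallel case; a sketch of the missing parallel first-touch initialization:

double* A;
A = (double*) malloc(N * sizeof(double));

#pragma omp parallel for
for (int i = 0; i < N; i++) {
    A[i] = 0.0;   // each thread touches, and thereby places, its part of A
}

Now each page of A lands in the memory of the NUMA node whose thread touched it first, matching a later parallel access pattern with the same schedule.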
• Goals
• give the user a way to specify where OpenMP threads execute, for locality between OpenMP threads / less false sharing / memory bandwidth
OMP_PLACES Environment Variable
• Assume the following machine:
[Diagram: eight processors p0-p7, spread over two sockets.]
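Typical values for such a machine (a sketch; the abstract names are the portable choice):

OMP_PLACES=threads          # one place per hardware thread
OMP_PLACES=cores            # one place per core
OMP_PLACES=sockets          # one place per socket
OMP_PLACES="{0:4},{4:4}"    # explicit interval syntax: two places of four processors each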
More Examples (1/3)
• Assume the following machine:
[Diagram: the same machine, processors p0-p7.]
• Parallel Region with four threads, one per core, but only on the first
socket
• OMP_PLACES=cores
• #pragma omp parallel num_threads(4) \
proc_bind(close)
More Examples (3/3)
• Spread a nested loop first across two sockets, then among the cores
within each socket, only one thread per core
• OMP_PLACES=cores
• #pragma omp parallel num_threads(2) \
proc_bind(spread)
#pragma omp parallel num_threads(4) \
proc_bind(close)
Places API: Example
• Simple routine printing the processor ids of the place the calling
thread is bound to:
void print_binding_info() {
    int my_place = omp_get_place_num();
    int place_num_procs = omp_get_place_num_procs(my_place);
    int *place_processors = malloc(sizeof(int) * place_num_procs);
    omp_get_place_proc_ids(my_place, place_processors);
    for (int i = 0; i < place_num_procs; i++)
        printf("%d ", place_processors[i]);   // processor ids of this place
    free(place_processors);
}
A First Summary
• Everything is under control now?
• In principle yes, but only if
• threads can be bound explicitly,
• data can be placed well by first-touch, or can be migrated,
• you focus on a specific platform (= OS + arch) → no portability
• Next Touch: the binding of pages to NUMA nodes is removed and pages are placed at the location of the next "touch"; well supported in Solaris, expensive to implement in Linux
• Allocation:
• omp_alloc(size_t size, omp_allocator_t *allocator)
• Deallocation:
• omp_free(void *ptr, const omp_allocator_t *allocator)
• allocator argument is optional
• allocate directive
• Standalone directive for allocation, or declaration of an allocation statement
Example: Using Memory Allocators (v5.0)
void allocator_example(omp_allocator_t *my_allocator) {
int a[M], b[N], c;
#pragma omp allocate(a) allocator(omp_high_bw_mem_alloc)
#pragma omp allocate(b) // controlled by OMP_ALLOCATOR and/or omp_set_default_allocator
double *p = (double *) omp_alloc(N*M*sizeof(*p), my_allocator);
omp_free(p);  // allocator argument is optional
}
OpenMP Task Affinity (v5.0)
• OpenMP version 5.0 will support task affinity
#pragma omp task affinity(<var-reference>)
• Task-to-data affinity
• Hint to execute task as close as possible to the location of the data
OpenMP Task Affinity

void task_affinity() {
    double* B;
    #pragma omp task shared(B)
    {
        B = init_B_and_important_computation(A);
    }
    #pragma omp task firstprivate(B)
    {
        important_computation_too(B);
    }
    #pragma omp taskwait
}

[Diagram: two NUMA nodes; without affinity hints, the two tasks may run on cores of different nodes, far from the data they share.]
OpenMP Task Affinity

void task_affinity() {
    double* B;
    #pragma omp task shared(B) affinity(A[0:N])
    {
        B = init_B_and_important_computation(A);
    }
    #pragma omp task firstprivate(B) affinity(B[0:N])
    {
        important_computation_too(B);
    }
    #pragma omp taskwait
}

[Diagram: two NUMA nodes; the affinity clauses hint that both tasks should run close to the node holding A[0]…A[N] and B[0]…B[N].]
Partitioning Memory w/ OpenMP version 5.0
void allocator_example() {
double *array;
omp_allocator_t *allocator;
omp_alloctrait_t traits[] = {
{OMP_ATK_PARTITION, OMP_ATV_BLOCKED}
};
int ntraits = sizeof(traits) / sizeof(*traits);
allocator = omp_init_allocator(omp_default_mem_space, ntraits, traits);
array = (double *) omp_alloc(sizeof(double) * N, allocator);  // assumed allocation call; data is blocked across the partitions
omp_free(array);
}
Summary
• (Correct) memory placement is crucial for performance for most
applications
• OpenMP version 5.0 will bring additional features for more portable
memory optimizations
OpenMP SIMD Programming
Evolution of SIMD on Intel® Architectures
• 128 bit (SSE): 2 x DP / 4 x SP
• 256 bit (AVX): 4 x DP / 8 x SP
• 512 bit (AVX-512): 8 x DP / 16 x SP
SIMD Instructions – Arithmetic Instructions
Operations work on each individual SIMD element
vaddpd dest, source1, source2
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source1
+
b7 b6 b5 b4 b3 b2 b1 b0 source2
=
a7+b7 a6+b6 a5+b5 a4+b4 a3+b3 a2+b2 a1+b1 a0+b0   dest
SIMD Instructions – Fused Instructions
Two operations (e.g., multiply & add) fused into one SIMD instruction
vfmadd213pd source1, source2, source3
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source1
*
b7 b6 b5 b4 b3 b2 b1 b0 source2
+
c7 c6 c5 c4 c3 c2 c1 c0 source3
=
a7*b7+c7 a6*b6+c6 a5*b5+c5 a4*b4+c4 a3*b3+c3 a2*b2+c2 a1*b1+c1 a0*b0+c0   dest (= source1)
SIMD Instructions – Conditional Evaluation
Mask registers limit the effect of instructions to a subset of the SIMD elements
vaddpd dest{k1}, source1, source2
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source1
+
b7 b6 b5 b4 b3 b2 b1 b0 source2
SIMD Instructions – Broadcast
Assign a scalar value to all SIMD elements
vbroadcast dest, scalar
512 bit
s scalar
s s s s s s s s dest
SIMD Instructions – Shuffles, Swizzles, Blends
Instructions to modify the data layout in a SIMD register
vmovapd dest, source{dacb}
512 bit
a7 a6 a5 a4 a3 a2 a1 a0 source
swizzle
a7 a4 a6 a5 a3 a0 a2 a1 “tmp”
“move”
a7 a4 a6 a5 a3 a0 a2 a1 dest
Auto-vectorization
• Compilers offer auto-vectorization as an optimization pass
• Usually part of the general loop optimization passes
• Code analysis detects code properties that inhibit SIMD vectorization
• Heuristics determine if SIMD execution might be beneficial
• If all goes well, the compiler will generate SIMD instructions
• Example: Intel® Composer XE
• -vec (automatically enabled with -O2)
• -qopt-report
Interlude: Data Dependencies
• Suppose two statements S1 and S2
• S2 depends on S1, iff S1 must execute before S2
• Control-flow dependence
• Data dependence
• Dependencies can be carried over between loop iterations
• Important flavors of data dependencies
FLOW dependence (read after write):
s1: a = 40;
    b = 21;
s2: c = a + 2;

ANTI dependence (write after read):
    b = 40;
s1: a = b + 1;
s2: b = 21;
Interlude: Loop-carried Dependencies
• Dependencies may occur across loop iterations
• Then they are called “loop-carried dependencies”
• “Distance” of a dependency: number of loop iterations the dependency spans
• The following code contains such a dependency:
void lcd_ex(float* a, float* b, size_t n, float c1, float c2) {
    for (int i = 0; i < n; i++) {
        a[i] = c1 * a[i + 17] + c2 * b[i];
    }
}

Loop-carried dependency between a[i] and a[i+17]; the distance is 17.
• Parallelization: no
(except for very specific loop schedules)
• Vectorization: yes
(iff vector length is shorter than any distance of any dependency)
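A sketch of asserting this to the compiler: since the dependence distance is 17, any vector length up to 16 is safe, which the safelen clause can express.

void lcd_ex_simd(float* a, float* b, size_t n, float c1, float c2) {
    // safelen(16): at most 16 iterations may execute concurrently,
    // staying below the dependence distance of 17.
    #pragma omp simd safelen(16)
    for (int i = 0; i < n; i++) {
        a[i] = c1 * a[i + 17] + c2 * b[i];
    }
}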
Why Auto-vectorizers Fail
• Data dependencies
• Other potential reasons
• Alignment
• Function calls in loop block
• Complex control flow / conditional branches
• Loop not “countable”
• E.g. upper bound not a runtime constant
• Mixed data types
• Non-unit stride between elements
• Loop body too complex (register pressure)
• Vectorization seems inefficient
• Many more … but less likely to occur
Example: Loop not Countable
• “Loop not Countable” plus “Assumed Dependencies”
typedef struct {
float* data;
int size;
} vec_t;
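The loop itself is missing from the extracted slide; a representative sketch (function name assumed) of a loop the compiler treats as uncountable with assumed dependencies:

void vec_eltwise_product(vec_t* a, vec_t* b, vec_t* c) {
    // The trip count a->size sits behind a pointer, and a, b, c may
    // alias, so the auto-vectorizer must assume dependencies.
    for (int i = 0; i < a->size; i++) {
        c->data[i] = a->data[i] * b->data[i];
    }
}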
OpenMP SIMD Loop Construct
• Vectorize a loop nest
• Cut loop into chunks that fit a SIMD vector register
• No parallelization of the loop body
• Syntax (C/C++)
#pragma omp simd [clause[[,] clause],…]
for-loops
• Syntax (Fortran)
!$omp simd [clause[[,] clause],…]
do-loops
Example
float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (int k=0; k<n; k++)
        sum += a[k] * b[k];
    return sum;
}

(The simd construct vectorizes the k loop.)
Data Sharing Clauses
• private(var-list):
Uninitialized vectors for variables in var-list (x: 42 → ? ? ? ?)
• firstprivate(var-list):
Initialized vectors for variables in var-list (x: 42 → 42 42 42 42)
• reduction(op:var-list):
Create private variables for var-list and apply the reduction operator op at the end of the construct (private partials 12, 5, 8, 17 → x: 42)
SIMD Loop Clauses
• safelen (length)
• Maximum number of iterations that can run concurrently without breaking a
dependence
• In practice, maximum vector length
• linear (list[:linear-step])
• The variable’s value is in relationship with the iteration number
• xi = xorig + i * linear-step
• aligned (list[:alignment])
• Specifies that the list items have a given alignment
• Default is alignment for the architecture
• collapse (n)
• Combine the iteration space of the n associated nested loops into a single one before vectorization
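A small sketch combining these clauses (the 32-byte alignment is an assumption about how the arrays were allocated):

void scale(float *a, float *b, int n) {
    // At most 8 lanes needed; a and b promised to be 32-byte aligned.
    #pragma omp simd safelen(8) aligned(a,b:32)
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];
}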
SIMD Worksharing Construct
• Parallelize and vectorize a loop nest
• Distribute a loop’s iteration space across a thread team
• Subdivide loop chunks to fit a SIMD vector register
• Syntax (C/C++)
#pragma omp for simd [clause[[,] clause],…]
for-loops
• Syntax (Fortran)
!$omp do simd [clause[[,] clause],…]
do-loops
[!$omp end do simd [nowait]]
Example
float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp for simd reduction(+:sum)
    for (int k=0; k<n; k++)
        sum += a[k] * b[k];
    return sum;
}

(The loop is first parallelized across the thread team (Thread 0, Thread 1, Thread 2, ...), then each thread's chunk is vectorized.)
Be Careful What You Wish For…
float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp for simd reduction(+:sum) \
            schedule(static, 5)
    for (int k=0; k<n; k++)
        sum += a[k] * b[k];
    return sum;
}
• You should choose chunk sizes that are multiples of the SIMD length
• Remainder loops are not triggered
• Likely better performance
• In the above example, with chunk size 5…
• …and AVX2 (= 8-wide), the code will only execute the remainder loop!
• …and SSE (= 4-wide), the code will have one iteration in the SIMD loop plus one in the remainder loop!
OpenMP 4.5 SIMD Chunks
float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp for simd reduction(+:sum) \
            schedule(simd: static, 5)
    for (int k=0; k<n; k++)
        sum += a[k] * b[k];
    return sum;
}

• The simd schedule modifier rounds each chunk size up to a multiple of the SIMD width, so full SIMD chunks are formed and remainder loops are avoided.
SIMD Function Vectorization
float min(float a, float b) {
    return a < b ? a : b;
}

void example() {
    #pragma omp parallel for simd
    for (int i=0; i<N; i++) {
        d[i] = min(distsq(a[i], b[i]), c[i]);
    }
}
SIMD Function Vectorization
• Declare one or more functions to be compiled for calls from a SIMD-
parallel loop
• Syntax (C/C++):
#pragma omp declare simd [clause[[,] clause],…]
[#pragma omp declare simd [clause[[,] clause],…]]
[…]
function-definition-or-declaration
• Syntax (Fortran):
!$omp declare simd (proc-name-list)
SIMD Function Vectorization
#pragma omp declare simd
float min(float a, float b) {
    return a < b ? a : b;
}

Generated SIMD variant (16-wide, AVX-512 zmm registers):
_ZGVZN16vv_min(%zmm0, %zmm1):
    vminps %zmm1, %zmm0, %zmm0
    ret
SIMD Constructs & Performance
[Bar chart: relative speed-up (higher is better) of ICC auto-vectorization vs. the ICC SIMD directive for Mandelbrot, Volume Rendering, BlackScholes, Fast Walsh, Perlin Noise, and SGpp; the speed-ups shown range from 1.47x to 4.34x.]
M. Klemm, A. Duran, X. Tian, H. Saito, D. Caballero, and X. Martorell. Extending OpenMP with Vector Constructs for Modern Multicore SIMD Architectures. In Proc. of the Intl. Workshop on OpenMP, pages 59-72, Rome, Italy, June 2012. LNCS 7312.
OpenMPCon & IWOMP 2018
• Tentative dates:
• OpenMPCon: Sep 24-25
• Tutorials: Sep 26
• IWOMP: Sep 27-28
Summary
• OpenMP provides a powerful, expressive tasking model