OpenMP*
* The name “OpenMP” is the property of the OpenMP Architecture Review Board.
Preliminaries: part 1
• Disclosures
  – The views expressed in this tutorial are those of the people delivering the tutorial.
  – We are not speaking for our employers.
  – We are not speaking for the OpenMP ARB.
• This is a new tutorial for us:
  – Help us improve it … tell us how you would make this tutorial better.
Preliminaries: Part 2
• Our plan for the day … active learning!
  – We will mix short lectures with short exercises.
  – You will use your laptop for the exercises … that way you’ll have an OpenMP environment to take home so you can keep learning on your own.
• Please follow these simple rules:
  – Do the exercises we assign and then change things around and experiment.
  – Embrace active learning!
  – Don’t cheat: do NOT look at the solutions before you complete an exercise … even if you get really frustrated.
Our Plan for the Day

   Topic                   Exercise                  Concepts
   I.   OMP Intro          Install sw, hello_world   Parallel regions
   II.  Creating threads   Pi_spmd_simple            Parallel, default data environment,
                                                     runtime library calls
        Break
   III. Synchronization    Pi_spmd_final             False sharing, critical, atomic

[Figure: the OpenMP solution stack — the end user works through the application; a programming layer of directives, the OpenMP runtime library, and environment variables sits between the application and the compiler; the system layer is beneath.]
Exercise 1, Part A: Hello world
Verify that your sequential environment works
• Write a program that prints “hello world”.

void main()
{
   int ID = 0;
   printf("hello(%d) ", ID);
   printf("world(%d)\n", ID);
}
Exercise 1, Part B: Hello world
Verify that your OpenMP environment works
• Write a multithreaded program that prints “hello world”.

#include "omp.h"
void main()
{
#pragma omp parallel
   {
      int ID = 0;
      printf("hello(%d) ", ID);
      printf("world(%d)\n", ID);
   }
}

Switches for compiling and linking:
   gcc:   -fopenmp
   pgi:   -mp
   intel: /Qopenmp
Exercise 1: Solution
A multi-threaded “Hello world” program
• Write a multithreaded program where each thread prints “hello world”.

#include "omp.h"                     // OpenMP include file
void main()
{
#pragma omp parallel                 // parallel region with default number of threads
   {
      int ID = omp_get_thread_num(); // runtime library function to return a thread ID
      printf("hello(%d) ", ID);
      printf("world(%d)\n", ID);
   }                                 // end of the parallel region
}

Sample output (the interleaving varies from run to run):
   hello(1) hello(0) world(1)
   world(0)
   hello(3) hello(2) world(3)
   world(2)
OpenMP Overview:
How do threads interact?
• OpenMP is a multi-threading, shared-address model.
  – Threads communicate by sharing variables.
• Unintended sharing of data causes race conditions:
  – Race condition: the program’s outcome changes as the threads are scheduled differently.
• To control race conditions:
  – Use synchronization to protect data conflicts.
• Synchronization is expensive, so:
  – Change how data is accessed to minimize the need for synchronization.
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
OpenMP Programming Model:
Fork-Join Parallelism:
• The master thread spawns a team of threads as needed.
• Parallelism is added incrementally until performance goals are met: i.e., the sequential program evolves into a parallel program.
[Figure: the master thread runs the sequential parts; at each parallel region it forks a team of threads (a nested parallel region is shown in red) that joins back at the end of the region.]
Thread Creation: Parallel Regions
• Each thread executes the same code redundantly.

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
   int ID = omp_get_thread_num();
   pooh(ID, A);
}
printf("all done\n");

• A single copy of A is shared between all threads.
• Each of the four threads calls pooh(ID, A) with its own ID (0 through 3).
• Threads wait at the end of the parallel region for all threads to finish before proceeding (i.e., a barrier); only then does the master thread execute the printf.

* The name “OpenMP” is the property of the OpenMP Architecture Review Board
Exercises 2 to 4:
Numerical Integration
Mathematically, we know that:

   ∫_0^1 4.0/(1+x²) dx = π

We can approximate the integral as a sum of N rectangles:

   Σ_{i=0}^{N} F(x_i) Δx ≈ π

where each rectangle has width Δx and height F(x_i) at the middle of interval i.
[Figure: plot of F(x) = 4.0/(1+x²) on [0, 1].]
Exercises 2 to 4: Serial PI Program
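A minimal serial sketch of the midpoint-rule pi calculation, using the same variable names as the parallel versions later in the deck:

static long num_steps = 100000;
double step;

void main ()
{
   int i; double x, pi, sum = 0.0;
   step = 1.0/(double) num_steps;
   for (i=0; i<num_steps; i++){
      x = (i+0.5)*step;             /* midpoint of interval i */
      sum = sum + 4.0/(1.0+x*x);    /* accumulate F(x) */
   }
   pi = step * sum;                 /* multiply by the common rectangle width */
}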
Synchronization: critical
• Mutual exclusion: only one thread at a time can enter a critical region.

float res;
#pragma omp parallel
{
   float B; int i, id, nthrds;
   id = omp_get_thread_num();
   nthrds = omp_get_num_threads();
   for(i=id; i<niters; i+=nthrds){
      B = big_job(i);
#pragma omp critical
      consume (B, res);
   }
}

Threads wait their turn – only one at a time calls consume().
Synchronization: Atomic
• Atomic provides mutual exclusion but only applies to the update of a memory location (the update of X in the following example).

#pragma omp parallel
{
   double tmp, B;
   B = DOIT();
   tmp = big_ugly(B);
#pragma omp atomic
   X += tmp;          // atomic only protects the read/update of X
}
Exercise 3
• In exercise 2, you probably used an array to create space for each thread to store its partial sum.
• If array elements happen to share a cache line, this leads to false sharing.
  – Non-shared data land in the same cache line, so each update invalidates the whole line … in essence “sloshing independent data” back and forth between threads.
• Modify your “pi program” from exercise 2 to avoid false sharing due to the sum array.
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
SPMD vs. worksharing
• A parallel construct by itself creates an SPMD or “Single Program Multiple Data” program … i.e., each thread redundantly executes the same code.
• How do you split up pathways through the code between threads within a team? This is called worksharing:
  – Loop construct
  – Sections/section constructs (discussed later)
  – Single construct (discussed later)
  – Task construct … coming in OpenMP 3.0
The loop worksharing constructs
• The loop worksharing construct splits up loop iterations among the threads in a team.

#pragma omp parallel
{
#pragma omp for
   for (i=0; i<N; i++){
      NEAT_STUFF(i);
   }
}

The loop construct name is “for” in C/C++ and “do” in Fortran.

• The same vector-add loop, written as an OpenMP parallel region (SPMD style):

#pragma omp parallel
{
   int id, i, Nthrds, istart, iend;
   id = omp_get_thread_num();
   Nthrds = omp_get_num_threads();
   istart = id * N / Nthrds;
   iend = (id+1) * N / Nthrds;
   if (id == Nthrds-1) iend = N;
   for(i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
}

• …and as an OpenMP parallel region with a worksharing for construct:

#pragma omp parallel
#pragma omp for
   for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }
Combined parallel/worksharing construct
• OpenMP provides a shortcut that puts the “parallel” and worksharing directives on one line; the two forms in the sketch below are equivalent.
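A sketch of the equivalence, reusing the vector-add loop from the previous slide:

#pragma omp parallel
{
#pragma omp for
   for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }
}

/* …is equivalent to… */

#pragma omp parallel for
for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }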
Working with loops
• Basic approach:
  – Find compute-intensive loops.
  – Make the loop iterations independent … so they can safely execute in any order without loop-carried dependencies.
  – Place the appropriate OpenMP directive and test.
Reduction
• OpenMP reduction clause:
      reduction (op : list)
• Inside a parallel or a work-sharing construct:
  – A local copy of each list variable is made and initialized depending on the “op” (e.g., 0 for “+”).
  – The compiler finds standard reduction expressions containing “op” and uses them to update the local copy.
  – Local copies are reduced into a single value and combined with the original global value.
• The variables in “list” must be shared in the enclosing parallel region.

double ave=0.0, A[MAX]; int i;
#pragma omp parallel for reduction (+:ave)
for (i=0; i<MAX; i++) {
   ave += A[i];
}
ave = ave/MAX;
OpenMP: Reduction operands/initial-values
• Many different associative operands can be used with reduction.
• Initial values are the ones that make sense mathematically; for C/C++ the core set is:

   Operator   Initial value
   +          0
   *          1
   -          0
   &          ~0
   |          0
   ^          0
   &&         1
   ||         0
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
Synchronization: Barrier
• Barrier: each thread waits until all threads arrive.

#pragma omp parallel shared (A, B, C) private(id)
{
   id = omp_get_thread_num();
   A[id] = big_calc1(id);
#pragma omp barrier                             // explicit barrier
#pragma omp for
   for(i=0; i<N; i++){ C[i] = big_calc3(i,A); }
                                                // implicit barrier at the end of a
                                                // for worksharing construct
#pragma omp for nowait
   for(i=0; i<N; i++){ B[i] = big_calc2(C, i); }
                                                // no implicit barrier due to nowait
   A[id] = big_calc4(id);
}                                               // implicit barrier at the end of a
                                                // parallel region
Master Construct
• The master construct denotes a structured block that is only executed by the master thread.
• The other threads just skip it; no synchronization is implied (see the sketch below).
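A minimal sketch of the construct; do_many_things(), exchange_boundaries(), and do_many_other_things() are placeholder names:

#pragma omp parallel
{
   do_many_things();
#pragma omp master
   { exchange_boundaries(); }   /* only the master thread runs this */
#pragma omp barrier             /* add an explicit barrier if the others must wait */
   do_many_other_things();
}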
Synchronization: Simple Locks
• Protect a resource with a lock from the runtime library:

omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel private (tmp, id)
{
   id = omp_get_thread_num();
   tmp = do_lots_of_work(id);
   omp_set_lock(&lck);          // wait here for your turn
   printf("%d %d", id, tmp);
   omp_unset_lock(&lck);        // release the lock so the next thread gets a turn
}
omp_destroy_lock(&lck);         // free up storage when done
Runtime Library routines
• Runtime environment routines:
  – Modify/check the number of threads:
      omp_set_num_threads(), omp_get_num_threads(),
      omp_get_thread_num(), omp_get_max_threads()
  – Are we in an active parallel region?
      omp_in_parallel()
  – Do you want the system to dynamically vary the number of threads from one parallel construct to another?
      omp_set_dynamic(), omp_get_dynamic()
  – How many processors in the system?
      omp_get_num_procs()
Runtime Library routines
• To use a known, fixed number of threads in a program: (1) tell the system that you don’t want dynamic adjustment of the number of threads, (2) set the number of threads, then (3) save the number you got.

#include <omp.h>
void main()
{  int num_threads;
   omp_set_dynamic( 0 );          // disable dynamic adjustment of the number of threads
   omp_set_num_threads( omp_get_num_procs() );  // request as many threads as processors
#pragma omp parallel
   {  int id = omp_get_thread_num();
#pragma omp single                // protect this op since memory stores are not atomic
      num_threads = omp_get_num_threads();
      do_lots_of_stuff(id);
   }
}

Even in this case, the system may give you fewer threads than requested. If the precise number of threads matters, test for it and respond accordingly.
Environment Variables
• Set the default number of threads to use:
  – OMP_NUM_THREADS int_literal
• Control how “omp for schedule(RUNTIME)” loop iterations are scheduled:
  – OMP_SCHEDULE “schedule[, chunk_size]”
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
Data environment:
Default storage attributes
• Shared memory programming model:
  – Most variables are shared by default (see the sketch below).
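A sketch illustrating the defaults; the next slide’s A, index, count, and temp refer to declarations like these (work() is a placeholder):

#include <stdio.h>

double A[10];                 /* file scope: shared */

void work(int *index) {
   double temp[10];           /* stack variable in a called function: private */
   static int count;          /* static: shared */
   /* ... */
}

int main() {
   int index[10];             /* declared before the parallel region: shared */
#pragma omp parallel
   work(index);
   printf("%d\n", index[0]);
}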
Data sharing: Examples
• A, index, and count are shared by all threads.
• temp is local to each thread.
[Figure: each thread gets its own temp; a single A, index, and count are shared across the team.]
Data sharing:
Changing storage attributes
• One can selectively change storage attributes for constructs using the following clauses*:
  – SHARED
  – PRIVATE
  – FIRSTPRIVATE
• The final value of a private inside a parallel loop can be transmitted to the shared variable outside the loop with:
  – LASTPRIVATE
• The default attributes can be overridden with:
  – DEFAULT (PRIVATE | SHARED | NONE)
    DEFAULT(PRIVATE) is Fortran only.

All the clauses on this page apply to the OpenMP construct, NOT to the entire region.
* All data clauses apply to parallel constructs and worksharing constructs except “shared”, which only applies to parallel constructs.
Data Sharing: Private Clause
• private(var) creates a new local copy of var for each thread.
  – The value is uninitialized.
  – In OpenMP 2.5 the value of the shared variable is undefined after the region.

void wrong() {
   int tmp = 0;
#pragma omp for private(tmp)
   for (int j = 0; j < 1000; ++j)
      tmp += j;            // tmp was not initialized
   printf("%d\n", tmp);    // tmp: 0 in 3.0, unspecified in 2.5
}
Data Sharing: Private Clause
When is the original variable valid?
• The original variable’s value is unspecified in OpenMP 2.5.
• In OpenMP 3.0, if it is referenced outside of the construct:
  – Implementations may reference the original variable or a copy … a dangerous programming practice!

int tmp;
void danger() {
   tmp = 0;
#pragma omp parallel private(tmp)
   work();
   printf("%d\n", tmp);   // tmp has unspecified value
}

extern int tmp;
void work() {
   tmp = 5;               // unspecified which copy of tmp this updates
}
Data Sharing: Firstprivate Clause
• Firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

void useless() {
   int tmp = 0;
#pragma omp for firstprivate(tmp)
   for (int j = 0; j < 1000; ++j)
      tmp += j;            // each thread gets its own tmp with an initial value of 0
   printf("%d\n", tmp);    // tmp: 0 in 3.0, unspecified in 2.5
}
Data sharing: Lastprivate Clause
• Lastprivate passes the value of a private from the last iteration to a global variable.

void closer() {
   int tmp = 0;
#pragma omp parallel for firstprivate(tmp) \
        lastprivate(tmp)
   for (int j = 0; j < 1000; ++j)
      tmp += j;            // each thread gets its own tmp with an initial value of 0
   printf("%d\n", tmp);    // tmp is defined as its value at the “last sequential”
                           // iteration (i.e., for j=999)
}
Data Sharing:
A data environment test
• Consider this example of PRIVATE and FIRSTPRIVATE.
• Are the variables local to each thread or shared inside the parallel region?
• What are their initial values inside, and values after, the parallel region?

These two code fragments are equivalent:

      itotal = 1000
C$OMP PARALLEL PRIVATE(np, each)
      np = omp_get_num_threads()
      each = itotal/np
      ………
C$OMP END PARALLEL

      itotal = 1000
C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(itotal)
      np = omp_get_num_threads()
      each = itotal/np
      ………
C$OMP END PARALLEL
A threadprivate example (C)
Use threadprivate to create a counter for each thread.

int counter = 0;
#pragma omp threadprivate(counter)

int increment_counter()
{
   counter++;
   return (counter);
}
Data Copying: Copyin
You initialize threadprivate data using a copyin clause.

      parameter (N=1000)
      common/buf/A(N)
!$OMP THREADPRIVATE(/buf/)

C     Initialize the A array
      call init_data(N,A)

!$OMP PARALLEL COPYIN(A)

      … Now each thread sees the threadprivate array A initialized
      … to the global value set in the subroutine init_data()

!$OMP END PARALLEL

      end
Data Copying: Copyprivate
Used with a single region to broadcast values of privates from one member of a team to the rest of the team.

#include <omp.h>
void input_parameters (int*, int*);  // fetch values of input parameters (by reference)
void do_work(int, int);

void main()
{
   int Nsize, choice;

#pragma omp parallel private (Nsize, choice)
   {
#pragma omp single copyprivate (Nsize, choice)
      input_parameters (&Nsize, &choice);

      do_work(Nsize, choice);
   }
}
Exercise 5: Monte Carlo Calculations
Using random numbers to solve tough problems
• Sample a problem domain to estimate areas, compute probabilities, find optimal values, etc.
• Example: computing π with a digital dart board: throw random darts at a square that inscribes a circle; the fraction that lands inside the circle estimates π/4.
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
Sections worksharing Construct
• The Sections worksharing construct gives a different structured block to each thread.

#pragma omp parallel
{
#pragma omp sections
   {
#pragma omp section
      X_calculation();
#pragma omp section
      y_calculation();
#pragma omp section
      z_calculation();
   }
}
Exercise 6: hard
• Consider the program linked.c
  – It traverses a linked list, computing a sequence of Fibonacci numbers at each node.
• Parallelize this program using constructs defined in OpenMP 2.5 (loop worksharing constructs).
• Once you have a correct program, optimize it.
Exercise 6: easy
• Parallelize the matrix multiplication program in the file matmul.c
• Can you optimize the program by playing with how the loops are scheduled?
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
OpenMP memory model
• OpenMP supports a shared memory model.
• All threads share an address space, but it can get complicated:
[Figure: source code is compiled into a program whose reads and writes (Wa, Wb, Ra, Rb) from threads a and b target a shared memory; each thread also has its own temporary view, and writes reach shared memory in a commit order.]
Consistency: Memory Access Re-ordering
• Re-ordering:
  – The compiler re-orders program order to the code order.
  – The machine re-orders code order to the memory commit order.
• At a given point in time, the temporary view of memory may vary from shared memory.
• Consistency models are based on orderings of Reads (R), Writes (W), and Synchronizations (S):
  – R→R, W→W, R→W, R→S, S→S, W→S
Consistency
• Sequential Consistency:
  – In a multi-processor, ops (R, W, S) are sequentially consistent if:
     – They remain in program order for each processor.
     – They are seen to be in the same overall order by each of the other processors.
  – Program order = code order = commit order
• Relaxed consistency:
  – Remove some of the ordering constraints for memory ops (R, W, S).
OpenMP and Relaxed Consistency
• OpenMP uses a relaxed-consistency memory model: the programmer enforces ordering where it matters through synchronization operations (flushes), described on the next slide.
Flush
• Defines a sequence point at which a thread is guaranteed to see a consistent view of memory with respect to the “flush set”.
• The flush set is:
  – “all thread-visible variables” for a flush construct without an argument list.
  – a list of variables when the “flush(list)” construct is used.

double A;
A = compute();
#pragma omp flush(A)   // flush to memory to make sure other
                       // threads can pick up the right value

Note: OpenMP’s flush is analogous to a fence in other shared memory APIs.
Exercise 7: producer consumer
• Parallelize the “prod_cons.c” program.
• This is a well-known pattern called the producer-consumer pattern:
  – One thread produces values that another thread consumes.
  – Often used with a stream of produced values to implement “pipeline parallelism”.
• The key is to implement pairwise synchronization between threads.
Exercise 7: prod_cons.c

int main()
{
   double *A, sum, runtime; int flag = 0;
   A = (double *)malloc(N*sizeof(double));
   runtime = omp_get_wtime();
   …
}

I need to put the prod/cons pair inside a loop so it’s true pipeline parallelism.
[Figure: the birth of OpenMP, 1997 — SGI, HP, IBM, Cray, and Intel had merged products and needed commonality; ISVs such as KAI needed a larger market; ASCI was tired of recoding for SMPs and urged vendors to standardize; a rough-draft straw-man SMP API was written and other vendors were invited to join.]
OpenMP Release History
• 1998: OpenMP C/C++ 1.0
• 2002: OpenMP C/C++ 2.0
• 2005: OpenMP 2.5 — a single specification for Fortran, C, and C++
• 2008: OpenMP 3.0
3.0
Tasks
• Adding tasking is the biggest addition for 3.0.
3.0
Definitions
• Task construct – a task directive plus a structured block.
• Task – the package of code and instructions for allocating data created when a thread encounters a task construct.
• Task region – the dynamic sequence of instructions produced by the execution of a task by a thread.
3.0
task Construct

#pragma omp task [clause[[,]clause] ...]
    structured-block

where clause can be one of:
   if (expression)
   untied
   shared (list)
   private (list)
   firstprivate (list)
   default( shared | none )

See the sketch below for the construct in use.
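A minimal usage sketch — a hypothetical list traversal where each node is processed as an explicit task (node, head, and process() are assumed names):

#pragma omp parallel
{
#pragma omp single                    /* one thread creates all the tasks */
   {
      node *p = head;
      while (p) {
#pragma omp task firstprivate(p)      /* capture the current pointer value */
         process(p);
         p = p->next;                 /* the single thread keeps walking the list */
      }
   }
}                                     /* implicit barrier: all tasks complete here */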
3.0
The if clause
• When the if clause argument is false:
  – The task is executed immediately by the encountering thread.
  – The data environment is still local to the new task...
  – ...and it’s still a different task with respect to synchronization.
3.0
Task synchronization
• At task barriers:
  – #pragma omp taskwait
  – Wait until all tasks defined in the current task have completed.
  – Note: applies only to tasks generated in the current task, not to “descendants” (see the sketch below).
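A minimal sketch; foo() and bar() are placeholder names:

#pragma omp parallel
{
#pragma omp single
   {
#pragma omp task
      foo();                  /* child task 1 */
#pragma omp task
      bar();                  /* child task 2 */
#pragma omp taskwait          /* wait for foo() and bar() to complete */
      printf("both tasks done\n");
   }
}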
3.0
Task switching
• Certain constructs have task scheduling points at defined locations within them.
• When a thread encounters a task scheduling point, it is allowed to suspend the current task and execute another (called task switching).
• It can then return to the original task and resume.
3.0
Thread switching

#pragma omp single
{
#pragma omp task untied
   for (i=0; i<ONEZILLION; i++)
#pragma omp task
      process(item[i]);
}
3.0
Conclusions on tasks
• Enormous amount of work by many people.
3.0
Nested parallelism
• Better support for nested parallelism.
• Per-thread internal control variables:
  – Allows, for example, calling omp_set_num_threads() inside a parallel region.
  – Controls the team sizes for the next level of parallelism (see the sketch below).
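A minimal sketch of the 3.0 behavior; do_work() is a placeholder:

#include <omp.h>

void nested_example(void) {
   omp_set_nested(1);          /* enable nested parallel regions */
   omp_set_num_threads(2);
#pragma omp parallel           /* outer team of 2 */
   {
      /* In 3.0 this ICV is per-thread, so the call is legal here and
         affects only the calling thread's next parallel region. */
      omp_set_num_threads(3);
#pragma omp parallel           /* each outer thread forks an inner team of 3 */
      do_work(omp_get_thread_num());
   }
}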
Parallel loops
• Guarantee that this works … i.e., that the same schedule is used in the two loops:

!$omp do schedule(static)
      do i=1,n
         a(i) = ....
      end do
!$omp end do nowait
!$omp do schedule(static)
      do i=1,n
         .... = a(i)
      end do
3.0
Loops (cont.)
• Allow collapsing of perfectly nested loops (see the sketch below).
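A minimal sketch of the 3.0 collapse clause; the two loop nests are merged into a single iteration space that is then scheduled (foo() is a placeholder):

#pragma omp parallel for collapse(2)
for (int i = 0; i < N; i++) {
   for (int j = 0; j < M; j++) {
      foo(i, j);   /* N*M iterations are divided among the threads */
   }
}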
Exercise 8: tasks in OpenMP
• Consider the program linked.c
  – It traverses a linked list, computing a sequence of Fibonacci numbers at each node.
• Parallelize this program using tasks.
• Compare your solution’s complexity to the approach without tasks.
Conclusion
• OpenMP 3.0 is a major upgrade … it expands the range of algorithms accessible from OpenMP.
• OpenMP is fun and about “as easy as we can make it” for applications programmers working with shared memory platforms.
OpenMP Organizations
• OpenMP Architecture Review Board URL, the “owner” of the OpenMP specification:
      www.openmp.org
• OpenMP User’s Group (cOMPunity) URL:
      www.compunity.org

Get involved, join cOMPunity, and help define the future of OpenMP.
Books about OpenMP
OpenMP Papers (continued)
• B. Chapman, F. Bregier, A. Patil, and A. Prabhakar, “Achieving performance under OpenMP on ccNUMA and software distributed shared memory systems,” Concurrency and Computation: Practice and Experience, 14(8-9): 713-739, 2002.
• J. M. Bull and M. E. Kambites, “JOMP: an OpenMP-like interface for Java,” Proceedings of the ACM 2000 Conference on Java Grande, 2000, pp. 44-53.
• L. Adhianto and B. Chapman, “Performance modeling of communication and computation in hybrid MPI and OpenMP applications,” Simulation Modeling Practice and Theory, vol. 15, pp. 481-491, 2007.
• S. Shah, G. Haab, P. Petersen, and J. Throop, “Flexible control structures for parallelism in OpenMP,” Concurrency: Practice and Experience, 12: 1219-1239, 2000. John Wiley & Sons, Ltd.
• T. G. Mattson, “How good is OpenMP?” Scientific Programming, vol. 11, no. 2, pp. 81-93, 2003.
• A. Duran, R. Silvera, J. Corbalan, and J. Labarta, “Runtime Adjustment of Parallel Nested Loops,” Shared Memory Parallel Programming with OpenMP, Lecture Notes in Computer Science, vol. 3349, p. 137, 2005.
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
Exercise 1: Solution
A multi-threaded “Hello world” program
• Write a multithreaded program where each thread prints “hello world”.

#include "omp.h"                     // OpenMP include file
void main()
{
#pragma omp parallel                 // parallel region with default number of threads
   {
      int ID = omp_get_thread_num(); // runtime library function to return a thread ID
      printf("hello(%d) ", ID);
      printf("world(%d)\n", ID);
   }                                 // end of the parallel region
}

Sample output (the interleaving varies from run to run):
   hello(1) hello(0) world(1)
   world(0)
   hello(3) hello(2) world(3)
   world(2)
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
The SPMD pattern
• The most common approach for parallel algorithms is the SPMD or Single Program Multiple Data pattern.
• Each thread runs the same program (Single Program), but using the thread ID, they operate on different data (Multiple Data) or take slightly different paths through the code.
• In OpenMP this means:
  – A parallel region “near the top of the code”.
  – Pick up the thread ID and the number of threads.
  – Use them to split up loops and select different blocks of data to work on.
Exercise 2: A simple SPMD pi program

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{  int i, nthreads; double pi, sum[NUM_THREADS];
   step = 1.0/(double) num_steps;
   omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
   {
      int i, id, nthrds;
      double x;
      id = omp_get_thread_num();
      nthrds = omp_get_num_threads();
      if (id == 0) nthreads = nthrds;
      for (i=id, sum[id]=0.0; i<num_steps; i=i+nthrds) {
         x = (i+0.5)*step;
         sum[id] += 4.0/(1.0+x*x);
      }
   }
   for(i=0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
}

• Promote the scalar sum to an array dimensioned by the number of threads to avoid a race condition.
• Only one thread should copy the number of threads to the global value, to make sure multiple threads writing to the same address don’t conflict.
• i=i+nthrds is a common trick in SPMD programs to create a cyclic distribution of loop iterations.
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
False sharing
• If independent data elements happen to sit on the same cache line, each update will cause the cache lines to “slosh back and forth” between threads.
  – This is called “false sharing”.
• If you promote scalars to an array to support creation of an SPMD program, the array elements are contiguous in memory and hence share cache lines.
  – Result … poor scalability.
• Solution:
  – When updates to an item are frequent, work with local copies of data instead of an array indexed by the thread ID.
  – Pad arrays so elements you use are on distinct cache lines.
Exercise 3: SPMD Pi without false sharing

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{  double pi = 0.0;
   step = 1.0/(double) num_steps;
   omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
   {
      int i, id, nthrds; double x, sum;
      id = omp_get_thread_num();
      nthrds = omp_get_num_threads();
      for (i=id, sum=0.0; i<num_steps; i=i+nthrds){
         x = (i+0.5)*step;
         sum += 4.0/(1.0+x*x);
      }
#pragma omp critical
      pi += sum * step;
   }
}

• Create a scalar local to each thread to accumulate partial sums: no array, so no false sharing.
• sum goes “out of scope” beyond the parallel region … so you must sum it in here, and you must protect the summation into pi with a critical region so updates don’t conflict.
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
Exercise 4: solution

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{  int i; double x, pi, sum = 0.0;
   step = 1.0/(double) num_steps;
   omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private(x) reduction(+:sum)
   for (i=0; i<num_steps; i++){
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
   }
   pi = step * sum;
}

• For good OpenMP implementations, reduction is more scalable than critical.
• i is private by default.
• Note: we created a parallel program without changing any code, simply by adding four simple lines!
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
Computers and random numbers
• We use “dice” to make random numbers:
  – Given previous values, you cannot predict the next value.
  – There are no patterns in the series … and it goes on forever.
Monte Carlo Calculations:
Using random numbers to solve tough problems
• Sample a problem domain to estimate areas, compute probabilities, find optimal values, etc.
• Example: computing π with a digital dart board, as in the sketch below: darts land uniformly in a square that inscribes a circle of radius r, and the fraction inside the circle estimates π/4, so pi = 4.0 * ((double)Ncirc/(double)num_trials).
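A minimal serial sketch of the dart-board program; seed() and random() are assumed helpers that configure and draw from a uniform generator on [-r, r] (only the final pi and printf lines come from the slide itself):

#include <stdio.h>

void seed(double lo, double hi);   /* assumed helper */
double random(void);               /* assumed helper */

static long num_trials = 10000;

int main ()
{
   long i, Ncirc = 0;
   double pi, x, y;
   double r = 1.0;                     /* circle radius; the square has side 2*r */

   seed(-r, r);                        /* configure the generator */
   for (i = 0; i < num_trials; i++) {
      x = random();  y = random();     /* one random dart */
      if (x*x + y*y <= r*r) Ncirc++;   /* did it land inside the circle? */
   }
   pi = 4.0 * ((double)Ncirc/(double)num_trials);
   printf("\n %ld trials, pi is %f \n", num_trials, pi);
   return 0;
}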
Linear Congruential Generator (LCG)
• LCG: easy to write, cheap to compute, portable, OK quality.
• The recurrence: random_next = (MULTIPLIER * random_last + ADDEND) % PMOD.
LCG code

static long MULTIPLIER  = 1366;
static long ADDEND      = 150889;
static long PMOD        = 714025;
long random_last = 0;      // seed the pseudo random sequence by setting random_last

double random ()
{
   long random_next;
   random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
   random_last = random_next;
   return ((double)random_next/(double)PMOD);
}
Running the PI_MC program with LCG generator
[Figure: log10 relative error vs. log10 number of samples for the LCG with one thread and with 4 threads (three separate trials); the 4-thread runs diverge from each other and from the one-thread result.]
• Run the same program the same way and get different answers! That is not acceptable!
• Issue: my LCG generator is not threadsafe.

Program written using the Intel C/C++ compiler (10.0.659.2005) in Microsoft Visual Studio 2005 (8.0.50727.42) and running on a dual-core laptop (Intel T2400 @ 1.83 GHz with 2 GB RAM) running Microsoft Windows XP.
LCG code: threadsafe version

static long MULTIPLIER  = 1366;
static long ADDEND      = 150889;
static long PMOD        = 714025;
long random_last = 0;
#pragma omp threadprivate(random_last)

double random ()
{
   long random_next;
   random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
   random_last = random_next;
   return ((double)random_next/(double)PMOD);
}

random_last carries state between random number computations. To make the generator threadsafe, make random_last threadprivate so each thread has its own copy.
Thread safe random number generators
[Figure: log10 relative error vs. log10 number of samples for the LCG with one thread, with 4 threads (three trials), and with the thread-safe 4-thread version.]
The thread-safe version gives the same answer each time you run the program. But for a large number of samples, its quality is lower than the one-thread result! Why?
Pseudo Random Sequences
• Random number generators (RNGs) define a sequence of pseudo-random numbers of length equal to the period of the RNG.
[Figure: three threads seeded arbitrarily grab overlapping subsequences of the RNG’s period.]
• Using MKL’s vector statistics library (VSL), each thread can own an independent stream:

VSLStreamStatePtr stream;
#pragma omp threadprivate(stream)
…
vslNewStream(&ran_stream, VSL_BRNG_WH+Thrd_ID, (int)seed);
Independent Generator for each thread
[Figure: log10 relative error vs. log10 number of samples with one independent Wichmann-Hill stream per thread; once you get beyond the high-error, small-sample range, the curves for different thread counts coincide.]
Leap Frog method
• Interleave samples in the sequence of pseudo random numbers:
  – Thread i starts at the ith number in the sequence.
  – Stride through the sequence with stride length = number of threads.
• Result … the same sequence of values regardless of the number of threads (see the sketch below).

We used the MKL library with two generator streams per computation: one for the x values (WH) and one for the y values (WH+1), and the leapfrog method to deal out iterations among threads.
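A minimal sketch of leapfrogging applied to the earlier LCG — a hypothetical illustration, not the MKL code. Each thread jumps its state ahead to position id and then strides by the number of threads:

#include <omp.h>

static long MULTIPLIER = 1366, ADDEND = 150889, PMOD = 714025;
static long random_last = 0;
#pragma omp threadprivate(random_last)

static void lcg_step() {                /* one step of the LCG recurrence */
   random_last = (MULTIPLIER * random_last + ADDEND) % PMOD;
}

void leapfrog_init(long seed) {         /* call inside the parallel region */
   int i, id = omp_get_thread_num();
   random_last = seed;
   for (i = 0; i < id; i++) lcg_step(); /* thread i starts at the ith number */
}

double leapfrog_random() {
   int i, nthrds = omp_get_num_threads();
   double v = (double)random_last / (double)PMOD;
   for (i = 0; i < nthrds; i++) lcg_step();  /* stride = number of threads */
   return v;                            /* same sequence for any thread count */
}

Stepping one value at a time makes each draw O(number of threads); production leapfrog implementations instead compose the recurrence algebraically to jump k steps at once.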
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
Linked lists without tasks
• See the file Linked_omp25.c

while (p != NULL) {            // count the number of items in the linked list
   p = p->next;
   count++;
}
p = head;
for(i=0; i<count; i++) {       // copy a pointer to each node into an array
   parr[i] = p;
   p = p->next;
}
#pragma omp parallel
{
#pragma omp for schedule(static,1)
   for(i=0; i<count; i++)      // process nodes in parallel with a for loop
      processwork(parr[i]);
}

                  Default schedule   schedule(static,1)
   One thread     48 seconds         45 seconds
   Two threads    39 seconds         28 seconds

Results on an Intel dual-core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2.
Linked lists without tasks: C++ STL
• See the file Linked_cpp.cpp
Matrix multiplication

#pragma omp parallel for private(tmp, i, j, k)
for (i=0; i<Ndim; i++){
   for (j=0; j<Mdim; j++){
      tmp = 0.0;
      for(k=0; k<Pdim; k++){
         /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
         tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
      }
      *(C+(i*Ndim+j)) = tmp;
   }
}
Pairwise synchronization in OpenMP
• OpenMP lacks high-level synchronization between individual thread pairs; you build it yourself from a shared flag variable and flushes, as sketched after the code fragment below.
Exercise 7: producer consumer

int main()
{
   double *A, sum, runtime; int numthreads, flag = 0;
   A = (double *)malloc(N*sizeof(double));
   …
}
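A sketch of the flag-based pairwise synchronization the exercise is after, continuing from the fragment above; fill_rand() and Sum_array() are assumed helper names for the producer and consumer work:

#pragma omp parallel sections
{
#pragma omp section                 /* producer */
   {
      fill_rand(N, A);              /* produce the data */
#pragma omp flush                   /* make A visible before signaling */
      flag = 1;
#pragma omp flush (flag)            /* publish the flag */
   }
#pragma omp section                 /* consumer */
   {
      while (1) {                   /* spin until the producer signals */
#pragma omp flush (flag)
         if (flag == 1) break;
      }
#pragma omp flush                   /* make sure we read the produced A */
      sum = Sum_array(N, A);
   }
}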
Linked lists with tasks (intel taskq)
• See the file Linked_intel_taskq.c