OpenMP*
* The name “OpenMP” is the property of the OpenMP Architecture Review Board.
Preliminaries: part 1
• Disclosures
  – The views expressed in this tutorial are those of the people delivering the tutorial.
  – We are not speaking for our employers.
  – We are not speaking for the OpenMP ARB.
• This is a new tutorial for us:
  – Help us improve it … tell us how you would make this tutorial better.
Preliminaries: Part 2
• Our plan for the day … active learning!
  – We will mix short lectures with short exercises.
  – You will use your laptop for the exercises … that way you’ll have an OpenMP environment to take home so you can keep learning on your own.
• Please follow these simple rules:
  – Do the exercises we assign and then change things around and experiment.
  – Embrace active learning!
  – Don’t cheat: do NOT look at the solutions before you complete an exercise … even if you get really frustrated.
Our Plan for the Day

   Topic                   Exercise                  Concepts
   I.   OMP Intro          Install sw, hello_world   Parallel regions
   II.  Creating threads   Pi_spmd_simple            Parallel, default data environment,
                                                     runtime library calls
        Break
   III. Synchronization    Pi_spmd_final             False sharing, critical, atomic

[Figure: the OpenMP solution stack — the end user works through the application; a programming layer of directives, the OpenMP runtime library, and environment variables sits between the application and the compiler; the system layer is beneath.]
Exercise 1, Part A: Hello world
Verify that your sequential environment works
• Write a program that prints “hello world”.

void main()
{
   int ID = 0;
   printf("hello(%d) ", ID);
   printf("world(%d)\n", ID);
}
Exercise 1, Part B: Hello world
Verify that your OpenMP environment works
• Write a multithreaded program that prints “hello world”.

#include "omp.h"
void main()
{
#pragma omp parallel
   {
      int ID = 0;
      printf("hello(%d) ", ID);
      printf("world(%d)\n", ID);
   }
}

Switches for compiling and linking:
   gcc:   -fopenmp
   pgi:   -mp
   intel: /Qopenmp
Exercise 1: Solution
A multi-threaded “Hello world” program
• Write a multithreaded program where each thread prints “hello world”.

#include "omp.h"                     // OpenMP include file
void main()
{
#pragma omp parallel                 // parallel region with default number of threads
   {
      int ID = omp_get_thread_num(); // runtime library function to return a thread ID
      printf("hello(%d) ", ID);
      printf("world(%d)\n", ID);
   }                                 // end of the parallel region
}

Sample output (the interleaving varies from run to run):
   hello(1) hello(0) world(1)
   world(0)
   hello(3) hello(2) world(3)
   world(2)
OpenMP Overview:
How do threads interact?
• OpenMP is a multi-threading, shared-address model.
  – Threads communicate by sharing variables.
• Unintended sharing of data causes race conditions:
  – Race condition: the program’s outcome changes as the threads are scheduled differently.
• To control race conditions:
  – Use synchronization to protect data conflicts.
• Synchronization is expensive, so:
  – Change how data is accessed to minimize the need for synchronization.
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
OpenMP Programming Model:
Fork-Join Parallelism:
• The master thread spawns a team of threads as needed.
• Parallelism is added incrementally until performance goals are met: i.e., the sequential program evolves into a parallel program.
[Figure: the master thread runs the sequential parts; at each parallel region it forks a team of threads (a nested parallel region is shown in red) that joins back at the end of the region.]
Thread Creation: Parallel Regions
• Each thread executes the same code redundantly.

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
   int ID = omp_get_thread_num();
   pooh(ID, A);
}
printf("all done\n");

• A single copy of A is shared between all threads.
• Each of the four threads calls pooh(ID, A) with its own ID (0 through 3).
• Threads wait at the end of the parallel region for all threads to finish before proceeding (i.e., a barrier); only then does the master thread execute the printf.

* The name “OpenMP” is the property of the OpenMP Architecture Review Board
Exercises 2 to 4:
Numerical Integration
Mathematically, we know that:

   ∫_0^1 4.0/(1+x²) dx = π

We can approximate the integral as a sum of N rectangles:

   Σ_{i=0}^{N} F(x_i) Δx ≈ π

where each rectangle has width Δx and height F(x_i) at the middle of interval i.
[Figure: plot of F(x) = 4.0/(1+x²) on [0, 1].]
Exercises 2 to 4: Serial PI Program
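A minimal serial sketch of the midpoint-rule pi calculation, using the same variable names as the parallel versions later in the deck:

static long num_steps = 100000;
double step;

void main ()
{
   int i; double x, pi, sum = 0.0;
   step = 1.0/(double) num_steps;
   for (i=0; i<num_steps; i++){
      x = (i+0.5)*step;             /* midpoint of interval i */
      sum = sum + 4.0/(1.0+x*x);    /* accumulate F(x) */
   }
   pi = step * sum;                 /* multiply by the common rectangle width */
}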
Synchronization: critical
• Mutual exclusion: only one thread at a time can enter a critical region.

float res;
#pragma omp parallel
{
   float B; int i, id, nthrds;
   id = omp_get_thread_num();
   nthrds = omp_get_num_threads();
   for(i=id; i<niters; i+=nthrds){
      B = big_job(i);
#pragma omp critical
      consume (B, res);
   }
}

Threads wait their turn – only one at a time calls consume().
Synchronization: Atomic
• Atomic provides mutual exclusion but only applies to the update of a memory location (the update of X in the following example).

#pragma omp parallel
{
   double tmp, B;
   B = DOIT();
   tmp = big_ugly(B);
#pragma omp atomic
   X += tmp;          // atomic only protects the read/update of X
}
Exercise 3
• In exercise 2, you probably used an array to create space for each thread to store its partial sum.
• If array elements happen to share a cache line, this leads to false sharing.
  – Non-shared data land in the same cache line, so each update invalidates the whole line … in essence “sloshing independent data” back and forth between threads.
• Modify your “pi program” from exercise 2 to avoid false sharing due to the sum array.
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
SPMD vs. worksharing
• A parallel construct by itself creates an SPMD or “Single Program Multiple Data” program … i.e., each thread redundantly executes the same code.
• How do you split up pathways through the code between threads within a team? This is called worksharing:
  – Loop construct
  – Sections/section constructs (discussed later)
  – Single construct (discussed later)
  – Task construct … coming in OpenMP 3.0
The loop worksharing constructs
• The loop worksharing construct splits up loop iterations among the threads in a team.

#pragma omp parallel
{
#pragma omp for
   for (i=0; i<N; i++){
      NEAT_STUFF(i);
   }
}

The loop construct name is “for” in C/C++ and “do” in Fortran.

• The same vector-add loop, written as an OpenMP parallel region (SPMD style):

#pragma omp parallel
{
   int id, i, Nthrds, istart, iend;
   id = omp_get_thread_num();
   Nthrds = omp_get_num_threads();
   istart = id * N / Nthrds;
   iend = (id+1) * N / Nthrds;
   if (id == Nthrds-1) iend = N;
   for(i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
}

• …and as an OpenMP parallel region with a worksharing for construct:

#pragma omp parallel
#pragma omp for
   for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }
Combined parallel/worksharing construct
• OpenMP provides a shortcut that puts the “parallel” and worksharing directives on one line; the two forms in the sketch below are equivalent.
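A sketch of the equivalence, reusing the vector-add loop from the previous slide:

#pragma omp parallel
{
#pragma omp for
   for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }
}

/* …is equivalent to… */

#pragma omp parallel for
for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }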
Working with loops
• Basic approach:
  – Find compute-intensive loops.
  – Make the loop iterations independent … so they can safely execute in any order without loop-carried dependencies.
  – Place the appropriate OpenMP directive and test.
Reduction
• OpenMP reduction clause:
      reduction (op : list)
• Inside a parallel or a work-sharing construct:
  – A local copy of each list variable is made and initialized depending on the “op” (e.g., 0 for “+”).
  – The compiler finds standard reduction expressions containing “op” and uses them to update the local copy.
  – Local copies are reduced into a single value and combined with the original global value.
• The variables in “list” must be shared in the enclosing parallel region.

double ave=0.0, A[MAX]; int i;
#pragma omp parallel for reduction (+:ave)
for (i=0; i<MAX; i++) {
   ave += A[i];
}
ave = ave/MAX;
OpenMP: Reduction operands/initial-values
• Many different associative operands can be used with reduction.
• Initial values are the ones that make sense mathematically; for C/C++ the core set is:

   Operator   Initial value
   +          0
   *          1
   -          0
   &          ~0
   |          0
   ^          0
   &&         1
   ||         0
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
Synchronization: Barrier
• Barrier: each thread waits until all threads arrive.

#pragma omp parallel shared (A, B, C) private(id)
{
   id = omp_get_thread_num();
   A[id] = big_calc1(id);
#pragma omp barrier                             // explicit barrier
#pragma omp for
   for(i=0; i<N; i++){ C[i] = big_calc3(i,A); }
                                                // implicit barrier at the end of a
                                                // for worksharing construct
#pragma omp for nowait
   for(i=0; i<N; i++){ B[i] = big_calc2(C, i); }
                                                // no implicit barrier due to nowait
   A[id] = big_calc4(id);
}                                               // implicit barrier at the end of a
                                                // parallel region
Master Construct
• The master construct denotes a structured block that is only executed by the master thread.
• The other threads just skip it; no synchronization is implied (see the sketch below).
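A minimal sketch of the construct; do_many_things(), exchange_boundaries(), and do_many_other_things() are placeholder names:

#pragma omp parallel
{
   do_many_things();
#pragma omp master
   { exchange_boundaries(); }   /* only the master thread runs this */
#pragma omp barrier             /* add an explicit barrier if the others must wait */
   do_many_other_things();
}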
Synchronization: Simple Locks
• Protect a resource with a lock from the runtime library:

omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel private (tmp, id)
{
   id = omp_get_thread_num();
   tmp = do_lots_of_work(id);
   omp_set_lock(&lck);          // wait here for your turn
   printf("%d %d", id, tmp);
   omp_unset_lock(&lck);        // release the lock so the next thread gets a turn
}
omp_destroy_lock(&lck);         // free up storage when done
Runtime Library routines
• Runtime environment routines:
  – Modify/check the number of threads:
      omp_set_num_threads(), omp_get_num_threads(),
      omp_get_thread_num(), omp_get_max_threads()
  – Are we in an active parallel region?
      omp_in_parallel()
  – Do you want the system to dynamically vary the number of threads from one parallel construct to another?
      omp_set_dynamic(), omp_get_dynamic()
  – How many processors in the system?
      omp_get_num_procs()
Runtime Library routines
• To use a known, fixed number of threads in a program: (1) tell the system that you don’t want dynamic adjustment of the number of threads, (2) set the number of threads, then (3) save the number you got.

#include <omp.h>
void main()
{  int num_threads;
   omp_set_dynamic( 0 );          // disable dynamic adjustment of the number of threads
   omp_set_num_threads( omp_get_num_procs() );  // request as many threads as processors
#pragma omp parallel
   {  int id = omp_get_thread_num();
#pragma omp single                // protect this op since memory stores are not atomic
      num_threads = omp_get_num_threads();
      do_lots_of_stuff(id);
   }
}

Even in this case, the system may give you fewer threads than requested. If the precise number of threads matters, test for it and respond accordingly.
Environment Variables
• Set the default number of threads to use:
  – OMP_NUM_THREADS int_literal
• Control how “omp for schedule(RUNTIME)” loop iterations are scheduled:
  – OMP_SCHEDULE “schedule[, chunk_size]”
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
Data environment:
Default storage attributes
• Shared memory programming model:
  – Most variables are shared by default (see the sketch below).
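A sketch illustrating the defaults; the next slide’s A, index, count, and temp refer to declarations like these (work() is a placeholder):

#include <stdio.h>

double A[10];                 /* file scope: shared */

void work(int *index) {
   double temp[10];           /* stack variable in a called function: private */
   static int count;          /* static: shared */
   /* ... */
}

int main() {
   int index[10];             /* declared before the parallel region: shared */
#pragma omp parallel
   work(index);
   printf("%d\n", index[0]);
}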
Data sharing: Examples
• A, index, and count are shared by all threads.
• temp is local to each thread.
[Figure: each thread gets its own temp; a single A, index, and count are shared across the team.]
Data sharing:
Changing storage attributes
• One can selectively change storage attributes for constructs using the following clauses*:
  – SHARED
  – PRIVATE
  – FIRSTPRIVATE
• The final value of a private inside a parallel loop can be transmitted to the shared variable outside the loop with:
  – LASTPRIVATE
• The default attributes can be overridden with:
  – DEFAULT (PRIVATE | SHARED | NONE)
    DEFAULT(PRIVATE) is Fortran only.

All the clauses on this page apply to the OpenMP construct, NOT to the entire region.
* All data clauses apply to parallel constructs and worksharing constructs except “shared”, which only applies to parallel constructs.
Data Sharing: Private Clause
• private(var) creates a new local copy of var for each thread.
  – The value is uninitialized.
  – In OpenMP 2.5 the value of the shared variable is undefined after the region.

void wrong() {
   int tmp = 0;
#pragma omp for private(tmp)
   for (int j = 0; j < 1000; ++j)
      tmp += j;            // tmp was not initialized
   printf("%d\n", tmp);    // tmp: 0 in 3.0, unspecified in 2.5
}
Data Sharing: Private Clause
When is the original variable valid?
• The original variable’s value is unspecified in OpenMP 2.5.
• In OpenMP 3.0, if it is referenced outside of the construct:
  – Implementations may reference the original variable or a copy … a dangerous programming practice!

int tmp;
void danger() {
   tmp = 0;
#pragma omp parallel private(tmp)
   work();
   printf("%d\n", tmp);   // tmp has unspecified value
}

extern int tmp;
void work() {
   tmp = 5;               // unspecified which copy of tmp this updates
}
Data Sharing: Firstprivate Clause
• Firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

void useless() {
   int tmp = 0;
#pragma omp for firstprivate(tmp)
   for (int j = 0; j < 1000; ++j)
      tmp += j;            // each thread gets its own tmp with an initial value of 0
   printf("%d\n", tmp);    // tmp: 0 in 3.0, unspecified in 2.5
}
Data sharing: Lastprivate Clause
• Lastprivate passes the value of a private from the last iteration to a global variable.

void closer() {
   int tmp = 0;
#pragma omp parallel for firstprivate(tmp) \
        lastprivate(tmp)
   for (int j = 0; j < 1000; ++j)
      tmp += j;            // each thread gets its own tmp with an initial value of 0
   printf("%d\n", tmp);    // tmp is defined as its value at the “last sequential”
                           // iteration (i.e., for j=999)
}
Data Sharing:
A data environment test
• Consider this example of PRIVATE and FIRSTPRIVATE.
• Are the variables local to each thread or shared inside the parallel region?
• What are their initial values inside, and values after, the parallel region?

These two code fragments are equivalent:

      itotal = 1000
C$OMP PARALLEL PRIVATE(np, each)
      np = omp_get_num_threads()
      each = itotal/np
      ………
C$OMP END PARALLEL

      itotal = 1000
C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(itotal)
      np = omp_get_num_threads()
      each = itotal/np
      ………
C$OMP END PARALLEL
A threadprivate example (C)
Use threadprivate to create a counter for each thread.

int counter = 0;
#pragma omp threadprivate(counter)

int increment_counter()
{
   counter++;
   return (counter);
}
Data Copying: Copyin
You initialize threadprivate data using a copyin clause.

      parameter (N=1000)
      common/buf/A(N)
!$OMP THREADPRIVATE(/buf/)

C     Initialize the A array
      call init_data(N,A)

!$OMP PARALLEL COPYIN(A)

      … Now each thread sees the threadprivate array A initialized
      … to the global value set in the subroutine init_data()

!$OMP END PARALLEL

      end
Data Copying: Copyprivate
Used with a single region to broadcast values of privates from one member of a team to the rest of the team.

#include <omp.h>
void input_parameters (int*, int*);  // fetch values of input parameters (by reference)
void do_work(int, int);

void main()
{
   int Nsize, choice;

#pragma omp parallel private (Nsize, choice)
   {
#pragma omp single copyprivate (Nsize, choice)
      input_parameters (&Nsize, &choice);

      do_work(Nsize, choice);
   }
}
Exercise 5: Monte Carlo Calculations
Using random numbers to solve tough problems
• Sample a problem domain to estimate areas, compute probabilities, find optimal values, etc.
• Example: computing π with a digital dart board: throw random darts at a square that inscribes a circle; the fraction that lands inside the circle estimates π/4.
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
Sections worksharing Construct
• The Sections worksharing construct gives a different structured block to each thread.

#pragma omp parallel
{
#pragma omp sections
   {
#pragma omp section
      X_calculation();
#pragma omp section
      y_calculation();
#pragma omp section
      z_calculation();
   }
}
Exercise 6: hard
• Consider the program linked.c
  – It traverses a linked list, computing a sequence of Fibonacci numbers at each node.
• Parallelize this program using constructs defined in OpenMP 2.5 (loop worksharing constructs).
• Once you have a correct program, optimize it.
Exercise 6: easy
• Parallelize the matrix multiplication program in the file matmul.c
• Can you optimize the program by playing with how the loops are scheduled?
Outline
• Introduction to OpenMP
• Creating Threads
• Synchronization
• Parallel Loops
• Synchronize single masters and stuff
• Data environment
• Schedule your for and sections
• Memory model
• OpenMP 3.0 and Tasks
OpenMP memory model
• OpenMP supports a shared memory model.
• All threads share an address space, but it can get complicated:
[Figure: source code is compiled into a program whose reads and writes (Wa, Wb, Ra, Rb) from threads a and b target a shared memory; each thread also has its own temporary view, and writes reach shared memory in a commit order.]
Consistency: Memory Access Re-ordering
• Re-ordering:
  – The compiler re-orders program order to the code order.
  – The machine re-orders code order to the memory commit order.
• At a given point in time, the temporary view of memory may vary from shared memory.
• Consistency models are based on orderings of Reads (R), Writes (W), and Synchronizations (S):
  – R→R, W→W, R→W, R→S, S→S, W→S
Consistency
• Sequential Consistency:
  – In a multi-processor, ops (R, W, S) are sequentially consistent if:
     – They remain in program order for each processor.
     – They are seen to be in the same overall order by each of the other processors.
  – Program order = code order = commit order
• Relaxed consistency:
  – Remove some of the ordering constraints for memory ops (R, W, S).
OpenMP and Relaxed Consistency
• OpenMP uses a relaxed-consistency memory model: the programmer enforces ordering where it matters through synchronization operations (flushes), described on the next slide.
Flush
• Defines a sequence point at which a thread is guaranteed to see a consistent view of memory with respect to the “flush set”.
• The flush set is:
  – “all thread-visible variables” for a flush construct without an argument list.
  – a list of variables when the “flush(list)” construct is used.

double A;
A = compute();
#pragma omp flush(A)   // flush to memory to make sure other
                       // threads can pick up the right value

Note: OpenMP’s flush is analogous to a fence in other shared memory APIs.
Exercise 7: producer consumer
• Parallelize the “prod_cons.c” program.
• This is a well-known pattern called the producer-consumer pattern:
  – One thread produces values that another thread consumes.
  – Often used with a stream of produced values to implement “pipeline parallelism”.
• The key is to implement pairwise synchronization between threads.
Exercise 7: prod_cons.c

int main()
{
   double *A, sum, runtime; int flag = 0;
   A = (double *)malloc(N*sizeof(double));
   runtime = omp_get_wtime();
   …
}

I need to put the prod/cons pair inside a loop so it’s true pipeline parallelism.
[Figure: the birth of OpenMP, 1997 — SGI, HP, IBM, Cray, and Intel had merged products and needed commonality; ISVs such as KAI needed a larger market; ASCI was tired of recoding for SMPs and urged vendors to standardize; a rough-draft straw-man SMP API was written and other vendors were invited to join.]
OpenMP Release History
• 1998: OpenMP C/C++ 1.0
• 2002: OpenMP C/C++ 2.0
• 2005: OpenMP 2.5 — a single specification for Fortran, C, and C++
• 2008: OpenMP 3.0
3.0
Tasks
• Adding tasking is the biggest addition for 3.0.
3.0
Definitions
• Task construct – a task directive plus a structured block.
• Task – the package of code and instructions for allocating data created when a thread encounters a task construct.
• Task region – the dynamic sequence of instructions produced by the execution of a task by a thread.
3.0
task Construct

#pragma omp task [clause[[,]clause] ...]
    structured-block

where clause can be one of:
   if (expression)
   untied
   shared (list)
   private (list)
   firstprivate (list)
   default( shared | none )

See the sketch below for the construct in use.
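A minimal usage sketch — a hypothetical list traversal where each node is processed as an explicit task (node, head, and process() are assumed names):

#pragma omp parallel
{
#pragma omp single                    /* one thread creates all the tasks */
   {
      node *p = head;
      while (p) {
#pragma omp task firstprivate(p)      /* capture the current pointer value */
         process(p);
         p = p->next;                 /* the single thread keeps walking the list */
      }
   }
}                                     /* implicit barrier: all tasks complete here */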
3.0
The if clause
• When the if clause argument is false:
  – The task is executed immediately by the encountering thread.
  – The data environment is still local to the new task...
  – ...and it’s still a different task with respect to synchronization.
3.0
Task synchronization
• At task barriers:
  – #pragma omp taskwait
  – Wait until all tasks defined in the current task have completed.
  – Note: applies only to tasks generated in the current task, not to “descendants” (see the sketch below).
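A minimal sketch; foo() and bar() are placeholder names:

#pragma omp parallel
{
#pragma omp single
   {
#pragma omp task
      foo();                  /* child task 1 */
#pragma omp task
      bar();                  /* child task 2 */
#pragma omp taskwait          /* wait for foo() and bar() to complete */
      printf("both tasks done\n");
   }
}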
3.0
Task switching
• Certain constructs have task scheduling points at defined locations within them.
• When a thread encounters a task scheduling point, it is allowed to suspend the current task and execute another (called task switching).
• It can then return to the original task and resume.
3.0
Thread switching

#pragma omp single
{
#pragma omp task untied
   for (i=0; i<ONEZILLION; i++)
#pragma omp task
      process(item[i]);
}
3.0
Conclusions on tasks
• Enormous amount of work by many people.
3.0
Nested parallelism
• Better support for nested parallelism.
• Per-thread internal control variables:
  – Allows, for example, calling omp_set_num_threads() inside a parallel region.
  – Controls the team sizes for the next level of parallelism (see the sketch below).
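A minimal sketch of the 3.0 behavior; do_work() is a placeholder:

#include <omp.h>

void nested_example(void) {
   omp_set_nested(1);          /* enable nested parallel regions */
   omp_set_num_threads(2);
#pragma omp parallel           /* outer team of 2 */
   {
      /* In 3.0 this ICV is per-thread, so the call is legal here and
         affects only the calling thread's next parallel region. */
      omp_set_num_threads(3);
#pragma omp parallel           /* each outer thread forks an inner team of 3 */
      do_work(omp_get_thread_num());
   }
}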
Parallel loops
• Guarantee that this works … i.e., that the same schedule is used in the two loops:

!$omp do schedule(static)
      do i=1,n
         a(i) = ....
      end do
!$omp end do nowait
!$omp do schedule(static)
      do i=1,n
         .... = a(i)
      end do
3.0
Loops (cont.)
• Allow collapsing of perfectly nested loops (see the sketch below).
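A minimal sketch of the 3.0 collapse clause; the two loop nests are merged into a single iteration space that is then scheduled (foo() is a placeholder):

#pragma omp parallel for collapse(2)
for (int i = 0; i < N; i++) {
   for (int j = 0; j < M; j++) {
      foo(i, j);   /* N*M iterations are divided among the threads */
   }
}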
Exercise 8: tasks in OpenMP
• Consider the program linked.c
  – It traverses a linked list, computing a sequence of Fibonacci numbers at each node.
• Parallelize this program using tasks.
• Compare your solution’s complexity to the approach without tasks.
Conclusion
• OpenMP 3.0 is a major upgrade … it expands the range of algorithms accessible from OpenMP.
• OpenMP is fun and about “as easy as we can make it” for applications programmers working with shared memory platforms.
OpenMP Organizations
• OpenMP Architecture Review Board URL, the “owner” of the OpenMP specification:
      www.openmp.org
• OpenMP User’s Group (cOMPunity) URL:
      www.compunity.org

Get involved, join cOMPunity, and help define the future of OpenMP.
Books about OpenMP
OpenMP Papers (continued)
• B. Chapman, F. Bregier, A. Patil, and A. Prabhakar, “Achieving performance under OpenMP on ccNUMA and software distributed shared memory systems,” Concurrency and Computation: Practice and Experience, 14(8-9): 713-739, 2002.
• J. M. Bull and M. E. Kambites, “JOMP: an OpenMP-like interface for Java,” Proceedings of the ACM 2000 Conference on Java Grande, 2000, pp. 44-53.
• L. Adhianto and B. Chapman, “Performance modeling of communication and computation in hybrid MPI and OpenMP applications,” Simulation Modeling Practice and Theory, vol. 15, pp. 481-491, 2007.
• S. Shah, G. Haab, P. Petersen, and J. Throop, “Flexible control structures for parallelism in OpenMP,” Concurrency: Practice and Experience, 12: 1219-1239, 2000. John Wiley & Sons, Ltd.
• T. G. Mattson, “How good is OpenMP?” Scientific Programming, vol. 11, no. 2, pp. 81-93, 2003.
• A. Duran, R. Silvera, J. Corbalan, and J. Labarta, “Runtime Adjustment of Parallel Nested Loops,” Shared Memory Parallel Programming with OpenMP, Lecture Notes in Computer Science, vol. 3349, p. 137, 2005.
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
Exercise 1: Solution
A multi-threaded “Hello world” program
• Write a multithreaded program where each thread prints “hello world”.

#include "omp.h"                     // OpenMP include file
void main()
{
#pragma omp parallel                 // parallel region with default number of threads
   {
      int ID = omp_get_thread_num(); // runtime library function to return a thread ID
      printf("hello(%d) ", ID);
      printf("world(%d)\n", ID);
   }                                 // end of the parallel region
}

Sample output (the interleaving varies from run to run):
   hello(1) hello(0) world(1)
   world(0)
   hello(3) hello(2) world(3)
   world(2)
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
The SPMD pattern
• The most common approach for parallel algorithms is the SPMD or Single Program Multiple Data pattern.
• Each thread runs the same program (Single Program), but using the thread ID, they operate on different data (Multiple Data) or take slightly different paths through the code.
• In OpenMP this means:
  – A parallel region “near the top of the code”.
  – Pick up the thread ID and the number of threads.
  – Use them to split up loops and select different blocks of data to work on.
Exercise 2: A simple SPMD pi program

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{  int i, nthreads; double pi, sum[NUM_THREADS];
   step = 1.0/(double) num_steps;
   omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
   {
      int i, id, nthrds;
      double x;
      id = omp_get_thread_num();
      nthrds = omp_get_num_threads();
      if (id == 0) nthreads = nthrds;
      for (i=id, sum[id]=0.0; i<num_steps; i=i+nthrds) {
         x = (i+0.5)*step;
         sum[id] += 4.0/(1.0+x*x);
      }
   }
   for(i=0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
}

• Promote the scalar sum to an array dimensioned by the number of threads to avoid a race condition.
• Only one thread should copy the number of threads to the global value, to make sure multiple threads writing to the same address don’t conflict.
• i=i+nthrds is a common trick in SPMD programs to create a cyclic distribution of loop iterations.
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
False sharing
• If independent data elements happen to sit on the same cache line, each update will cause the cache lines to “slosh back and forth” between threads.
  – This is called “false sharing”.
• If you promote scalars to an array to support creation of an SPMD program, the array elements are contiguous in memory and hence share cache lines.
  – Result … poor scalability.
• Solution:
  – When updates to an item are frequent, work with local copies of data instead of an array indexed by the thread ID.
  – Pad arrays so elements you use are on distinct cache lines.
Exercise 3: SPMD Pi without false sharing

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{  double pi = 0.0;
   step = 1.0/(double) num_steps;
   omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
   {
      int i, id, nthrds; double x, sum;
      id = omp_get_thread_num();
      nthrds = omp_get_num_threads();
      for (i=id, sum=0.0; i<num_steps; i=i+nthrds){
         x = (i+0.5)*step;
         sum += 4.0/(1.0+x*x);
      }
#pragma omp critical
      pi += sum * step;
   }
}

• Create a scalar local to each thread to accumulate partial sums: no array, so no false sharing.
• sum goes “out of scope” beyond the parallel region … so you must sum it in here, and you must protect the summation into pi with a critical region so updates don’t conflict.
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
Exercise 4: solution

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{  int i; double x, pi, sum = 0.0;
   step = 1.0/(double) num_steps;
   omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private(x) reduction(+:sum)
   for (i=0; i<num_steps; i++){
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
   }
   pi = step * sum;
}

• For good OpenMP implementations, reduction is more scalable than critical.
• i is private by default.
• Note: we created a parallel program without changing any code, simply by adding four simple lines!
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
Computers and random numbers
• We use “dice” to make random numbers:
  – Given previous values, you cannot predict the next value.
  – There are no patterns in the series … and it goes on forever.
Monte Carlo Calculations:
Using random numbers to solve tough problems
• Sample a problem domain to estimate areas, compute probabilities, find optimal values, etc.
• Example: computing π with a digital dart board, as in the sketch below: darts land uniformly in a square that inscribes a circle of radius r, and the fraction inside the circle estimates π/4, so pi = 4.0 * ((double)Ncirc/(double)num_trials).
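A minimal serial sketch of the dart-board program; seed() and random() are assumed helpers that configure and draw from a uniform generator on [-r, r] (only the final pi and printf lines come from the slide itself):

#include <stdio.h>

void seed(double lo, double hi);   /* assumed helper */
double random(void);               /* assumed helper */

static long num_trials = 10000;

int main ()
{
   long i, Ncirc = 0;
   double pi, x, y;
   double r = 1.0;                     /* circle radius; the square has side 2*r */

   seed(-r, r);                        /* configure the generator */
   for (i = 0; i < num_trials; i++) {
      x = random();  y = random();     /* one random dart */
      if (x*x + y*y <= r*r) Ncirc++;   /* did it land inside the circle? */
   }
   pi = 4.0 * ((double)Ncirc/(double)num_trials);
   printf("\n %ld trials, pi is %f \n", num_trials, pi);
   return 0;
}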
Linear Congruential Generator (LCG)
• LCG: easy to write, cheap to compute, portable, OK quality.
• The recurrence: random_next = (MULTIPLIER * random_last + ADDEND) % PMOD.
LCG code

static long MULTIPLIER  = 1366;
static long ADDEND      = 150889;
static long PMOD        = 714025;
long random_last = 0;      // seed the pseudo random sequence by setting random_last

double random ()
{
   long random_next;
   random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
   random_last = random_next;
   return ((double)random_next/(double)PMOD);
}
Running the PI_MC program with LCG generator
[Figure: log10 relative error vs. log10 number of samples for the LCG with one thread and with 4 threads (three separate trials); the 4-thread runs diverge from each other and from the one-thread result.]
• Run the same program the same way and get different answers! That is not acceptable!
• Issue: my LCG generator is not threadsafe.

Program written using the Intel C/C++ compiler (10.0.659.2005) in Microsoft Visual Studio 2005 (8.0.50727.42) and running on a dual-core laptop (Intel T2400 @ 1.83 GHz with 2 GB RAM) running Microsoft Windows XP.
LCG code: threadsafe version

static long MULTIPLIER  = 1366;
static long ADDEND      = 150889;
static long PMOD        = 714025;
long random_last = 0;
#pragma omp threadprivate(random_last)

double random ()
{
   long random_next;
   random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
   random_last = random_next;
   return ((double)random_next/(double)PMOD);
}

random_last carries state between random number computations. To make the generator threadsafe, make random_last threadprivate so each thread has its own copy.
Thread safe random number generators
[Figure: log10 relative error vs. log10 number of samples for the LCG with one thread, with 4 threads (three trials), and with the thread-safe 4-thread version.]
The thread-safe version gives the same answer each time you run the program. But for a large number of samples, its quality is lower than the one-thread result! Why?
Pseudo Random Sequences
• Random number generators (RNGs) define a sequence of pseudo-random numbers of length equal to the period of the RNG.
[Figure: three threads seeded arbitrarily grab overlapping subsequences of the RNG’s period.]
• Using MKL’s vector statistics library (VSL), each thread can own an independent stream:

VSLStreamStatePtr stream;
#pragma omp threadprivate(stream)
…
vslNewStream(&ran_stream, VSL_BRNG_WH+Thrd_ID, (int)seed);
Independent Generator for each thread
[Figure: log10 relative error vs. log10 number of samples with one independent Wichmann-Hill stream per thread; once you get beyond the high-error, small-sample range, the curves for different thread counts coincide.]
Leap Frog method
• Interleave samples in the sequence of pseudo random numbers:
  – Thread i starts at the ith number in the sequence.
  – Stride through the sequence with stride length = number of threads.
• Result … the same sequence of values regardless of the number of threads (see the sketch below).

We used the MKL library with two generator streams per computation: one for the x values (WH) and one for the y values (WH+1), and the leapfrog method to deal out iterations among threads.
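A minimal sketch of leapfrogging applied to the earlier LCG — a hypothetical illustration, not the MKL code. Each thread jumps its state ahead to position id and then strides by the number of threads:

#include <omp.h>

static long MULTIPLIER = 1366, ADDEND = 150889, PMOD = 714025;
static long random_last = 0;
#pragma omp threadprivate(random_last)

static void lcg_step() {                /* one step of the LCG recurrence */
   random_last = (MULTIPLIER * random_last + ADDEND) % PMOD;
}

void leapfrog_init(long seed) {         /* call inside the parallel region */
   int i, id = omp_get_thread_num();
   random_last = seed;
   for (i = 0; i < id; i++) lcg_step(); /* thread i starts at the ith number */
}

double leapfrog_random() {
   int i, nthrds = omp_get_num_threads();
   double v = (double)random_last / (double)PMOD;
   for (i = 0; i < nthrds; i++) lcg_step();  /* stride = number of threads */
   return v;                            /* same sequence for any thread count */
}

Stepping one value at a time makes each draw O(number of threads); production leapfrog implementations instead compose the recurrence algebraically to jump k steps at once.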
Appendix: Solutions to exercises
• Exercise 1: hello world
• Exercise 2: Simple SPMD Pi program
• Exercise 3: SPMD Pi without false sharing
• Exercise 4: Loop level Pi
• Exercise 5: Monte Carlo Pi and random numbers
• Exercise 6: hard, linked lists without tasks
• Exercise 6: easy, matrix multiplication
• Exercise 7: Producer-consumer
• Exercise 8: linked lists with tasks
Linked lists without tasks
• See the file Linked_omp25.c

while (p != NULL) {            // count the number of items in the linked list
   p = p->next;
   count++;
}
p = head;
for(i=0; i<count; i++) {       // copy a pointer to each node into an array
   parr[i] = p;
   p = p->next;
}
#pragma omp parallel
{
#pragma omp for schedule(static,1)
   for(i=0; i<count; i++)      // process nodes in parallel with a for loop
      processwork(parr[i]);
}

                  Default schedule   schedule(static,1)
   One thread     48 seconds         45 seconds
   Two threads    39 seconds         28 seconds

Results on an Intel dual-core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2.
Linked lists without tasks: C++ STL
• See the file Linked_cpp.cpp
Matrix multiplication

#pragma omp parallel for private(tmp, i, j, k)
for (i=0; i<Ndim; i++){
   for (j=0; j<Mdim; j++){
      tmp = 0.0;
      for(k=0; k<Pdim; k++){
         /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
         tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
      }
      *(C+(i*Ndim+j)) = tmp;
   }
}
Pairwise synchronization in OpenMP
• OpenMP lacks high-level synchronization between individual thread pairs; you build it yourself from a shared flag variable and flushes, as sketched after the code fragment below.
Exercise 7: producer consumer

int main()
{
   double *A, sum, runtime; int numthreads, flag = 0;
   A = (double *)malloc(N*sizeof(double));
   …
}
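A sketch of the flag-based pairwise synchronization the exercise is after, continuing from the fragment above; fill_rand() and Sum_array() are assumed helper names for the producer and consumer work:

#pragma omp parallel sections
{
#pragma omp section                 /* producer */
   {
      fill_rand(N, A);              /* produce the data */
#pragma omp flush                   /* make A visible before signaling */
      flag = 1;
#pragma omp flush (flag)            /* publish the flag */
   }
#pragma omp section                 /* consumer */
   {
      while (1) {                   /* spin until the producer signals */
#pragma omp flush (flag)
         if (flag == 1) break;
      }
#pragma omp flush                   /* make sure we read the produced A */
      sum = Sum_array(N, A);
   }
}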
Linked lists with tasks (intel taskq)
• See the file Linked_intel_taskq.c