OPENMP
OPENMP: Motivation
A sequential program uses a single core/processor while all other processors are idle.
Fork-join model:
- The master thread executes sequentially until the first parallel region is encountered.
- Fork: a team of threads executes the structured block of code; threads are numbered from 0 to N-1 (thread 0 is the master thread).
- Join: the team of threads complete the statements in the parallel region, synchronize and terminate. There is an implicit barrier at the end of a parallel section.
- Parallelism is added incrementally until performance goals are met.
OPENMP: Basic functions
Each thread has its own stack, so it will have its own private (local) variables.
Each thread gets its own rank - omp_get_thread_num.
The number of threads in the team - omp_get_num_threads.
In OpenMP, stdout is shared among the threads, so each thread can execute the printf statement. There is no scheduling of access to stdout, so the output is non-deterministic.
OPENMP: Run Time Functions
Create a 4-thread parallel region:
Statements in the program that are enclosed by the parallel
region construct are executed in parallel among the various
team threads.
Data
Sharing/Scope
Matrix Vector Multiplication
Is this reasonable?

#pragma omp parallel shared(A,x,y,SIZE) \
        private(tid,i,j,istart,iend)
{
    tid = omp_get_thread_num();
    int nid = omp_get_num_threads();
    istart = tid*SIZE/nid;
    iend = (tid+1)*SIZE/nid;
Thread 0: Global_result += 2
Thread 1: Global_result += 3
Thread 2: Global_result += 4

Mutual Exclusion: only one thread at a time executes the statement.
Pretty much sequential.
Handling Race Conditions
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for shared(sum) private(x)
for (i = 0; i < num_steps; i++) {
    x = (i + 0.5) * step;
    #pragma omp critical
    sum = sum + 4.0 / (1.0 + x*x);
}

Mutual Exclusion: only one thread at a time executes the statement
sum = sum + 4.0 / (1.0 + x*x);
sum = 0;
omp_set_num_threads(8);
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 16; i++)
{
    sum += a[i];
}

Thread 0 => iterations 0 & 1
Thread 1 => iterations 2 & 3
...
Each thread accumulates into its own thread-local (private) copy of sum.
One or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region.

#pragma omp for reduction(operator : var)

Operator: + , * , - , & , | , && , || , ^
Combines the threads' local copies of var into a single copy at the master.
Computing π by the method of Numerical Integration

Sequential version:

static long num_steps = 100000;
double step;
void main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
}

OpenMP version:

#include <omp.h>
#define NUM_THREADS 4
static long num_steps = 100000;
double step;
void main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
}