This document provides an overview of OpenMP, a programming model for parallel programming on shared memory architectures. It discusses key OpenMP concepts like parallel constructs, work sharing directives, synchronization constructs, and memory models. Examples are provided for parallelizing a matrix multiplication using OpenMP pragmas and scheduling clauses. Overall, the document introduces the basic syntax and execution model of OpenMP for parallelizing loops and distributing work across threads.


CSL 860: Modern Parallel Computation

Hello OpenMP
#pragma omp parallel                  // parallel construct
{
    // I am now thread i of n
    switch (omp_get_thread_num()) {
    case 0: /* blah1.. */ break;
    case 1: /* blah2.. */ break;
    }
}
// Back to normal (serial) execution

• Extremely simple to use and incredibly powerful


• Fork-Join model
• Every thread has its own execution context
• Variables can be declared shared or private
Execution Model
• Encountering thread creates a team:
– Itself (master) + zero or more additional threads.
• Applies to structured block immediately following
– Each thread executes a copy of the code in {}
• But, also see: Work-sharing constructs
• There’s an implicit barrier at the end of block
• Only master continues beyond the barrier
• May be nested
– Sometimes disabled by default
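A minimal sketch of this fork-join behaviour (the team size of 4 and the messages are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("serial part: %d thread\n", omp_get_num_threads());   // prints 1

    #pragma omp parallel num_threads(4)    // encountering thread forks a team
    {
        // every member of the team executes a copy of this block
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                      // implicit barrier at the end

    // only the master continues beyond the barrier
    printf("back to serial execution\n");
    return 0;
}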
Memory Model
• Notion of temporary view of memory
– Allows local caching
– Need to flush memory
– T1 writes -> T1 flushes -> T2 flushes -> T2 reads
– All threads see flushes in the same order
• Supports threadprivate memory
• Variables declared before parallel construct:
– Shared by default
– May be designated as private
– n-1 copies of the original variable are created
• May not be initialized by the system
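A hedged sketch of the threadprivate memory mentioned above (the variable name and values are illustrative; persistence between regions assumes dynamic threads are disabled and the team size is unchanged):

#include <stdio.h>
#include <omp.h>

int counter = 0;                      // one persistent copy per thread
#pragma omp threadprivate(counter)

int main(void) {
    omp_set_dynamic(0);               // keep the team size stable

    #pragma omp parallel
    counter = omp_get_thread_num();   // each thread writes its own copy

    #pragma omp parallel
    printf("thread %d still sees counter = %d\n",
           omp_get_thread_num(), counter);
    return 0;
}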
Shared Variables
• Heap allocated storage
• Static data members
• const-qualified (no mutable members)
• Private:
– Variables declared in a scope inside the construct
– Loop variable in for construct
• private to the construct
• Others are shared unless declared private
– You can change default
• Arguments passed by reference inherit from original
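As a small sketch of changing the default (the scale routine is hypothetical): default(none) forces the sharing of every referenced variable to be stated explicitly.

void scale(double *a, int n, double factor) {
    // a, n, factor must be listed; i is the loop variable, private by rule
    #pragma omp parallel for default(none) shared(a, n, factor)
    for (int i = 0; i < n; i++)
        a[i] *= factor;
}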
Beware of Compiler Re-ordering

a = b = 0

thread 1                      thread 2
b = 1;                        a = 1;
flush(b); flush(a);           flush(a); flush(b);
if (a == 0) {                 if (b == 0) {
    critical section              critical section
}                             }
Beware more of Compiler Re-ordering

// Parallel construct
{
    int b = initialSalary;
    printf("Initial Salary was %d\n", initialSalary);

    Book_keeping();   // No read of b or write of initialSalary

    if (b < 10000) {
        raiseSalary(500);
    }
}
Thread Control
Environment Variable   Ways to modify value    Way to retrieve value   Initial value

OMP_NUM_THREADS *      omp_set_num_threads     omp_get_max_threads     Implementation defined
OMP_DYNAMIC            omp_set_dynamic         omp_get_dynamic         Implementation defined
OMP_NESTED             omp_set_nested          omp_get_nested          false
OMP_SCHEDULE *                                                         Implementation defined
* Also see construct clause: num_threads, schedule
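A hedged sketch of the corresponding runtime calls (requesting 4 threads is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_dynamic(0);        // OMP_DYNAMIC
    omp_set_nested(0);         // OMP_NESTED
    omp_set_num_threads(4);    // OMP_NUM_THREADS

    printf("max threads = %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single
        printf("team size = %d\n", omp_get_num_threads());
    }
    return 0;
}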


Parallel Construct
#pragma omp parallel \
    if(boolean) \
    private(var1, var2, var3) \
    firstprivate(var1, var2, var3) \
    default(shared | none) \
    shared(var1, var2) \
    copyin(var2) \
    reduction(operator: list) \
    num_threads(n)
{
}
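A sketch combining a few of these clauses (the routine, variable names, and the threshold of 1000 are illustrative, not part of any particular program):

#include <omp.h>

void work(double *result, int n, double seed) {
    double localSum = 0.0;
    // Parallelize only for large n; each thread starts from its own
    // initialized copy of seed; private copies of localSum are summed.
    #pragma omp parallel if(n > 1000) num_threads(8) \
            default(shared) firstprivate(seed) reduction(+: localSum)
    {
        localSum += seed + omp_get_thread_num();
    }
    *result = localSum;
}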
Parallel Loop
#pragma omp parallel for
for (i= 0; i < N; ++i) {
blah …
}
• Number of iterations must be known when the
construct is encountered
– Must be the same for each thread
• Compiler puts a barrier at the end of parallel for
– But see nowait
Parallel For
#pragma omp for \
    private(var1, var2, var3) \
    firstprivate(var1, var2, var3) \
    lastprivate(var1, var2) \
    reduction(operator: list) \
    ordered \
    schedule(kind[, chunk_size]) \
    nowait
Canonical For Loop
• No break out of the loop
Schedule(kind[, chunk_size])
• Divide iterations into contiguous sets, chunks
– chunks are assigned transparently to threads
• static: iterations are divided among threads in a round-robin
fashion
– When no chunk_size is specified, approximately equal chunks are
made
• dynamic: iterations are assigned to threads in ‘request order’
– When no chunk_size is specified, it defaults to 1.
• guided: like dynamic, the size of each chunk is proportional to the
number of unassigned iterations divided by the number of threads
– If chunk_size =k, chunks have at least k iterations (except the last)
– When no chunk_size is specified, it defaults to 1.
• runtime: kind and chunk size taken from the OMP_SCHEDULE environment variable
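A short sketch of the schedule clause (expensive(), a, b, and n are placeholders):

// chunks of 4 iterations handed out on demand
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < n; i++)
    a[i] = expensive(i);          // per-iteration cost varies

// kind and chunk size read at run time, e.g. OMP_SCHEDULE="guided,8"
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++)
    b[i] = expensive(i);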
Single
#pragma omp parallel
{
#pragma omp for
for( int i=0; i<N; i++ ) a[i] = f0(i);
#pragma omp single
x = f1(a);
#pragma omp for
for(int i=0; i<N; i++ ) b[i] = x * f2(i);
}
• Only one of the threads executes
• Other threads wait for it
– unless NOWAIT is specified
• Hidden complexity
– Threads may be at different instructions
Sections
#pragma omp sections
{
#pragma omp section
{
// do this …
}
#pragma omp section
{
// do that …
}
// …
}
• The omp section directives must be closely nested in a sections construct,
where no other work-sharing construct may appear.
Private Variables
#pragma omp parallel for private(size, …)
for (int i = 0; i < numThreads; i++) {
    int size = numTasks / numThreads;
    int extra = numTasks - numThreads * size;
    if (i < extra) size++;
    doTask(i, size, numThreads);
}

void doTask(int start, int count, int stride)
{   // Each thread's instance has its own activation record
    for (int i = 0, t = start; i < count; i++, t += stride)
        doit(t);
}
Firstprivate and Lastprivate
• Initial value of private variable is unspecified
– firstprivate initializes copies with the original
– Once per thread (not once per iteration)
– Original exists before the construct
• Only the original copy is retained after the construct
• lastprivate forces sequential-like behavior
– thread executing the sequentially last iteration (or last
listed section) writes to the original copy
Firstprivate and Lastprivate
#pragma omp parallel for firstprivate( simple )
for (int i = 0; i < N; i++) {
    simple += a[f1(i, omp_get_thread_num())];
    f2(simple);
}

#pragma omp parallel for lastprivate( doneEarly )
for (i = 0; i < N && !doneEarly; i++) {
    doneEarly = f0(i);
}
Other Synchronization Directives
#pragma omp master
{
}
– binds to the innermost enclosing parallel region
– Only the master executes
– No implied barrier
Master Directive
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < 100; i++) a[i] = f0(i);

    #pragma omp master    // Only master executes. No synchronization.
    x = f1(a);
}
Critical Section
#pragma omp critical (accessBankBalance)
{
}
– A single thread at a time
– Applies to all threads
– The name is optional; no name implies global
critical region
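A minimal sketch of a named critical section (balance and computeDeposit are illustrative):

double balance = 0.0;                 // shared

#pragma omp parallel
{
    double deposit = computeDeposit(omp_get_thread_num());

    // serializes only against other regions named accessBankBalance
    #pragma omp critical (accessBankBalance)
    balance += deposit;
}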
Barrier Directive
#pragma omp barrier
– Stand-alone
– Binds to inner-most parallel region
– All threads in the team must execute
• they will all wait for each other at this instruction
• Dangerous:
    if (!ready)
        #pragma omp barrier
  – The same sequence of work-sharing and barrier regions must be
    encountered by the entire team
Ordered Directive
#pragma omp ordered
{
}
• Binds to the innermost enclosing loop
• The structured block is executed in sequential iteration order
• The loop must declare the ordered clause
• Each iteration may encounter at most one ordered region
Flush Directive
#pragma omp flush (var1, var2)
– Stand-alone, like barrier
– Only directly affects the encountering thread
– With a variable list, only the listed variables are flushed;
  accesses to them may not be re-ordered across the flush
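A hedged sketch of the write -> flush -> flush -> read pattern from the memory-model slide (the two-thread producer/consumer split and the value 42 are illustrative; busy-waiting is shown only to demonstrate flush):

int data = 0, ready = 0;                  // shared

#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {      // producer
        data = 42;
        #pragma omp flush (data, ready)   // publish data before the flag
        ready = 1;
        #pragma omp flush (ready)
    } else {                              // consumer
        int seen = 0;
        while (!seen) {
            #pragma omp flush (ready)     // re-read the flag
            seen = ready;
        }
        #pragma omp flush (data)          // data is now guaranteed visible
        // ... use data ...
    }
}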
Atomic Directive
#pragma omp atomic
i++;

• Light-weight critical section


• Only for some expressions
  – x binop= expr (no mutual exclusion on the evaluation of expr)
  – x++
  – ++x
  – x--
  – --x
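A small sketch of atomic on a shared counter (test() and n are placeholders):

int hits = 0;                     // shared

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    if (test(i)) {
        #pragma omp atomic
        hits++;                   // mutual exclusion on the update only
    }
}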
Reductions
• Reductions are so common that OpenMP provides
support for them
• May add reduction clause to parallel for
pragma
• Specify reduction operation and reduction variable
• OpenMP takes care of storing partial results in
private variables and combining partial results after
the loop
reduction Clause
• reduction (<op> :<variable>)
– + Sum
– * Product
– & Bitwise and
– | Bitwise or
– ^ Bitwise exclusive or
– && Logical and
– || Logical or
• Add to parallel for
– OpenMP creates a loop to combine copies of the
variable
– The resulting loop may not be parallel
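A sketch of a dot product using the + reduction (arrays a, b and length n are illustrative):

double dot = 0.0;

// each thread accumulates into a private copy of dot;
// OpenMP combines the copies with + after the loop
#pragma omp parallel for reduction(+: dot)
for (int i = 0; i < n; i++)
    dot += a[i] * b[i];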
Nesting Restrictions
• A work-sharing region may not be closely nested inside a
work-sharing, critical, ordered, or master region.
• A barrier region may not be closely nested inside a work-
sharing, critical, ordered, or master region.
• A master region may not be closely nested inside a work-
sharing region.
• An ordered region may not be closely nested inside a
critical region.
• An ordered region must be closely nested inside a loop
region (or parallel loop region) with an ordered clause.
• A critical region may not be nested (closely or otherwise)
inside a critical region with the same name. Note that this
restriction is not sufficient to prevent deadlock
EXAMPLES
OpenMP Matrix Multiply
#pragma omp parallel for
for(int i=0; i<n; i++ )
for( int j=0; j<n; j++ ) {
c[i][j] = 0.0;
for(int k=0; k<n; k++ )
c[i][j] += a[i][k]*b[k][j];
}
• a, b, c are shared
• i, j, k are private
OpenMP Matrix Multiply: Triangular
#pragma omp parallel for schedule (dynamic, 1 )
for( int i=0; i<n; i++ )
for( int j=i; j<n; j++ ) {
c[i][j] = 0.0;
for(int k=i; k<n; k++ )
c[i][j] += a[i][k]*b[k][j];
}
• This multiplies upper-triangular matrix A with B
• Unbalanced workload
– Schedule improves this
OpenMP Jacobi
for some number of timesteps/iterations {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            temp[i][j] = 0.25 *
                ( grid[i-1][j] + grid[i+1][j]
                + grid[i][j-1] + grid[i][j+1] );

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            grid[i][j] = temp[i][j];
}
• This could be improved by using just one parallel region
• Implicit barrier after loops eliminates race on grid
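A hedged sketch of the single-parallel-region variant suggested above (numSteps is a placeholder; the bounds are restricted to interior points, which the slide's version leaves implicit):

#pragma omp parallel
{
    for (int step = 0; step < numSteps; step++) {
        #pragma omp for
        for (int i = 1; i < n-1; i++)
            for (int j = 1; j < n-1; j++)
                temp[i][j] = 0.25 *
                    ( grid[i-1][j] + grid[i+1][j]
                    + grid[i][j-1] + grid[i][j+1] );
        // implicit barrier of the first omp for prevents a race on grid
        #pragma omp for
        for (int i = 1; i < n-1; i++)
            for (int j = 1; j < n-1; j++)
                grid[i][j] = temp[i][j];
    }
}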
OpenMP Jacobi
for some number of timesteps/iterations {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            temp[i][j] = 0.25 *
                ( grid[i-1][j] + grid[i+1][j]
                + grid[i][j-1] + grid[i][j+1] );
            #pragma omp barrier
            grid[i][j] = temp[i][j];
        }
}

• Is barrier sufficient?
• What change to the code is needed?
– Recall barrier is per-team
