Shared Memory Parallel Programming
Introduction to OpenMP

These slides were originally written by Dr. Barbara Chapman, University of Houston
Outline
• Introduction to OpenMP
• Parallel Programming with OpenMP
– Worksharing, tasks, data environment, synchronization
• OpenMP Performance and Best Practices
• Hybrid MPI/OpenMP
• Case Studies and Examples
• Reference Materials

OpenMP* Overview

OpenMP: An API for Writing Multithreaded Applications
• Stands for Open Multi-Processing
• A set of compiler directives and library routines for parallel application programmers
• Greatly simplifies writing multi-threaded (MT) programs in Fortran, C and C++
• Standardizes the last 20 years of SMP practice

(The original slide's background shows sample OpenMP usage, e.g. #pragma omp critical, C$OMP PARALLEL DO, call omp_test_lock(jlok), setenv OMP_SCHEDULE “dynamic”.)

* The name “OpenMP” is the property of the OpenMP Architecture Review Board.
What is OpenMP?
• Industry standard for shared memory programming
  – Initial focus on scientific applications
• Main ideas:
  – Support productivity
  – Provide portability
• For Fortran, C and C++
Using OpenMP
• Widely available
• Single source code: parallel and sequential code
• Ease of use, incremental approach to programming
• Can be combined with MPI to create “hybrid” code
• Flexibility: thread IDs allow for explicit multithreaded programming
The OpenMP ARB 2008
• OpenMP is maintained by the OpenMP Architecture Review Board (the ARB), which
  – Interprets OpenMP
  – Writes new specifications - keeps OpenMP relevant
  – Works to increase the impact of OpenMP
• Members are organizations - not individuals
  – Current members (2008)
    • Permanent: AMD, Cray, Fujitsu, HP, IBM, Intel, Microsoft, NEC, PGI, SGI, Sun
    • Auxiliary: ASCI, cOMPunity, EPCC, KSL, NASA, RWTH Aachen

www.compunity.org
The OpenMP ARB 2011
• OpenMP is maintained by the OpenMP Architecture Review Board (the ARB), which
  – Interprets OpenMP
  – Writes new specifications - keeps OpenMP relevant
  – Works to increase the impact of OpenMP
• Members are organizations - not individuals
  – Current members (2011)
    • Permanent: AMD, CAPS Entreprise, Cray, Fujitsu, HP, IBM, Intel, Microsoft, NEC, Nvidia, Oracle, PGI, Texas Instruments
    • Auxiliary: ANL, cOMPunity, EPCC, NASA, LANL, ASC/LLNL, ORNL, RWTH Aachen, TACC

www.openmp.org
OpenMP Meeting 2013

OpenMP Release History
• Oct 1997 – 1.0 Fortran
• Oct 1998 – 1.0 C/C++
• Nov 1999 – 1.1 Fortran (interpretations added)
• Nov 2000 – 2.0 Fortran
• Mar 2002 – 2.0 C/C++
• May 2005 – 2.5 Fortran/C/C++ (mostly a merge)
• Apr 2008 – 3.0 Fortran/C/C++ (extensions)
• July 2011 – 3.1 Fortran/C/C++ (extensions)
• March 2013 – 4.0 Fortran/C/C++ (extensions)
• Nov 2015 – 4.5 Fortran/C/C++ (extensions)
• Nov 2018 – 5.0 Fortran/C/C++ (extensions)
• Nov 2020 – 5.1 Fortran/C/C++ (extensions)

• Committees work to maintain the API and keep it relevant:
  – “Keep it simple”
  – As far as possible, keep implementations consistent

www.openmp.org
OpenMP Overview
• A set of compiler directives inserted in the source program
• Also some library functions
• Ideally, the compiler directives do not affect sequential code
  – pragmas in C/C++
  – (specially written) comments in Fortran code
The OpenMP Shared Memory API
• High-level directive-based multithreaded programming
  – The user makes strategic decisions
  – Compiler figures out details
  – Threads communicate by sharing variables
  – Synchronization to order accesses and prevent data conflicts
  – Structured programming to reduce likelihood of bugs

#pragma omp parallel
#pragma omp for schedule(dynamic)
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
}   /* implicit barrier here */
Summary: What is OpenMP?
• De-facto standard API to write shared memory parallel applications in C, C++, and Fortran
• Consists of:
  – Compiler directives
  – Runtime routines
  – Environment variables
• Initial version released end of 1997
  – For Fortran only
  – Subsequent releases for C, C++
• Version 2.5 merged the specs for all three languages
OpenMP Components

Directives:
• Parallel region
• Worksharing constructs
• Tasking
• Synchronization
• Data-sharing attributes

Runtime environment (library routines):
• Number of threads
• Thread ID
• Dynamic thread adjustment
• Nested parallelism
• Schedule
• Active levels
• Thread limit
• Nesting level
• Ancestor thread
• Team size
• Locking
• Wallclock timer

Environment variables:
• Number of threads
• Scheduling type
• Dynamic thread adjustment
• Nested parallelism
• Stacksize
• Idle threads
• Active levels
• Thread limit
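As a small illustration (not from the original slides) of a few of the runtime routines and environment controls listed above, the sketch below queries the thread ID, the team size, and the wallclock timer; the number of threads could equally be set with the OMP_NUM_THREADS environment variable instead of omp_set_num_threads().

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double t0 = omp_get_wtime();          /* wallclock timer */
    omp_set_num_threads(4);               /* number of threads (or set OMP_NUM_THREADS) */

    #pragma omp parallel                  /* parallel region directive */
    {
        int id = omp_get_thread_num();    /* thread ID */
        int nt = omp_get_num_threads();   /* team size */
        printf("thread %d of %d\n", id, nt);
    }

    printf("elapsed: %f s\n", omp_get_wtime() - t0);
    return 0;
}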
OpenMP Syntax
• Most OpenMP constructs are compiler directives using pragmas.
– For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]…]
– For Fortran, the directives take one of the forms:
• Fixed form
*$OMP construct [clause [clause]…]
C$OMP construct [clause [clause]…]
• Free form (but works for fixed form too)
!$OMP construct [clause [clause]…]
• Include file and the OpenMP lib module
  #include <omp.h>
  use omp_lib

OpenMP sentinel forms: #pragma omp (C/C++) and !$OMP (Fortran)
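As a small illustrative example (not from the slides), a C construct with two clauses and the include file might look like this; the function and clause choices here are only one possibility.

#include <omp.h>   /* prototypes and types for the runtime routines */

void scale(int n, double *a, double c)
{
    /* "construct [clause [clause]...]": a parallel for with a schedule
       clause and a data-sharing clause */
    #pragma omp parallel for schedule(static) firstprivate(c)
    for (int i = 0; i < n; i++)
        a[i] = c * a[i];
}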
Idea of OpenMP
Sequential code:
statement1;
statement2;
statement3;
Assume we want to execute statement 2 in parallel, and statements 1 and 3 sequentially.
Idea of OpenMP

statement1;
#pragma omp <specific OpenMP directive>
statement2;
statement3;

Statement 2 may be executed in parallel.
Statements 1 and 3 are executed sequentially.
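A concrete (illustrative) instance of this pattern, assuming the directive is a parallel loop:

#include <stdio.h>
#define N 8

int main(void)
{
    double a[N];

    for (int i = 0; i < N; i++)      /* statement 1: sequential          */
        a[i] = i;

    #pragma omp parallel for         /* the specific OpenMP directive    */
    for (int i = 0; i < N; i++)      /* statement 2: may run in parallel */
        a[i] = 2.0 * a[i];

    for (int i = 0; i < N; i++)      /* statement 3: sequential again    */
        printf("%g ", a[i]);
    printf("\n");
    return 0;
}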
Idea of OpenMP

statement1
!$OMP <specific OpenMP directive>
statement2
!$OMP END <specific OpenMP directive>
statement3

Statement 2 may be executed in parallel.
Statements 1 and 3 are executed sequentially.
Basic Idea of OpenMP
• Program has sequential parts and parallel parts
• Initial (master) thread executes the sequential parts
• Master and slaves execute the parallel parts
  – Initial thread creates a team of slave threads and becomes master of the team
  – fork-join approach
OpenMP Fork-Join Execution Model
• Master thread spawns multiple worker threads as needed; together they form a team
• A parallel region is a block of code executed by all threads in a team simultaneously

(Figure: the master thread forks a team of worker threads at each parallel region and joins them at a barrier at its end; parallel regions may be nested.)
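A minimal sketch of fork-join with one nested region follows; it assumes an OpenMP 3.0 or later compiler (omp_set_max_active_levels enables the inner fork; older codes used omp_set_nested).

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_max_active_levels(2);            /* allow one level of nesting */

    #pragma omp parallel num_threads(2)      /* outer fork: master + 1 worker */
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel num_threads(2)  /* nested fork inside the region */
        printf("outer %d, inner %d\n", outer, omp_get_thread_num());
    }                                        /* implicit barrier, then join */
    return 0;
}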
Basic Idea of OpenMP
• Each thread performs part of the work
• One thread per processor (or more if multicore or multithreading)
  – But not time-slicing
• Sequential parts executed by a single thread
• Dependences in parallel parts require synchronization between threads
Role of User
• User inserts directives telling the compiler how statements are to be executed
  – what parts of the program are parallel
  – how to assign code in parallel regions to threads
  – what data is private (local) to threads
• Compiler generates explicit threaded code
Role of User
• User must remove any dependences in parallel parts
  – Or introduce appropriate synchronization
• The OpenMP compiler does not check for them!
  – It is up to the programmer to ensure correctness
  – Some tools exist to help check this
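For illustration (not from the slides): the first loop below has a loop-carried dependence, so simply adding a parallel-for directive to it would produce wrong results and the compiler would not complain; the second loop's iterations are independent and safe to parallelize.

void example(int n, double *a, double *b)
{
    /* NOT safe: iteration i reads a[i-1] written by iteration i-1 */
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];

    /* Safe: iterations are independent, no synchronization needed */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
}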
OpenMP Compiler
• OpenMP: thread programming at a “high level”
  – The user does not need to specify all the details
    • Assignment of work to threads
    • Creation of threads
• Compiler figures out details
  – Generates multithreaded code with calls to its runtime
  – Runtime starts threads, passes work to them, organizes synchronization
• Compiler must be “told” to process OpenMP
  – Compiler flags (non-standard) enable OpenMP (e.g. -openmp, -xopenmp, -fopenmp, -mp)
  – Otherwise the OpenMP directives are ignored
Status of OpenMP Implementation
• OpenMP compiler translates code and user directives into a multithreaded application
  – It is part of most standard compilers today
• Works on true shared memory machines (SMPs) and DSM architectures
• The runtime is custom: each compiler has its own
• We look briefly at implementation strategy later
OpenMP Usage

(Diagram: annotated Fortran/C/C++ source compiled with a sequential compiler yields a sequential program; compiled with an OpenMP compiler it yields a parallel program.)
OpenMP Usage
• If the program is compiled sequentially
  – OpenMP comments and pragmas are ignored
• If the code is compiled for parallel execution
  – comments and/or pragmas are read, and
  – drive translation into a parallel program
• Ideally, one source for both the sequential and the parallel program (a big maintenance plus)
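A minimal sketch (not from the slides) of this single-source idea: compiled without OpenMP support the pragma is ignored and the standard _OPENMP macro is undefined, so the same file builds as either a sequential or a parallel program.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>          /* only available when compiling with OpenMP */
#endif

int main(void)
{
    #pragma omp parallel  /* ignored by a sequential compile */
    {
#ifdef _OPENMP
        printf("hello from thread %d\n", omp_get_thread_num());
#else
        printf("hello from the sequential program\n");
#endif
    }
    return 0;
}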
Where Does OpenMP Run?
• Shared Memory Systems – Available
• Distributed Shared Memory Systems (ccNUMA) – Available
• Distributed Memory Systems – via Software DSM
• Chip MultiThreading (Hyperthreading and other kinds of chip multithreading) – Available

(Figure: a shared memory architecture – several CPUs, each with its own cache, attached to shared memory over a shared bus.)
Are Caches “Coherent” or Not?
• Coherence means different copies of the same location have the same value; incoherent otherwise:
  – p1 and p2 both have cached copies of data (= 0)
  – p1 writes data = 1
    • May “write through” to memory
  – p2 reads data, but gets the “stale” cached copy
    • This may happen even if p2 read an updated value of another variable, flag, that came from memory

(Figure: memory holds data = 0; after the write, p1’s cache holds data = 1 while p2’s cache still holds the stale data = 0.)
OpenMP Memory Model
• OpenMP assumes a shared memory
• Threads communicate by sharing variables
• Synchronization protects against data conflicts
  – Synchronization is expensive
  – Change how data is accessed to minimize the need for synchronization
How do threads interact?
• OpenMP is a shared memory model.
• Threads interact (“communicate”) by sharing variables.
• Unintended sharing of data causes race conditions:
  – the program’s outcome may change if the threads are scheduled differently
• To prevent race conditions:
  – Use synchronization to order data access and protect data conflicts
• Synchronization is expensive, so:
  – Do what you can to change how data is accessed to minimize the need for synchronization (see the sketch below)
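A minimal sketch (illustrative, not from the slides): each iteration updates the shared variable sum, which is a race without synchronization; an atomic directive, or better a reduction clause, removes the conflict.

#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic          /* without this the update races */
        sum += 1.0 / (i + 1);
    }
    /* Cheaper alternative: #pragma omp parallel for reduction(+:sum) */

    printf("sum = %f\n", sum);
    return 0;
}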
OpenMP Parallel Computing Solution Stack

(Diagram, top to bottom: End User; Application; the OpenMP layer of directives/compiler, OpenMP library, and environment variables; the runtime library; the OS/system.)
Parallel Regions
• You create threads in OpenMP with the “omp parallel” pragma.
• For example, to create a 4-thread parallel region:

double A[1000];
omp_set_num_threads(4);              /* runtime function to request a certain number of threads */
#pragma omp parallel
{
    int ID = omp_get_thread_num();   /* runtime function returning a thread ID */
    pooh(ID, A);                     /* each thread executes a copy of the code
                                        within the structured block */
}

• Each thread calls pooh(ID,A) for ID = 0 to 3
Parallel Regions
• Each thread executes the same code redundantly.

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    pooh(ID, A);
}
printf("all done\n");

• A single copy of A is shared between all threads.
• The calls pooh(0,A), pooh(1,A), pooh(2,A), pooh(3,A) execute concurrently.
• Threads wait at the end of the parallel region for all threads to finish before proceeding (i.e. a barrier); only then is printf("all done\n") executed.
OpenMP: Structured blocks (C/C++)
• Most constructs apply to structured blocks
• Structured block: a block with one point of entry at the top and one point of exit at the bottom.
• The only “branches” allowed are STOP statements in Fortran and exit() in C/C++

A structured block (OK):

#pragma omp parallel
{
    int id = omp_get_thread_num();
more: res[id] = do_big_job(id);
    if (!conv(res[id])) goto more;    /* branch stays inside the block */
}
printf(" All done \n");

Not a structured block (NOT OK):

if (go_now()) goto more;              /* jumps into the block */
#pragma omp parallel
{
    int id = omp_get_thread_num();
more: res[id] = do_big_job(id);
    if (conv(res[id])) goto done;     /* jumps out of the block */
    goto more;
}
done: if (!really_done()) goto more;
OpenMP Parallel Regions
• In C/C++: a block is a single statement or a group of statements between { }

#pragma omp parallel
{
    id = omp_get_thread_num();
    res[id] = lots_of_work(id);
}

#pragma omp parallel for
for (i = 0; i < N; i++) {
    res[i] = big_calc(i);
    A[i] = B[i] + res[i];
}

• In Fortran: a block is a single statement or a group of statements between directive/end-directive pairs.

C$OMP PARALLEL
10    wrk(id) = garbage(id)
      res(id) = wrk(id)**2
      if(.not.conv(res(id))) goto 10
C$OMP END PARALLEL

C$OMP PARALLEL DO
      do i=1,N
        res(i) = bigComp(i)
      end do
C$OMP END PARALLEL DO
Scope of OpenMP Region
A parallel region can span multiple source files.

foo.f:
C$OMP PARALLEL
      call whoami
C$OMP END PARALLEL

bar.f:
      subroutine whoami
      external omp_get_thread_num
      integer iam, omp_get_thread_num
      iam = omp_get_thread_num()
C$OMP CRITICAL
      print*, 'Hello from ', iam
C$OMP END CRITICAL
      return
      end

• The static/lexical extent of the parallel region is the code between the PARALLEL directives in foo.f.
• The dynamic extent of the parallel region also includes the code executed inside whoami.
• Orphaned directives (here the CRITICAL construct) can appear outside the lexical extent of a parallel construct.
Exercise:
A multi-threaded “Hello world” program
• Write a multithreaded program where each thread prints “hello world”.
• Sequential starting point:

#include <stdio.h>   /* needed for printf */
void main()
{
    int ID = 0;
    printf(" hello(%d) ", ID);
    printf(" world(%d) \n", ID);
}
A multi-threaded “Hello world” program
• Write a multithreaded program where each thread prints “hello world”.

#include <stdio.h>               /* needed for printf */
#include "omp.h"                 /* OpenMP include file */
void main()
{
    #pragma omp parallel         /* parallel region with default number of threads */
    {
        int ID = omp_get_thread_num();   /* runtime library function to return a thread ID */
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }                            /* end of the parallel region */
}

Sample output:
hello(1) hello(0) world(1)
world(0)
hello(3) hello(2)
world(2)
world(3)
