
Programming shared memory systems:

OpenMP
 Various methods to program shared memory systems
 Introduction to OpenMP
Programming shared memory systems

 Create multiple threads to use multiple cores


o Use a specialized programming language designed for shared memory multi-
thread programming, e.g., Cilk Plus. These are not mainstream yet.
o Use library support for shared memory multi-threading, e.g., the Python
multiprocessing shared memory module.
o Use a parallelizing compiler.
o Use compiler directives that supplement a sequential programming language,
e.g., OpenMP (with C/C++ or Fortran).
o Use processes.
o Use threads.
Programming shared memory systems

 Although there are different ways for shared memory programming,


issues in programming shared memory systems are similar. Any
shared memory programming (or parallel programming) needs to
address two issues.
o How to create multiple execution threads, mostly the fork-join model.
o How to share data and coordinate – e.g. avoiding race condition

 We will review processes and threads, but focus on OpenMP


o OpenMP is widely supported: GCC supports OpenMP
o OpenMP is relatively easy to understand and use
Use processes

 A topic in the UG OS class.


 Heavyweight
o Process creation, termination, and coordination are expensive.
 UNIX system calls: fork(), exit(), wait().
o Inter-process communication is somewhat awkward because processes are designed
for isolation, not sharing.
 Explicit shared memory operations: shmget(), shmat(), and shmctl()

 Due to this overhead, processes can be used for shared memory
programming, but they are rarely used in practice.
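
A minimal sketch (not from the lecture) of two processes sharing a counter through System V shared memory; the segment size and the printed value are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    /* create a shared memory segment large enough for one int and attach it */
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    int *counter = (int *) shmat(shmid, NULL, 0);
    *counter = 0;

    if (fork() == 0) {              /* child: write to the shared counter */
        *counter = 42;
        shmdt(counter);
        exit(0);
    }
    wait(NULL);                     /* parent: wait for the child to finish */
    printf("counter = %d\n", *counter);    /* prints 42 */

    shmdt(counter);
    shmctl(shmid, IPC_RMID, NULL);  /* detach and remove the segment */
    return 0;
}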
Use threads

 A topic in the UG OS class


 A thread is a lightweight process: creation, termination, and communication
are all much cheaper than they are for a process.
 Pthreads: POSIX-compliant threads
o Thread creation, termination, and coordination: pthread_create(), pthread_exit(),
and pthread_join().
o Communication is through shared memory. Pthread routines for coordinating memory
access: pthread_mutex_lock(), pthread_cond_wait(), pthread_cond_signal().
 Very easy to write wrong programs. Some legacy system routines are still not “thread-safe.”
o Programmer needs to explicitly manage the threads.
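
A minimal Pthreads sketch (an assumed example, not one of the lecture files) in which several threads increment a shared counter protected by a mutex; compile with gcc -pthread:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

int counter = 0;                                  /* shared by all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg) {
    pthread_mutex_lock(&lock);                    /* protect the shared update */
    counter++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main() {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);               /* wait for every thread */
    printf("counter = %d\n", counter);            /* prints 4 */
    return 0;
}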
OpenMP
 What does OpenMP stand for?
o Open specifications for Multi Processing via collaborative work between
interested parties from the hardware and software industry, government
and academia.
 OpenMP is an Application Program Interface (API) that may be used
to explicitly direct multi-threaded, shared memory parallelism.
o API components: compiler directives, runtime library routines, and
environment variables
 OpenMP is a directive-based method to invoke parallel computations on shared-
memory multiprocessors
OpenMP

 OpenMP API is specified for C/C++ and Fortran.


 OpenMP is not intrusive to the original serial code: directives
appear as comment statements in Fortran and as pragmas in C/C++.
 OpenMP website: http://www.openmp.org
o Materials in this lecture are taken from various OpenMP tutorials in the
website and other places.
Why OpenMP

 OpenMP is portable: It is jointly defined and endorsed by a group of


major hardware and software vendors.
o Hoping for an ANSI standard

 OpenMP can be implemented incrementally, one function or even


one loop at a time.
o A nice way to get a parallel program from a sequential program.
How to compile and run OpenMP programs?

 To compile: use gcc with flag -fopenmp


o Example: gcc -fopenmp a.c
o Try lect11/helloworld.cpp, lect11/helloworld1.cpp – see how easy it is to write a wrong
shared memory program.
 To run: ./a.out
o To change the number of threads (the default is the number of cores on the machine):
 setenv OMP_NUM_THREADS 4 (tcsh) or export OMP_NUM_THREADS=4 (bash)
 Putting it together, on linprog:
 gcc -fopenmp helloworld1.cpp
 export OMP_NUM_THREADS=4
 ./a.out -- run the program with 4 threads
OpenMP basic syntax

 Most of the constructs in OpenMP are compiler directives


#pragma omp construct [clause [clause] …]
o Example
#pragma omp parallel num_threads(8)
 OpenMP constructs usually apply to a structured block (one
statement).
 Function prototypes are in
#include <omp.h>
OpenMP “Hello world” program
#include <omp.h>        // OpenMP include file
#include <iostream>
#include <sstream>

using namespace std;

int main()
{
  stringstream ss[100];
  int nprocs;

  // OpenMP directive: parallel region with the default number of threads
  #pragma omp parallel
  {
    // OpenMP library routines that return the number of threads
    // and the thread ID, respectively
    int total_threads = omp_get_num_threads();
    int myid = omp_get_thread_num();
    if (myid == 0) nprocs = total_threads;
    ss[myid] << "Hello world, I am No." << myid << " out of "
             << total_threads << " Members\n";
  }
  for (int i = 0; i < nprocs; i++) {
    string a; while (ss[i] >> a) cout << a << " "; cout << "\n";
  }
  return 0;
}
OpenMP execution model

 OpenMP uses the fork-join model of parallel execution.


o All OpenMP programs begin with a single master thread.
o The master thread executes sequentially until a parallel region is encountered, at which
point it creates a team of parallel threads (FORK).
o When the team threads complete the parallel region, they synchronize and terminate,
leaving only the master thread that executes sequentially (JOIN).
OpenMP general code structure
#include <omp.h>
main () {
int var1, var2, var3;
Serial code
...
/* Beginning of parallel section. Fork a team of threads. Specify variable scoping*/
#pragma omp parallel private(var1, var2) shared(var3)
{
/* Parallel section executed by all threads */
...
/* All threads join master thread and disband*/
}
Resume serial code
...
}
An example lect11/simple_multithread.cpp
 If variables are not specified as shared or private, all variables declared outside the parallel
region are shared variables (accessible by all threads)
 What about variables declared inside the parallel block?
 What if the ‘nprocs = total_threads;’ statement in the program is moved before the if
statement?
 Easy for OpenMP to create threads
o Deterministic multi-threaded program: no race condition – no two threads access the same
memory location with at least one of them writing to it!
o Non-deterministic multi-threaded programs are mostly incorrect programs.
o Need to be mindful when making system/library routine calls.
Data model

• Private and shared variables


• Variables in the global data space are accessed
by all parallel threads (shared variables).
• Variables in a thread’s private space can only be
accessed by the thread (private variables)
• several variations, depending on the initial values and
whether the results are copied outside the region.

• OpenMP allows for specification of private and
shared variables before entering each parallel region.
for (i = 0; i < ARRAYSIZE; i++) {
  for (privindx = 0; privindx < 16; privindx++) {
    privdbl = ((double) privindx) / 16;
    y[i] = sin(exp(cos(-exp(sin(x[i]))))) + cos(privdbl);
  }
}

Can this loop run in parallel? On what condition?
If one has a choice, should one parallelize the inner or the outer loop?

#pragma omp parallel for private(privindx, privdbl)
for (i = 0; i < ARRAYSIZE; i++) {
  for (privindx = 0; privindx < 16; privindx++) {
    privdbl = ((double) privindx) / 16;
    y[i] = sin(exp(cos(-exp(sin(x[i]))))) + cos(privdbl);
  }
}

The parallel for loop index variable is private by default.
OpenMP directives

 Format:
#pragma omp directive-name [clause,..] newline
(use ‘\’ for multiple lines)
 Example:
#pragma omp parallel default(shared) private(beta,pi)

 Scope of a directive is one block of statements { …}


 More detail in the specification at http://www.openmp.org/ (OpenMP
5.0 at https://www.openmp.org/spec-html/5.0/openmp.html)
Parallel region construct
 A block of code that will be executed by multiple threads.
#pragma omp parallel [clause …]
{
……
} (implied barrier)

Clauses: if([parallel :] expression), num_threads(integer-expression), private(list),
shared(list), default(shared | none), reduction([reduction-modifier,]
reduction-identifier: list), firstprivate(list), copyin(list), proc_bind(master | close | spread),
allocate([allocator :] list)

o private(list): everything private and local (no relation with variables outside the
block).
o shared(list): data accessed by all threads
o default (none| shared)
The reduction clause (see
lect11/reduction.cpp)
sum = 0.0;
#pragma omp parallel for default(none) shared(n, x) private(i) reduction(+ : sum)
for (i = 0; i < n; i++) sum = sum + x[i];

 Updating sum must avoid a race condition


 With the reduction clause, OpenMP generates code such that the
race condition is avoided.
o See also lect11/reduction1.cpp and lect11/reduction2.cpp.
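
A sketch of what the reduction clause effectively arranges (an assumed illustration, not the contents of reduction1.cpp or reduction2.cpp): each thread accumulates into its own partial sum, and the partial sums are combined into the shared variable without a race.

double sum = 0.0;
#pragma omp parallel
{
    double local_sum = 0.0;          /* private partial sum for this thread */
    #pragma omp for
    for (int i = 0; i < n; i++)
        local_sum += x[i];
    #pragma omp critical             /* combine the partial sums safely */
    sum += local_sum;
}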
Clauses
 firstprivate(list): variables are private, but are initialized with the value
before entering the block (see the sketch after this list).
 copyin(list): The copyin clause provides a mechanism to copy the value of
a threadprivate variable of the master thread to the threadprivate variable
of each other member of the team that is executing the parallel region.
 proc_bind(master | close | spread): thread affinity related;
more on this in the next class (OpenMP for NUMA
architectures)
 allocate([allocator :] list): memory allocation related; more on this
in the next class (OpenMP for NUMA architectures)
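
A small sketch (an assumed example, not one of the lecture files) of the difference between private and firstprivate:

int offset = 10;
#pragma omp parallel firstprivate(offset)
{
    /* each thread gets its own copy of offset, initialized to 10;
       with private(offset) the copy would start uninitialized */
    offset += omp_get_thread_num();
    printf("thread %d: offset = %d\n", omp_get_thread_num(), offset);
}
/* the original offset is still 10 here */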
Work-sharing constructs
 #pragma omp for [clause …]
 #pragma omp sections [clause …]
 #pragma omp single [clause …]

 These must be enclosed in a parallel region


 No implied barrier on entry, implied barrier on exit (unless specified otherwise)
 “omp for”: loop iterations are distributed over the number of threads.
 “omp sections”: each section is assigned to one thread
 “omp single”: only one thread runs the block
The omp for directive: example
omp for shares loop iterations among the threads. It can have a schedule clause:
schedule (static | dynamic | guided [, chunk])
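
A small sketch of the schedule clause (the chunk size and the loop body, expensive_work(), are illustrative):

/* distribute iterations in chunks of 4; dynamic lets an idle thread grab
   the next chunk, which helps when iterations have uneven cost */
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < n; i++)
    a[i] = expensive_work(i);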
The omp sections directive – example (see also
lect11/simple_multithread1.cpp)
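
A minimal sketch of an omp sections region (an assumed example, not necessarily the contents of the lecture file; do_task_A and do_task_B are placeholders):

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        { do_task_A(); }      /* one thread runs task A */
        #pragma omp section
        { do_task_B(); }      /* another thread runs task B */
    }   /* implied barrier at the end of the sections construct */
}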
Omp single construct

 The single construct specifies that the associated structured block is


executed by only one of the threads in the team (not necessarily the
master thread), in the context of its implicit task. The other threads
in the team, which do not execute the block, wait at an implicit
barrier at the end of the single construct unless a nowait clause is
specified.
 See lect11/omp_single.cpp
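
A minimal sketch of the single construct (an assumed example, not necessarily the contents of lect11/omp_single.cpp; read_input() is a placeholder):

#pragma omp parallel
{
    #pragma omp single
    {
        read_input();   /* only one thread executes this block */
    }   /* the other threads wait here unless nowait is specified */
    /* ... all threads continue here ... */
}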
Some merged constructs
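
The common merged forms combine a parallel region and a work-sharing construct in a single directive, for example (a sketch; do_task_A and do_task_B are placeholders):

/* parallel region + for in one directive */
#pragma omp parallel for private(j)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 0;

/* parallel region + sections in one directive */
#pragma omp parallel sections
{
    #pragma omp section
    { do_task_A(); }
    #pragma omp section
    { do_task_B(); }
}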
Synchronization: barrier

for (i = 0; i < N; i++)
  a[i] = b[i] + c[i];
for (i = 0; i < N; i++)
  d[i] = a[i] + b[i];

 Both loops are in the parallel region with no synchronization in between.
What is the problem?
 The second loop should not start before all of the first loop finishes:

for (i = 0; i < N; i++)
  a[i] = b[i] + c[i];
#pragma omp barrier
for (i = 0; i < N; i++)
  d[i] = a[i] + b[i];
Critical section

for (i = 0; i < N; i++) {
  ……
  sum += a[i];
  ……
}

 Cannot be parallelized if sum is shared. See lect11/reduction1.cpp.
 What if we comment out ‘#pragma omp critical’ in the program?

for (i = 0; i < N; i++) {
  ……
  #pragma omp critical
  {
    sum += a[i];
  }
  ……
}
OpenMP environment variables

 OMP_NUM_THREADS
 OMP_SCHEDULE
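
OMP_SCHEDULE sets the schedule of loops declared with schedule(runtime). A small sketch (the chunk size and the loop are illustrative):

export OMP_SCHEDULE="dynamic,4"

/* in the program: the schedule is read from OMP_SCHEDULE at run time */
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++)
    a[i] = 2 * a[i];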
OpenMP runtime environment

 omp_get_num_threads()
 omp_get_thread_num()
 omp_in_parallel()
 Routines related to locks
 ……
OpenMP examples

 lect11/piomp.c
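
A typical OpenMP pi computation uses a parallel for with a reduction; a sketch under the assumption that piomp.c follows this pattern:

#include <omp.h>
#include <stdio.h>

int main() {
    long n = 100000000;
    double step = 1.0 / n, sum = 0.0;

    /* integrate 4/(1+x^2) over [0,1]; the reduction avoids a race on sum */
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi = %.10f\n", sum * step);
    return 0;
}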
Sequential Matrix Multiply

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    c[i][j] = 0;
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
  }
OpenMP Matrix Multiply (lect11/mm_omp.c)

#pragma omp parallel for private(j, k)
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    c[i][j] = 0;
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
  }
OpenMP Summary
o OpenMP provides a compact, yet powerful programming model for shared memory
programming
 It is very easy to use OpenMP to create parallel programs.
o OpenMP preserves the sequential version of the program
o Developing an OpenMP program:
 Start from a sequential program
 Identify the code segment that takes most of the time.
 Determine whether the important loops can be parallelized
 The loops may have critical sections, reduction variables, etc.
 Determine the shared and private variables.
 Add directives
