Original author(s) | OpenMP Architecture Review Board[1] |
---|---|
Developer(s) | OpenMP Architecture Review Board[1] |
Stable release | 3.1[2] / July 9, 2011 |
Written in | C, C++, Fortran |
Operating system | Cross-platform |
Platform | Cross-platform |
Type | API |
License | Various[3] |
Website | openmp.org |
OpenMP (Open Multiprocessing) is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran, on most processor architectures and operating systems, including Solaris, AIX, HP-UX, GNU/Linux, Mac OS X, and Windows platforms. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.[4][5][6]
OpenMP is managed by the nonprofit technology consortium OpenMP Architecture Review Board (or OpenMP ARB). It is jointly defined by a group of major computer hardware and software vendors, including AMD, IBM, Intel, Cray, HP, Fujitsu, Nvidia, NEC, Microsoft, Texas Instruments, Oracle Corporation, and others.[7]
OpenMP uses a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the standard desktop computer to the supercomputer.[8]
An application built with the hybrid model of parallel programming can run on a computer cluster using both OpenMP and Message Passing Interface (MPI), or more transparently through the use of OpenMP extensions for non-shared memory systems.
OpenMP is an implementation of multithreading, a method of parallelizing whereby a master thread (a series of instructions executed consecutively) forks a specified number of slave threads and a task is divided among them. The threads then run concurrently, with the runtime environment allocating threads to different processors.
The section of code that is meant to run in parallel is marked with a preprocessor directive, which causes the threads to form before the section is executed. Each thread has an ID, which can be obtained by calling the function omp_get_thread_num(). The thread ID is an integer, and the master thread has an ID of 0. After the parallelized code has executed, the threads join back into the master thread, which continues to the end of the program.
By default, each thread executes the parallelized section of code independently. Work-sharing constructs can be used to divide a task among the threads so that each thread executes its allocated part of the code. Both task parallelism and data parallelism can be achieved using OpenMP in this way.
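The difference between the two styles of work sharing can be sketched with one parallel region that contains two work-sharing constructs. The sections construct hands different pieces of work to different threads (task parallelism), and the for construct divides the iterations of a single loop among the threads (data parallelism). This is a minimal illustration, not taken from the specification: the helper names sum_range and split_work and the size N are hypothetical. It uses no runtime routines, so it also compiles without -fopenmp, in which case the pragmas are ignored and it runs serially.

<source lang=c>
#include <assert.h>
#include <stdio.h>

#define N 1000 /* hypothetical problem size */

/* Hypothetical helper: sums the integers lo, lo+1, ..., hi-1. */
static long sum_range(long lo, long hi)
{
    long s = 0;
    for (long x = lo; x < hi; x++)
        s += x;
    return s;
}

/* One parallel region containing two work-sharing constructs. */
static void split_work(long *first_half, long *second_half, int squares[N])
{
    #pragma omp parallel
    {
        /* Task parallelism: each section is a different piece of work,
           and each section is executed by one thread of the team. */
        #pragma omp sections
        {
            #pragma omp section
            *first_half = sum_range(0, N / 2);

            #pragma omp section
            *second_half = sum_range(N / 2, N);
        } /* implicit barrier at the end of the sections construct */

        /* Data parallelism: the iterations of one loop are divided
           among the threads of the team. */
        #pragma omp for
        for (int i = 0; i < N; i++)
            squares[i] = i * i;
    }
}

int main(void)
{
    static int squares[N];
    long first, second;

    split_work(&first, &second, squares);
    assert(first + second == 499500L); /* 0 + 1 + ... + 999 */
    printf("total = %ld, squares[10] = %d\n", first + second, squares[10]);
    return 0;
}
</source>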
The runtime environment allocates threads to processors depending on usage, machine load and other factors. The number of threads can be assigned by the runtime environment based on environment variables, or in code using functions. The OpenMP functions are declared in a header file named omp.h in C/C++.
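The two in-code ways of choosing the thread count can be sketched as follows. omp_set_num_threads() sets the default for later parallel regions, and the num_threads clause requests a count for a single region. In each region the master thread (ID 0) records the team size, which remains visible after the threads join. This is a sketch under stated assumptions: the function names team_size_default and team_size_requested and the counts 4 and 2 are arbitrary, and the runtime may grant fewer threads than requested. It calls routines declared in omp.h, so compile it with -fopenmp.

<source lang=c>
#include <assert.h>
#include <stdio.h>
#include <omp.h>

/* Team size for a region that uses the current default thread count.
   The master thread (ID 0) records the size; the implicit barrier at
   the end of the region makes the value visible after the join. */
static int team_size_default(void)
{
    int team = 0;
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            team = omp_get_num_threads();
    }
    return team;
}

/* Team size for a region whose thread count is requested in code. */
static int team_size_requested(int n)
{
    int team = 0;
    #pragma omp parallel num_threads(n)
    {
        if (omp_get_thread_num() == 0)
            team = omp_get_num_threads();
    }
    return team;
}

int main(void)
{
    omp_set_num_threads(4); /* default for later regions without a clause */
    int by_default = team_size_default();
    int by_clause = team_size_requested(2);

    /* The runtime may grant fewer threads than requested. */
    assert(by_default >= 1 && by_default <= 4);
    assert(by_clause >= 1 && by_clause <= 2);
    printf("default: %d threads, num_threads(2): %d threads\n",
           by_default, by_clause);
    return 0;
}
</source>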
The OpenMP Architecture Review Board (ARB) published its first API specifications, OpenMP for Fortran 1.0, in October 1997. In October of the following year they released the C/C++ standard. Version 2.0 of the Fortran specifications appeared in 2000, and version 2.0 of the C/C++ specifications was released in 2002. Version 2.5, a combined C/C++/Fortran specification, was released in 2005.
Version 3.0 was released in May 2008. Among the new features in 3.0 are the concept of tasks and the task construct. These new features are summarized in Appendix F of the OpenMP 3.0 specification.
Version 3.1 of the OpenMP specification was released July 9, 2011.
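The task construct introduced in version 3.0 can be sketched with the common recursive Fibonacci illustration. Each recursive call is spawned as a task, and taskwait waits for both child tasks before their results are combined. The function name fib and the argument 20 are arbitrary choices. Production code would usually stop creating tasks below some cutoff size, but this sketch keeps every task to show the construct. Without -fopenmp the pragmas are ignored and the recursion simply runs serially.

<source lang=c>
#include <assert.h>
#include <stdio.h>

/* Recursive Fibonacci in which each recursive call becomes a task.
   x and y must be shared: local variables of the enclosing function
   would otherwise be firstprivate inside the tasks. */
static long fib(int n)
{
    long x, y;
    if (n < 2)
        return n;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait /* wait for both child tasks to finish */
    return x + y;
}

int main(void)
{
    long result = 0;
    #pragma omp parallel
    {
        /* One thread creates the root task; the whole team runs the tasks. */
        #pragma omp single
        result = fib(20);
    }
    assert(result == 6765);
    printf("fib(20) = %ld\n", result);
    return 0;
}
</source>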
The core elements of OpenMP are the constructs for thread creation, workload distribution (work sharing), data-environment management, thread synchronization, user-level runtime routines and environment variables.
In C/C++, OpenMP uses #pragma directives. The OpenMP-specific pragmas are listed below:
omp parallel. It is used to fork additional threads to carry out the work enclosed in the construct in parallel. The original thread is denoted as the master thread and has thread ID 0.
Example (C program): Display "Hello, world" using multiple threads.
<source lang=c>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    printf("Hello, world.\n");
    return 0;
}
</source>
Use the flag -fopenmp to compile using GCC:
<source lang=bash>
$ gcc -fopenmp hello.c -o hello
</source>
Output on a computer with two cores and two threads:
<source lang=bash>
Hello, world.
Hello, world.
</source>
However, the output may also be garbled because of the race condition caused by the two threads sharing the standard output:
<source lang=bash>
Hello, wHello, woorld.
rld.
</source>
Work-sharing constructs are used to specify how to assign independent work to one or all of the threads.
Example: initialize the value of a large array in parallel, using each thread to do part of the work.
<source lang=c>
#define N 100000

int main(void)
{
    static int a[N]; /* static storage: a large local array can overflow the stack */
    int i;
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2 * i;
    return 0;
}
</source>
Since OpenMP is a shared memory programming model, most variables in OpenMP code are visible to all threads by default. Sometimes, however, private variables are necessary to avoid race conditions, and values must be passed between the sequential part and the parallel region (the code block executed in parallel). Data environment management therefore uses data-sharing attribute clauses, which are appended to the OpenMP directive. Commonly used clauses include shared, private, firstprivate, lastprivate and reduction.
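The effect of several data-sharing attribute clauses on a single loop can be sketched as follows. A firstprivate copy starts with the original value, a private copy starts uninitialized, a lastprivate variable receives the value from the sequentially last iteration, and a shared array is written at distinct elements by different threads. The function name fill, the variable names and the loop length are illustrative assumptions. The program calls no runtime routines, so it also compiles, and runs serially, without -fopenmp.

<source lang=c>
#include <assert.h>
#include <stdio.h>

#define N 8 /* hypothetical loop length */

/* Fills out[] and reports, through *last_out, the value that the
   lastprivate variable holds after the loop. */
static void fill(int out[N], int *last_out)
{
    int offset = 100; /* firstprivate: every thread's copy starts at 100 */
    int scratch;      /* private: every thread gets its own uninitialized copy */
    int last = -1;    /* lastprivate: copied back from iteration N-1 */

    #pragma omp parallel for firstprivate(offset) private(scratch) \
                             lastprivate(last) shared(out)
    for (int i = 0; i < N; i++) {
        scratch = 2 * i;           /* no race: scratch is per-thread */
        out[i] = scratch + offset; /* distinct elements of a shared array */
        last = i;
    }
    *last_out = last;
}

int main(void)
{
    int out[N];
    int last;

    fill(out, &last);
    assert(last == N - 1 && out[3] == 106);
    for (int i = 0; i < N; i++)
        printf("%d ", out[i]);
    printf("\nlast = %d\n", last);
    return 0;
}
</source>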
User-level runtime routines are used to modify and check the number of threads, detect whether the execution context is in a parallel region, find out how many processors the current system has, set and unset locks, call timing functions, etc.
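Several of these runtime routines can be sketched together: omp_get_num_procs(), omp_in_parallel(), omp_get_wtime(), and the simple lock routines omp_init_lock(), omp_set_lock(), omp_unset_lock() and omp_destroy_lock(). The function name count_with_lock is hypothetical. The program must be compiled with -fopenmp so that the routines declared in omp.h are linked in. The printed processor count and timing depend on the machine.

<source lang=c>
#include <assert.h>
#include <stdio.h>
#include <omp.h>

/* Every thread of the team increments a shared counter under a lock;
   returns the final count, i.e. the team size. */
static int count_with_lock(void)
{
    omp_lock_t lock;
    int counter = 0;

    omp_init_lock(&lock);
    #pragma omp parallel
    {
        omp_set_lock(&lock);   /* blocks until the lock is available */
        counter++;             /* only one thread at a time runs this */
        omp_unset_lock(&lock);
    }
    omp_destroy_lock(&lock);
    return counter;
}

int main(void)
{
    int procs = omp_get_num_procs();  /* processors available to the program */
    int in_par = omp_in_parallel();   /* 0 here: not inside a parallel region */
    double start = omp_get_wtime();   /* wall-clock time in seconds */
    int counted = count_with_lock();
    double elapsed = omp_get_wtime() - start;

    assert(procs >= 1 && in_par == 0 && counted >= 1);
    printf("%d processors, in parallel: %d, %d threads counted in %f s\n",
           procs, in_par, counted, elapsed);
    return 0;
}
</source>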
Environment variables are a way to alter the execution features of OpenMP applications. They are used to control the scheduling of loop iterations, the default number of threads, etc. For example, OMP_NUM_THREADS is used to specify the number of threads for an application.
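A program can observe what these environment variables select at run time. The sketch below reads the default thread count with omp_get_max_threads() and the run-time schedule with omp_get_schedule(). It also runs a loop with schedule(runtime), which defers the choice of schedule to OMP_SCHEDULE. The function name runtime_sum and the loop length are hypothetical. Compile it with -fopenmp.

<source lang=c>
#include <assert.h>
#include <stdio.h>
#include <omp.h>

#define N 1000 /* hypothetical loop length */

/* Sums 0..N-1 with a loop whose schedule is chosen at run time from
   the OMP_SCHEDULE environment variable (schedule(runtime)). */
static long runtime_sum(void)
{
    long total = 0;
    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += i;
    return total;
}

int main(void)
{
    /* Upper bound for the next team; reflects OMP_NUM_THREADS if set. */
    int max_threads = omp_get_max_threads();

    /* The run-time schedule; reflects OMP_SCHEDULE if set. */
    omp_sched_t kind;
    int chunk;
    omp_get_schedule(&kind, &chunk);

    long total = runtime_sum();
    assert(max_threads >= 1 && total == 499500L);
    printf("max threads: %d, schedule kind: %d, chunk: %d, total: %ld\n",
           max_threads, (int)kind, chunk, total);
    return 0;
}
</source>

Running the compiled program in a POSIX shell as OMP_NUM_THREADS=4 OMP_SCHEDULE=dynamic,2 ./a.out should change the reported thread count and schedule without recompiling, while the computed total stays the same.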
In this section, some sample programs are provided to illustrate the concepts explained above.
Hello World is a basic program that exercises the parallel, private and barrier directives, and the functions omp_get_thread_num and omp_get_num_threads (which should not be confused with each other).
This C program can be compiled using gcc-4.4 with the flag -fopenmp:
<source lang=c>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier
        if ( th_id == 0 ) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return EXIT_SUCCESS;
}
</source>
This C++ program can be compiled using GCC: gcc -Wall -fopenmp test.cpp -lstdc++
NOTE: The IOstreams library is not thread-safe. Therefore cout calls, for instance, must be executed in critical sections or by only one thread (e.g. the master thread).
<source lang=cpp>
#include <iostream>
#include <omp.h>

using namespace std;

int main(int argc, char *argv[])
{
    int th_id, nthreads;
    #pragma omp parallel private(th_id) shared(nthreads)
    {
        th_id = omp_get_thread_num();
        #pragma omp critical
        {
            cout << "Hello World from thread " << th_id << '\n';
        }
        #pragma omp barrier

        #pragma omp master
        {
            nthreads = omp_get_num_threads();
            cout << "There are " << nthreads << " threads" << '\n';
        }
    }
    return 0;
}
</source>
Here is a Fortran 77 version.
<source lang=fortran>
      PROGRAM HELLO
      INTEGER ID, NTHRDS
      INTEGER OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
C$OMP PARALLEL PRIVATE(ID)
      ID = OMP_GET_THREAD_NUM()
      PRINT *, 'HELLO WORLD FROM THREAD', ID
C$OMP BARRIER
      IF ( ID .EQ. 0 ) THEN
        NTHRDS = OMP_GET_NUM_THREADS()
        PRINT *, 'THERE ARE', NTHRDS, 'THREADS'
      END IF
C$OMP END PARALLEL
      END
</source>
Here is a Fortran 90 free form version.
<source lang=fortran>
program hello90
  use omp_lib
  integer :: id, nthreads
  !$omp parallel private(id)
  id = omp_get_thread_num()
  write (*,*) 'Hello World from thread', id
  !$omp barrier
  if ( id == 0 ) then
    nthreads = omp_get_num_threads()
    write (*,*) 'There are', nthreads, 'threads'
  end if
  !$omp end parallel
end program
</source>
The application of some OpenMP clauses is illustrated in the simple examples in this section. The piece of code below updates the elements of an array b by performing a simple operation on the elements of an array a. The parallelization is done by the OpenMP directive #pragma omp, and the scheduling of loop iterations is dynamic. Notice how the iteration counters j and k have to be made private, whereas the primary iteration counter i is private by default. The work of running through i is divided among multiple threads. Each thread creates its own versions of j and k on its execution stack, so it carries out the full work allocated to it and updates its part of the array b at the same time as the other threads.
<source lang=c>
#define CHUNKSIZE 1 /* defines the chunk size as 1 contiguous iteration */

/* forks off the threads */
#pragma omp parallel private(j,k)
{
    /* starts the work-sharing construct */
    #pragma omp for schedule(dynamic, CHUNKSIZE)
    for (i = 2; i <= N-1; i++)
        for (j = 2; j <= i; j++)
            for (k = 1; k <= M; k++)
                b[i][j] += a[i-1][j]/k + a[i+1][j]/k;
}
</source>
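The fragment above leaves N, M, the arrays and the loop counters undeclared. A self-contained version can be sketched under the following assumptions: the sizes N = 200 and M = 50, double-precision arrays, and the hypothetical function update are all illustrative. Because a is read at row i+1 for i up to N-1, it needs N+1 rows. The if clause lets the same loop nest run serially, which gives a reference result for comparison.

<source lang=c>
#include <assert.h>
#include <stdio.h>

#define N 200       /* hypothetical sizes; the fragment leaves them undefined */
#define M 50
#define CHUNKSIZE 1 /* defines the chunk size as 1 contiguous iteration */

/* a is read at rows i-1 and i+1 with i up to N-1, so it needs N+1 rows. */
static double a[N + 1][N + 1];
static double b[N + 1][N + 1];
static double b_serial[N + 1][N + 1];

/* The loop nest from the fragment; use_threads = 0 disables the fork
   through the if clause, giving a serial reference result. */
static void update(double out[N + 1][N + 1], int use_threads)
{
    int i, j, k;
    #pragma omp parallel if (use_threads) private(j, k)
    {
        #pragma omp for schedule(dynamic, CHUNKSIZE)
        for (i = 2; i <= N - 1; i++)   /* i is made private by the for construct */
            for (j = 2; j <= i; j++)
                for (k = 1; k <= M; k++)
                    out[i][j] += a[i - 1][j] / k + a[i + 1][j] / k;
    }
}

int main(void)
{
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            a[i][j] = (double)(i + j); /* arbitrary test data */

    update(b, 1);
    update(b_serial, 0);

    /* Each b[i][j] is computed by one thread in the same order as the
       serial loop, so the results match exactly. */
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            assert(b[i][j] == b_serial[i][j]);
    printf("b[10][5] = %f\n", b[10][5]);
    return 0;
}
</source>

Dynamic scheduling with a chunk size of 1 suits this loop because the inner j loop grows with i, so later iterations of i cost more than earlier ones.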
The next piece of code is a common use of the reduction clause to calculate reduced sums. Here, we add up all the elements of an array a, each multiplied by an i-dependent weight, using a for loop that we parallelize with OpenMP directives and the reduction clause. The scheduling is kept static.
<source lang=c>
#define N 10000 /* size of a */

void calculate(long *); /* the function that calculates the elements of a */
int i;
long w;
long a[N];
calculate(a);
long sum = 0;

/* forks off the threads and starts the work-sharing construct */
#pragma omp parallel for private(w) reduction(+:sum) schedule(static,1)
for (i = 0; i < N; i++) {
    w = i*i;
    sum = sum + w*a[i];
}
printf("\n %li", sum);
</source>
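A self-contained version of the reduction example can be sketched with a stand-in for calculate(), whose real body the fragment does not show. The hypothetical version here fills a with i % 7. The sketch uses long long instead of long, since the weighted sum can exceed 32 bits, which is the size of long on some platforms. A serial loop provides the reference value.

<source lang=c>
#include <assert.h>
#include <stdio.h>

#define N 10000 /* size of a */

/* Hypothetical stand-in for calculate(): fills a with small test values. */
static void calculate(long long *a)
{
    for (int i = 0; i < N; i++)
        a[i] = i % 7;
}

/* Sum of a[i] weighted by i*i, parallelized with a reduction clause. */
static long long weighted_sum(const long long *a)
{
    long long sum = 0, w;
    int i;
    #pragma omp parallel for private(w) reduction(+:sum) schedule(static,1)
    for (i = 0; i < N; i++) {
        w = (long long)i * i;
        sum = sum + w * a[i];
    }
    return sum;
}

int main(void)
{
    static long long a[N];
    long long expected = 0;

    calculate(a);
    for (int i = 0; i < N; i++)
        expected += (long long)i * i * a[i]; /* serial reference */

    long long sum = weighted_sum(a);
    assert(sum == expected); /* integer sums do not depend on the order */
    printf("\n %lld", sum);
    return 0;
}
</source>

Exact equality with the serial result holds here because integer addition is associative. With floating-point data, a reduction may combine partial sums in a different order and give slightly different results.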
An equivalent but less elegant implementation of the above code creates a local sum variable for each thread ("loc_sum") and makes a protected update of the global variable sum at the end, through the critical directive. This protection is crucial: without it, two threads could update sum at the same time and one of the updates could be lost.
<source lang=c>
...
long sum = 0, loc_sum;
/* forks off the threads and starts the work-sharing construct */
#pragma omp parallel private(w,loc_sum)
{
    loc_sum = 0;
    #pragma omp for schedule(static,1)
    for (i = 0; i < N; i++) {
        w = i*i;
        loc_sum = loc_sum + w*a[i];
    }
    #pragma omp critical
    sum = sum + loc_sum;
}
printf("\n %li", sum);
</source>
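The per-thread partial-sum pattern can be made self-contained as follows. The function name weighted_sum_critical, the i % 7 test data and the use of long long are assumptions for illustration. One variation from the fragment is that loc_sum and w are declared inside the parallel region and the loop. Variables declared there are automatically private, so no private clause is needed.

<source lang=c>
#include <assert.h>
#include <stdio.h>

#define N 10000 /* size of a */

/* Weighted sum computed with per-thread partial sums that are combined
   inside a critical section. */
static long long weighted_sum_critical(const long long *a)
{
    long long sum = 0;
    #pragma omp parallel
    {
        long long loc_sum = 0; /* declared inside the region: private */
        #pragma omp for schedule(static,1)
        for (int i = 0; i < N; i++) {
            long long w = (long long)i * i;
            loc_sum = loc_sum + w * a[i];
        }
        #pragma omp critical
        sum = sum + loc_sum; /* one thread at a time updates the shared sum */
    }
    return sum;
}

int main(void)
{
    static long long a[N];
    long long expected = 0;

    for (int i = 0; i < N; i++) {
        a[i] = i % 7;                         /* arbitrary test data */
        expected += (long long)i * i * a[i];  /* serial reference */
    }
    long long sum = weighted_sum_critical(a);
    assert(sum == expected);
    printf("\n %lld", sum);
    return 0;
}
</source>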
OpenMP has been implemented in many commercial compilers. For instance, Visual C++ 2005, 2008 and 2010 support it (in their Professional, Team System, Premium and Ultimate editions[9][10][11]), as well as Intel Parallel Studio for various processors.[12] Oracle Solaris Studio compilers and tools support the latest OpenMP specifications with productivity enhancements for Solaris OS (UltraSPARC and x86/x64) and Linux platforms. The Fortran, C and C++ compilers from The Portland Group also support OpenMP 2.5. GCC has also supported OpenMP since version 4.2.
A few compilers have early implementations of OpenMP 3.0.
Sun Studio 12 update 1 has a full implementation of OpenMP 3.0.[14]
Pros
Cons
One might expect to get an N-times speedup when running a program parallelized using OpenMP on an N-processor platform. However, this seldom occurs for these reasons:
Some vendors recommend setting the processor affinity on OpenMP threads to associate them with particular processor cores.[18][19][20] This minimizes thread migration and context-switching cost among cores. It also improves the data locality and reduces the cache-coherency traffic among the cores (or processors).
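The gap between the ideal N-times speedup and the measured one can be observed by timing the same loop with omp_get_wtime(), first with one thread and then with one thread per processor. The sketch below assumes a hypothetical memory-bound loop over 5,000,000 doubles; the names timed_sum and x are illustrative. Timings vary by machine, so no expected output is given. Compile it with -fopenmp.

<source lang=c>
#include <assert.h>
#include <stdio.h>
#include <omp.h>

#define N 5000000 /* hypothetical problem size */

static double x[N];

/* Runs one loop with the given thread count; stores the sum in *result
   and returns the elapsed wall-clock time in seconds. */
static double timed_sum(int threads, double *result)
{
    double s = 0.0;
    double start = omp_get_wtime();
    #pragma omp parallel for num_threads(threads) reduction(+:s)
    for (int i = 0; i < N; i++)
        s += x[i] * x[i];
    *result = s;
    return omp_get_wtime() - start;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        x[i] = 1.0;

    int p = omp_get_num_procs();
    double r1, rp;
    double t1 = timed_sum(1, &r1);
    double tp = timed_sum(p, &rp);

    /* Every partial sum is a whole number well below 2^53, so both runs are exact. */
    assert(r1 == (double)N && rp == (double)N);
    printf("%d processors: measured speedup %.2f\n", p, t1 / tp);
    return 0;
}
</source>

Because each iteration does very little arithmetic per element loaded, a loop like this tends to be limited by memory bandwidth rather than by the number of cores.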
There are some public domain OpenMP benchmarks for users to try.