
Programming OpenMP

Christian Terboven
Michael Klemm

Agenda

Programming OpenMP
An Overview Of OpenMP

Christian Terboven
Michael Klemm

History

• De-facto standard for Shared-Memory Parallelization: http://www.OpenMP.org

• 1997: OpenMP 1.0 for FORTRAN
• 1998: OpenMP 1.0 for C and C++
• 1999: OpenMP 1.1 for FORTRAN
• 2000: OpenMP 2.0 for FORTRAN
• 2002: OpenMP 2.0 for C and C++
• 2005: OpenMP 2.5 now includes both programming languages
• 05/2008: OpenMP 3.0
• 07/2011: OpenMP 3.1
• 07/2013: OpenMP 4.0
• 11/2015: OpenMP 4.5
• 11/2018: OpenMP 5.0
• 11/2020: OpenMP 5.1

What is OpenMP?

• Parallel Region & Worksharing

• Tasking

• SIMD / Vectorization

• Accelerator Programming

• …

Get your C/C++ and Fortran Reference Guide!
Covers all of OpenMP 5.0!

Recent Books About OpenMP

• A printed copy of the 5.0 specification, 2019
• A book that covers all of the OpenMP 4.5 features, 2017
• A new book about the OpenMP Common Core, 2019

Programming OpenMP
Parallel Region

Christian Terboven
Michael Klemm

OpenMP's machine model

• OpenMP: Shared-Memory Parallel Programming Model.

– All processors/cores access a shared main memory.
– Real architectures are more complex, as we will see later / as we have seen.
– Parallelization in OpenMP employs multiple threads.

[Figure: four processors (Proc), each with its own cache, connected via a crossbar/bus to the shared main memory]

The OpenMP Memory Model

• All threads have access to the same, globally shared memory.

• Data in private memory is only accessible by the thread owning this memory.

• No other thread sees the change(s) in private memory.

• Data transfer is through shared memory and is 100% transparent to the application.

[Figure: threads (T) on processing units (PU), each with a private memory, attached to the shared memory; an accelerator with its own threads and private memories is attached as well]

The OpenMP Execution Model

• OpenMP programs start with just one thread: the Master.

• Worker threads are spawned at Parallel Regions; together with the Master they form the Team of threads.

• In between Parallel Regions the Worker threads are put to sleep. The OpenMP Runtime takes care of all thread management work.

• Concept: Fork-Join.
• Allows for an incremental parallelization!

[Figure: alternating Serial Parts (Master thread only) and Parallel Regions (Master plus Worker threads)]

Parallel Region and Structured Blocks

• The parallelism has to be expressed explicitly.

C/C++
#pragma omp parallel
{
   ... structured block ...
}

Fortran
!$omp parallel
   ... structured block ...
!$omp end parallel

• Structured Block
– Exactly one entry point at the top
– Exactly one exit point at the bottom
– Branching in or out is not allowed
– Terminating the program is allowed (abort / exit)

◼ Specification of the number of threads:
– Environment variable: OMP_NUM_THREADS=…
– Or via the num_threads clause: add num_threads(num) to the parallel construct

Starting OpenMP Programs on Linux

• From within a shell, global setting of the number of threads:


export OMP_NUM_THREADS=4
./program

• From within a shell, one-time setting of the number of threads:


OMP_NUM_THREADS=4 ./program

Demo

Hello OpenMP World
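The demo source itself is not part of this handout; a minimal sketch of what a Hello OpenMP World program can look like (file and variable names are ours):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        // every thread of the team executes this structured block
        printf("Hello OpenMP World from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Compile and run, e.g., with gcc -fopenmp hello.c -o hello and OMP_NUM_THREADS=4 ./hello.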

Programming OpenMP
Worksharing

Christian Terboven
Michael Klemm

For Worksharing

• If only the parallel construct is used, each thread executes the Structured Block.
• Program Speedup: Worksharing
• OpenMP's most common Worksharing construct: for

C/C++
int i;
#pragma omp for
for (i = 0; i < 100; i++)
{
   a[i] = b[i] + c[i];
}

Fortran
INTEGER :: i
!$omp do
DO i = 0, 99
   a(i) = b(i) + c(i)
END DO

– Distribution of loop iterations over all threads in a Team.
– Scheduling of the distribution can be influenced.

• Loops often account for most of a program's runtime!

Worksharing illustrated

Pseudo-Code (here: 4 threads)

Serial:
do i = 0, 99
   a(i) = b(i) + c(i)
end do

Thread 1:
do i = 0, 24
   a(i) = b(i) + c(i)
end do

Thread 2:
do i = 25, 49
   a(i) = b(i) + c(i)
end do

Thread 3:
do i = 50, 74
   a(i) = b(i) + c(i)
end do

Thread 4:
do i = 75, 99
   a(i) = b(i) + c(i)
end do

[Figure: the shared arrays A(0)…A(99), B(0)…B(99), C(0)…C(99) in memory]

The Barrier Construct

• OpenMP barrier (implicit or explicit)
– Threads wait until all threads of the current Team have reached the barrier
C/C++
#pragma omp barrier

• All worksharing constructs contain an implicit barrier at the end
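As a sketch of how an explicit barrier separates two phases within one parallel region (names are ours; assumes at most 64 threads so each thread has a slot):

#include <stdio.h>
#include <omp.h>

static double work(int tid) { return 2.0 * tid; }   // stand-in for real per-thread work

int main(void)
{
    double a[64], b[64];
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        a[tid] = work(tid);              // phase 1: every thread writes its own slot
        #pragma omp barrier              // wait until all slots are written
        b[tid] = a[(tid + 1) % nth];     // phase 2: safely read a neighbor's slot
        printf("thread %d read %f\n", tid, b[tid]);
    }
    return 0;
}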

The Single Construct
C/C++
#pragma omp single [clause]
... structured block ...

Fortran
!$omp single [clause]
... structured block ...
!$omp end single

• The single construct specifies that the enclosed structured block is executed by only one thread of the team.
– It is up to the runtime which thread that is.

• Useful for:
– I/O
– Memory allocation and deallocation, etc. (in general: setup work)
– Implementation of the single-creator parallel-executor pattern as we will see later…
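A sketch of the setup-work use case (allocation size and names are ours):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double *data = NULL;
    int n = 1000;

    #pragma omp parallel shared(data)
    {
        #pragma omp single
        {
            data = malloc(n * sizeof(double));       // one thread does the setup
            for (int i = 0; i < n; i++) data[i] = i;
        }   // implicit barrier: all threads wait here for the setup to finish

        #pragma omp for
        for (int i = 0; i < n; i++)
            data[i] *= 2.0;                          // then the whole team works on it
    }
    printf("data[10] = %f\n", data[10]);
    free(data);
    return 0;
}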

The Master Construct
C/C++
#pragma omp master [clause]
... structured block ...

Fortran
!$omp master [clause]
... structured block ...
!$omp end master

• The master construct specifies that the enclosed structured block is executed only by the master thread of a team.

• Note: The master construct is not a worksharing construct and does not contain an implicit barrier at the end.
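Because of the missing implicit barrier, other threads must not rely on the master's work without explicit synchronization; a minimal sketch (the flag is ours):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int flag = 0;   // shared

    #pragma omp parallel
    {
        #pragma omp master
        {
            flag = 1;            // only the master thread (thread 0) runs this
        }
        #pragma omp barrier      // no implicit barrier after master, so add one
        printf("thread %d sees flag = %d\n", omp_get_thread_num(), flag);
    }
    return 0;
}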

Demo

Vector Addition
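The demo source is not reproduced here; a minimal sketch of a parallel vector addition (sizes and names are ours):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1000000;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) { b[i] = i; c[i] = 2.0 * i; }

    // parallel region and for worksharing combined in a single directive
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    printf("a[n-1] = %f\n", a[n - 1]);
    free(a); free(b); free(c);
    return 0;
}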

Influencing the For Loop Scheduling / 1

• for-construct: OpenMP allows you to influence how the iterations are scheduled among the threads of the team, via the schedule clause:

– schedule(static [, chunk]): Iteration space divided into blocks of chunk size, blocks are assigned to
threads in a round-robin fashion. If chunk is not specified: #threads blocks.

– schedule(dynamic [, chunk]): Iteration space divided into blocks of chunk (not specified: 1) size,
blocks are scheduled to threads in the order in which threads finish previous blocks.

– schedule(guided [, chunk]): Similar to dynamic, but block size starts with implementation-defined
value, then is decreased exponentially down to chunk.

• Default is schedule(static).
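A sketch contrasting two of the clauses (the loop body is purely illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int n = 16;

    // chunks of 4 iterations, assigned round-robin up front
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < n; i++)
        printf("static:  thread %d runs i = %d\n", omp_get_thread_num(), i);

    // chunks of 2 handed to whichever thread asks next
    #pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < n; i++)
        printf("dynamic: thread %d runs i = %d\n", omp_get_thread_num(), i);

    return 0;
}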

Influencing the For Loop Scheduling / 2

◼ Static Schedule
→ schedule(static [, chunk])
→ Decomposition depending on chunksize
→ Equal parts of size 'chunksize' distributed in round-robin fashion

◼ Pros?
→ No/low runtime overhead

◼ Cons?
→ No dynamic workload balancing

Influencing the For Loop Scheduling / 3

• Dynamic schedule
– schedule(dynamic [, chunk])
– Iteration space divided into blocks of chunk size
– Threads request a new block after finishing the previous one
– Default chunk size is 1
• Pros?
– Workload distribution
• Cons?
– Runtime Overhead
– Chunk size essential for performance
– No NUMA optimizations possible

Synchronization Overview

• Can all loops be parallelized with for-constructs? No!
– Simple test: If the results differ when the code is executed backwards, the loop iterations are not independent. BUT: This test alone is not sufficient:

C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
   s = s + a[i];
}

• Data Race: If, between two synchronization points, at least one thread writes to a memory location from which at least one other thread reads, the result is not deterministic (race condition).

Synchronization: Critical Region

• A Critical Region is executed by all threads, but by only one thread simultaneously (Mutual Exclusion).

C/C++
#pragma omp critical (name)
{
... structured block ...
}

• Do you think this solution scales well?


C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
#pragma omp critical
{ s = s + a[i]; }
}

Programming OpenMP
Scoping

Christian Terboven
Michael Klemm

Scoping Rules

• Managing the Data Environment is the challenge of OpenMP.

• Scoping in OpenMP: dividing variables into shared and private:
– private-list and shared-list on Parallel Region
– private-list and shared-list on Worksharing constructs
– General default is shared for Parallel Region, firstprivate for Tasks (Tasks are introduced later).
– Loop control variables on for-constructs are private.
– Non-static variables local to Parallel Regions are private.
– private: A new uninitialized instance is created for the task or for each thread executing the construct
• firstprivate: Initialization with the value before encountering the construct
• lastprivate: Value of the last loop iteration is written back to the Master
– Static variables are shared.
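A short sketch of the firstprivate/lastprivate semantics (variable names and values are ours):

#include <stdio.h>

int main(void)
{
    int init = 42, last = -1;

    // every thread gets a copy of init (initialized) and last (written back)
    #pragma omp parallel for firstprivate(init) lastprivate(last)
    for (int i = 0; i < 8; i++) {
        last = i + init;   // operates on private copies, no race
    }

    printf("last = %d\n", last);   // value of the sequentially last iteration: 7 + 42 = 49
    return 0;
}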

Privatization of Global/Static Variables

• Global / static variables can be privatized with the threadprivate directive
– One instance is created for each thread
• Before the first parallel region is encountered
• Instance exists until the program ends
• Does not work (well) with nested Parallel Regions
– Based on thread-local storage (TLS)
• TlsAlloc (Win32 threads), pthread_key_create (POSIX threads), keyword __thread (GNU extension)

C/C++
static int i;
#pragma omp threadprivate(i)

Fortran
INTEGER, SAVE :: i
!$omp threadprivate(i)
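A usage sketch (the counter is ours; its values persist between regions as long as the thread count does not change):

#include <stdio.h>
#include <omp.h>

static int counter = 0;
#pragma omp threadprivate(counter)

int main(void)
{
    #pragma omp parallel
    { counter = omp_get_thread_num(); }   // each thread writes its own instance

    #pragma omp parallel
    { printf("thread %d still sees counter = %d\n",
             omp_get_thread_num(), counter); }
    return 0;
}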

Back to our example

C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
#pragma omp critical
{ s = s + a[i]; }
}

5 OpenMP Tutorial
Members of the OpenMP Language Committee
It's your turn: Make It Scale!

#pragma omp parallel
{
   #pragma omp for
   for (i = 0; i < 100; i++)
   {
      s = s + a[i];
   }
} // end parallel

Intended decomposition (here: 4 threads):

Serial:
do i = 0, 99
   s = s + a(i)
end do

Thread 1:
do i = 0, 24
   s = s + a(i)
end do

Thread 2:
do i = 25, 49
   s = s + a(i)
end do

Thread 3:
do i = 50, 74
   s = s + a(i)
end do

Thread 4:
do i = 75, 99
   s = s + a(i)
end do

Note that all threads update the shared variable s: this is the data race you are asked to fix.

(done)

#pragma omp parallel
{
   double ps = 0.0; // private variable

   #pragma omp for
   for (i = 0; i < 100; i++)
   {
      ps = ps + a[i];
   }

   #pragma omp critical
   {
      s += ps;
   }
} // end parallel

Intended decomposition (here: 4 threads, each with a private partial sum):

Thread 1:
do i = 0, 24
   s1 = s1 + a(i)
end do
s = s + s1

Thread 2:
do i = 25, 49
   s2 = s2 + a(i)
end do
s = s + s2

Thread 3:
do i = 50, 74
   s3 = s3 + a(i)
end do
s = s + s3

Thread 4:
do i = 75, 99
   s4 = s4 + a(i)
end do
s = s + s4

The Reduction Clause

• In a reduction operation the operator is applied to all variables in the list. The variables have to be shared.
– reduction(operator:list)
– The result is provided in the associated reduction variable

C/C++
int i, s = 0;
#pragma omp parallel for reduction(+:s)
for (i = 0; i < 100; i++)
{
   s = s + a[i];
}

– Possible reduction operators with initialization value:
+ (0), * (1), - (0), & (~0), | (0), && (1), || (0), ^ (0),
min (largest number), max (least number)
– Remark: OpenMP also supports user-defined reductions (not covered in detail here; see the sketch below)
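A minimal sketch of a user-defined reduction (the struct type, the identifier cadd, and the values are ours; requires OpenMP 4.0 or later):

#include <stdio.h>

typedef struct { double re, im; } cplx;

// combiner adds the incoming private copy (omp_in) into omp_out;
// the initializer gives each private copy a neutral starting value
#pragma omp declare reduction(cadd : cplx : \
        omp_out.re += omp_in.re, omp_out.im += omp_in.im) \
        initializer(omp_priv = (cplx){0.0, 0.0})

int main(void)
{
    cplx s = {0.0, 0.0};

    #pragma omp parallel for reduction(cadd:s)
    for (int i = 0; i < 100; i++) {
        s.re += i;         // accumulates into the thread-private copy
        s.im += 0.5 * i;
    }

    printf("s = (%.1f, %.1f)\n", s.re, s.im);
    return 0;
}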

Example

PI

Example: Pi (1/2)

Computes: \pi = \int_0^1 \frac{4}{1+x^2} \, dx

double f(double x)
{
   return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
   const double fH = 1.0 / (double) n;
   double fSum = 0.0;
   double fX;
   int i;

   #pragma omp parallel for
   for (i = 0; i < n; i++)
   {
      fX = fH * ((double)i + 0.5);
      fSum += f(fX);
   }
   return fH * fSum;
}

Note: as written, fX and fSum are shared, so this version contains data races; the next slide fixes this.

Example: Pi (2/2)

Computes: \pi = \int_0^1 \frac{4}{1+x^2} \, dx

double f(double x)
{
   return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
   const double fH = 1.0 / (double) n;
   double fSum = 0.0;
   double fX;
   int i;

   #pragma omp parallel for private(fX,i) reduction(+:fSum)
   for (i = 0; i < n; i++)
   {
      fX = fH * ((double)i + 0.5);
      fSum += f(fX);
   }
   return fH * fSum;
}

Programming OpenMP
OpenMP Tasking Introduction

Christian Terboven
Michael Klemm

What is a Task in OpenMP?
◼ Tasks are work units whose execution
→ may be deferred or …
→ … can be executed immediately

◼ Tasks are composed of
→ code to execute, a data environment (initialized at creation time), internal control variables (ICVs)

◼ Tasks are created …
→ when reaching a parallel region: implicit tasks are created (per thread)
→ when encountering a task construct: an explicit task is created
→ when encountering a taskloop construct: explicit tasks per chunk are created (see the sketch below)
→ when encountering a target construct: a target task is created
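A minimal taskloop sketch (array, size, and grainsize are ours; requires OpenMP 4.5 or later):

#include <stdio.h>

int main(void)
{
    double a[1024];
    for (int i = 0; i < 1024; i++) a[i] = i;

    #pragma omp parallel
    #pragma omp single
    {
        // the loop is chopped into tasks of roughly 64 iterations each;
        // any thread of the team may execute them
        #pragma omp taskloop grainsize(64)
        for (int i = 0; i < 1024; i++)
            a[i] = 2.0 * a[i];
    }

    printf("a[10] = %f\n", a[10]);
    return 0;
}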

Tasking Execution Model
◼ Supports unstructured parallelism
→ unbounded loops:

while ( <expr> ) {
   ...
}

→ recursive functions:

void myfunc( <args> )
{
   ...; myfunc( <newargs> ); ...;
}

◼ Example (unstructured parallelism):

#pragma omp parallel
#pragma omp master
while (elem != NULL) {
   #pragma omp task
   compute(elem);
   elem = elem->next;
}

◼ Several scenarios are possible:
→ single creator, multiple creators, nested tasks (tasks & WS)

◼ All threads in the team are candidates to execute tasks

[Figure: the Parallel Team feeding a Task pool]

OpenMP Tasking Idiom
◼ OpenMP programmers need a specific idiom to kick off task-parallel execution: parallel master
→ OpenMP version 5.0 introduced the parallel master construct
→ With OpenMP version 5.1 this becomes parallel masked

Variant with master:

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp master
      {
         start_task_parallel_execution();
      }
   }
   [...]
}

Variant with single:

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         start_task_parallel_execution();
      }
   }
   [...]
}

Fibonacci Numbers (in a Stupid Way ☺)
int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp master
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   int x, y;
   #pragma omp task shared(x)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x + y;
}

◼ Only one thread enters fib() from main().
◼ That thread creates the two initial work tasks and starts the parallel recursion.
◼ The taskwait construct is required to wait for the results of x and y before the task can sum them up.
◼ T1 enters fib(4)
◼ T1 creates tasks for fib(3) and fib(2)
◼ T1 and T2 execute tasks from the queue
◼ T1 and T2 create 4 new tasks
◼ T1 – T4 execute tasks

[Figure: task tree of fib(4) with children fib(3) and fib(2); their children fib(2), fib(1) and fib(1), fib(0); the Task Queue holds the newly created tasks]

◼ T1 enters fib(4)
◼ T1 creates tasks for fib(3) and fib(2)
◼ T1 and T2 execute tasks from the queue
◼ T1 and T2 create 4 new tasks
◼ T1 – T4 execute tasks
◼ …

[Figure: the fully expanded task tree of fib(4), down to the fib(1) and fib(0) leaves]

Programming OpenMP
Using OpenMP Compilers

Christian Terboven
Michael Klemm

Production Compilers w/ OpenMP Support
◼ GCC
◼ clang/LLVM
◼ Intel Classic and Next-gen Compilers
◼ AOCC, AOMP, ROCmCC
◼ IBM XL
◼ … and many more

◼ See https://www.openmp.org/resources/openmp-compilers-tools/ for a list

Compiling OpenMP
◼ Enable OpenMP via the compiler’s command-line switches
→ GCC: -fopenmp

→ clang: -fopenmp

→ Intel: -fopenmp or -qopenmp (classic) or -fiopenmp (next-gen)

→ AOCC, AOCL, ROCmCC: -fopenmp

→ IBM XL: -qsmp=omp


◼ Switches have to be passed to both compiler and linker:
$ gcc [...] -fopenmp -o matmul.o -c matmul.c
$ gcc [...] -fopenmp -o matmul matmul.o
$ ./matmul 1024
Sum of matrix (serial): 134217728.000000, wall time 0.413975, speed-up 1.00
Sum of matrix (parallel): 134217728.000000, wall time 0.092162, speed-up 4.49

Programming OpenMP
Hands-on Exercises

Christian Terboven
Michael Klemm

Webinar Exercises
◼ We have implemented a series of small hands-on examples that you can use and play with.
→ Download: git clone https://github.com/cterboven/OpenMP-tutorial-PRACE.git
→ Build: make
→ You can then find the compiled code in the "bin" folder to run it
→ We use the GCC compiler mostly; some examples require Intel's Math Kernel Library

◼ Each hands-on exercise has a folder "solution"
→ It shows the OpenMP directive that we have added
→ You can use it to cheat ☺, or to check if you came up with the same solution

