CS-3006 (Lecture 5): Using OpenMP - Shared Memory Programming

The document provides an overview of OpenMP, a standard for parallel programming in shared memory systems, emphasizing its goals, syntax, and execution model. It discusses the use of compiler directives for parallelism, the management of shared and private data, and various constructs for work-sharing and synchronization. Additionally, it highlights the importance of thread management and the critical section problem in concurrent programming.

Shared Memory Parallel Systems - OpenMP

Dr. Muhammad Mateen Yaqoob,

Department of AI & DS,


National University of Computer & Emerging Sciences,
Islamabad Campus
System Architecture
Sequential Program Execution
Parallel computing: Shared Memory Model
OpenMP
Goals
• Standardization
– Provide a standard among a variety of shared
memory architectures (platforms)
• High-level interfaces to thread programming
• Multi-vendor support
• Multi-OS support (Unix, Windows, Mac, etc.)
• The MP in OpenMP is for Multi-Processing
• Don’t confuse OpenMP with Open MPI! :)
Release History
Programming Shared Memory Systems
• Explicit Parallelism
– For example, pthreads

• Programmer Directed
– For example, OpenMP
Shared Memory Programming

• Pthreads
• C++ Threads
• OpenMP
pthreads
C++ threads
OpenMP
Compiling
Intel (icc, ifort, icpc)
-openmp
PGI (pgcc, pgf90, …)
-mp
GNU (gcc, gfortran, g++)
-fopenmp
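
For example, a typical compile-and-run sequence with the GNU toolchain might look like the following (the file and binary names are illustrative, not from the slides):

gcc -fopenmp myprogram.c -o myprogram    # compile with OpenMP support enabled
export OMP_NUM_THREADS=4                 # request 4 threads at run time
./myprogram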
OpenMP - User Interface Model
• Shared Memory with thread based parallelism

• Not a new language

• Compiler directives, library calls, and environment variables extend the base language
  – f77, f90, f95, C, C++

• Not automatic parallelization
  – User explicitly specifies parallelism
  – NOTE: the compiler applies user directives even if they are wrong
OpenMP - Syntax
• Parallelism is highlighted using compiler directives
or pragmas

• For C and C++, the pragmas take the form:


#pragma omp construct [clause [clause]…]

• Any compiler (even if it does not have OpenMP support) can compile the program (with no parallelism though)
Fork*/Join Execution Model
• An OpenMP program starts as a single thread (master thread).
• Additional threads (Team) are created when the master hits a
parallel region.
• When all threads finish the parallel region, the additional threads are returned to the runtime or operating system

*Not to be confused with fork() system call


Using OpenMP
• OpenMP is usually used to parallelize loops:
– Find most time consuming loops
– Split them among threads

Split-up this loop between multiple threads


Sequential program:

void main()
{
    double Res[1000];
    for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
    }
}

Parallel program:

void main()
{
    double Res[1000];
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
    }
}
OpenMP Directives
OpenMP - Directives
• OpenMP compiler directives are used for various
purposes:
– Spawning a parallel region
– Dividing blocks of code among threads
– Distributing loop iterations between threads
–…

sentinel directive-name [clause, ...]

#pragma omp parallel private(var)


Supported Clauses for the Parallel Construct

Valid Clauses:
if (logical expression)
num_threads (integer)
private (list of variables)
firstprivate (list of variables)
shared (list of variables)
default (none|shared|private *fortran only*)
copyin (list of variables)
reduction (operator: list)
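
As a rough sketch (variable names and values are illustrative, not from the slides), several of these clauses can be combined on a single parallel construct:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int n = 1000;
    int sum = 0;
    #pragma omp parallel if(n > 100) num_threads(4) default(none) shared(n) reduction(+: sum)
    {
        sum += omp_get_thread_num();   // each thread adds its ID; the reduction combines the private copies
    }
    printf("n = %d, sum of thread IDs = %d\n", n, sum);
    return 0;
}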

OpenMP Constructs
• OpenMP constructs can be divided into 5
categories:
1. Parallel Regions
2. Work-sharing
3. Data Environment
4. Synchronization
5. Runtime functions/environment variables
OpenMP: Parallel Regions
• You create threads in OpenMP with “omp parallel” pragma
• For example: a 4-thread based Parallel region:
int A[10];
omp_set_num_threads(4);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    fun1(ID, A);
}

Demo: helloFun.c

• Implicit barrier at the end of parallel block


• Each thread calls fun1(ID,A) for ID = 0 to 3
• Each thread executes the same code within the block
Credits: University of Houston
The parallel directive
• A parallel region is a block of code that will be executed by
multiple threads
• When (in a serial program) a PARALLEL directive is encountered, a team of threads is created and the main thread (serial execution thread) becomes the master of the team
• Master thread has id or number 0 (within that team)
• The code is duplicated and all threads will execute that
code
• There is an implicit barrier at the end of a parallel region
• Master thread continues execution after this point
The parallel directive
• Some common clauses include:
– if (expression)
– private (list)
– shared (list)
– num_threads (integer-expression)
How Many Threads?
• The number of threads in a parallel region is determined
by the following factors, in order of precedence:
1. Evaluation of the if clause
2. Setting of the NUM_THREADS clause
3. Use of the omp_set_num_threads( ) library function
4. Setting of the OMP_NUM_THREADS environment
variable
5. Implementation default: Usually the number of CPUs on
a node

• Threads are numbered from 0 (master thread) to N-1


IF clause

• Execute in parallel if expression is true


• Otherwise serial execution
NUM_THREADS clause

#pragma omp parallel if(np > 1) num_threads(np)
{
    . . .
}

• Execute in parallel if expression is true


• Executes using np number of threads
omp_set_num_threads( ) function
#define TOTAL_THREADS 8

int main()
{
    omp_set_num_threads(TOTAL_THREADS);
    #pragma omp parallel
    {
        . . .
    }
    . . .
}

• Execute in parallel using 8 threads


OMP_NUM_THREADS – Environment Variable

$ export OMP_NUM_THREADS=4
$ echo $OMP_NUM_THREADS

• Sets and displays the value of the environment variable OMP_NUM_THREADS
Execution Status in Parallel Region

int omp_in_parallel()

• Returns non-zero: if execution is in parallel region


• Returns zero: if execution in non-parallel region

Demo: PRegion.c
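
A minimal sketch (not the PRegion.c demo itself) of how omp_in_parallel() might be used:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("Before the region, omp_in_parallel() = %d\n", omp_in_parallel());   // prints 0
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            printf("Inside the region, omp_in_parallel() = %d\n", omp_in_parallel());   // non-zero
    }
    return 0;
}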
Shared and Private Data
• Shared data are accessible by all threads
  – A reference a[5] to a shared array accesses the same address in all threads

• Private data are accessible only by a thread
  – Each thread has its own copy

• The default is shared

Shared and Private Data
int main(int argc, char* argv[])
{
int threadData = 10;

// Beginning of parallel region


#pragma omp parallel private(threadData)
{
threadData =200;
}

// Ending of parallel region


printf("Value: %d\n", threadData);
}

Demo: SPData.c
Shared and Private Data
#pragma omp parallel shared(list)

• Default behavior
• List will be shared
• Each thread accesses the same memory location
• The initial value (seen by the first thread) is the same as before the region
• The final value is whatever the last thread to update it writes before leaving the region
• Problem: data races
Shared and Private Data
#pragma omp parallel private(list)

• Data local to each thread
• You should not rely on any initial value (before) or terminal value (after execution of the parallel region)
• Separate "Stack Memory" for each thread's private data
• No storage associated with the original object (even though the data items have the same name)

• Use firstprivate and/or lastprivate clause to override


Shared and Private Data

firstprivate (list)
• Variables in list are private
• Initialized with the value the variable had before entering the construct

lastprivate (list)
• Used in "for" loops
• Variables in list are private
• The value from the thread that executes the final iteration of the loop is copied back to the original variable
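
A small sketch (illustrative variable names, not from the slides) showing both clauses on one loop:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int x = 5;        // firstprivate: every thread starts with its own copy initialized to 5
    int last = -1;    // lastprivate: the value from the last iteration is copied back

    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (int i = 0; i < 8; i++) {
        last = i + x;             // for the sequentially last iteration (i == 7) this is 12
    }

    printf("last = %d\n", last);  // prints 12
    return 0;
}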
Shared and Private Data
#pragma omp parallel default (private) shared(list)
#pragma omp parallel default (shared) private(list)
#pragma omp parallel default (none) private(list) shared(list)

• Alter the default behavior


• To implement customized access behavior
Shared and Private Data – Example (1/4)

Demo: SPDE1.c
Shared and Private Data – Example (2/4)

Demo: SPDE1.c
Shared and Private Data – Example (3/4)

Demo: SPDE1.c
Shared and Private Data – Example (4/4)

Demo: SPDE1.c
Getting ID of Current Thread
int main(int argc, char* argv[])
{
    int iam, nthreads;
    #pragma omp parallel private(iam, nthreads) num_threads(2)
    {
        iam = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        printf("ThreadID %d, out of %d threads\n", iam, nthreads);
        if (iam == 0)
            printf("Here is the Master Thread.\n");
        else
            printf("Here is another thread.\n");
    }
}
Demo: CTID.c
Work-Sharing Constructs
• If all the threads are doing the same thing, what is the advantage then?

• Within each "Team", threads are assigned IDs, with the master thread assigned ID 0
  – omp_get_thread_num() // to get the thread number

• Can we use this to distribute tasks amongst the "Team" members?

• Work-sharing constructs distribute the specified work to all threads within the current team
For Work-Sharing Construct
• for shares iterations of a loop across the team

#pragma omp for [clause ...] newline

• There is an implicit synchronization (barrier) at the end of a #pragma omp for loop
Loop work-sharing visualized
For Work-Sharing Construct
• The SCHEDULE clause describes how iterations of the loop are divided among the threads in the team

schedule(static, chunk): chunks of the specified size are assigned to threads round-robin

schedule(dynamic, chunk): chunks of the specified size are assigned as each thread finishes its previous chunk (work-stealing mechanism)
Do/For Work-Sharing Construct
int main(int argc, char* argv[])
{
    int i, a[10];
    #pragma omp parallel num_threads(2)
    {
        #pragma omp for schedule(static, 2)
        for (i = 0; i < 10; i++)
            a[i] = omp_get_thread_num();
    }

    for (i = 0; i < 10; i++)
        printf("%d", a[i]);
}

Demo: ForConst.c
Do/For Work-Sharing Construct
int main(int argc, char* argv[])
{
    int sum = 0, counter, inputList[6] = {11, 45, 3, 5, 12, -3};
    #pragma omp parallel num_threads(2)
    {
        #pragma omp for schedule(static, 3)
        for (counter = 0; counter < 6; counter++) {
            printf("%d adding %d to the sum\n", omp_get_thread_num(), inputList[counter]);
            sum += inputList[counter];   // unsynchronized update to a shared variable
        } // end of for
    } // end of parallel section

    printf("The summed up Value: %d", sum);
}
Demo: ForConst2.c
For Work-Sharing – Synchronized
For Work-Sharing – Non-Synchronized
Problems with Static Scheduling
• What happens if loop iterations do not take the same
amount of time?
 Load imbalance
Dynamic Scheduling
• Fixed size chunks assigned on the fly
• Work-stealing mechanism

• Disadvantage: more overhead as compared to Static

Demo: LoopSched.c
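
A short sketch (workload and sleep times are made up for illustration) where dynamic scheduling helps with imbalanced iterations:

#include <omp.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        #pragma omp for schedule(dynamic, 1)
        for (int i = 0; i < 8; i++) {
            usleep(i * 10000);    // later iterations take longer: an imbalanced loop
            printf("thread %d ran iteration %d\n", omp_get_thread_num(), i);
        }
    }
    return 0;
}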
Guided Schedule
• Each thread also executes a chunk, and when a thread
finishes a chunk, it requests another one.
• However, in a guided schedule, as chunks are completed
the size of the new chunks decreases.
• If no chunksize is specified, the size of the chunks
decreases down to 1.
• If chunksize is specified, it decreases down to chunksize,
with the exception that the very last chunk can be smaller
than chunksize.

Credits: Copyright © 2010, Elsevier Inc.
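
A usage sketch (thread count, chunk size, and iteration count are illustrative):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    // With guided scheduling, early chunks are large and later chunks shrink,
    // here down to the specified minimum chunksize of 2 iterations.
    #pragma omp parallel for schedule(guided, 2) num_threads(4)
    for (int i = 0; i < 32; i++) {
        printf("iteration %2d on thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}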


Ordered Clause
• Must appear within the context of:
  – omp for
  – omp parallel for

• The code in an ordered region is executed in the same order in which the iterations would execute in a sequential loop

Demo: Guided.c
Credits: Copyright © 2010, Elsevier Inc.
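
A minimal sketch (not the demo file) combining the ordered clause with an ordered region:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel for ordered schedule(dynamic) num_threads(4)
    for (int i = 0; i < 8; i++) {
        int tid = omp_get_thread_num();   // work outside the ordered region runs in parallel
        #pragma omp ordered
        printf("iteration %d (thread %d)\n", i, tid);   // printed in sequential order 0..7
    }
    return 0;
}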
Threads share Global variables!
#include <pthread.h>
#include <iostream>
#include <unistd.h>
using namespace std;
#define NUM_THREADS 10

int sharedData = 0;
void* incrementData(void* arg) {
sharedData++;
pthread_exit(NULL);
}

int main()
{
pthread_t threadID;
for (int counter=0; counter<NUM_THREADS;counter++) {
pthread_create(&threadID, NULL, incrementData, NULL);
}
cout << "ThreadCount:" << sharedData <<endl;
pthread_exit(NULL);
}
The output for the pthread version?

>./globalData
ThreadCount:10

>./globalData
ThreadCount:8
ThreadCount: A better implementation
#include <pthread.h>
#include <iostream>
#include <unistd.h>
using namespace std;
#define NUM_THREADS 100
int sharedData = 0;
void* incrementData(void* arg)
{
sharedData++;
pthread_exit(NULL); }

int main()
{
pthread_t threadID[NUM_THREADS];
for (int counter=0; counter<NUM_THREADS;counter++) {
pthread_create(&threadID[counter], NULL, incrementData, NULL);
}
//waiting for all threads
int statusReturned;
for (int counter=0; counter<NUM_THREADS;counter++) {
pthread_join(threadID[counter], NULL);
}
cout << "ThreadCount:" << sharedData <<endl;
pthread_exit(NULL);
}
Is the problem solved?
• Unfortunately, not yet :(
• The output from running it with 1000 threads is as below:

>./6join
ThreadCount:990
>./6join
ThreadCount:978
>./6join
ThreadCount:1000
>

• Reasons?
• What can be done?
ThreadCount: OpenMP Implementation
int main(int argc, char* argv[])
{
    int threadCount = 0;
    #pragma omp parallel num_threads(100)
    {
        int myLocalCount = threadCount;   // read the shared counter
        sleep(1);                         // widen the race window
        myLocalCount++;                   // increment the local copy
        threadCount = myLocalCount;       // write back: updates from other threads can be lost
    }
    printf("Total Number of Threads: %d\n", threadCount);
}

Demo: TCount1.c
Critical-Section (CS) Problem
 n processes all competing to use some shared data
 Each process has a code segment, called the critical section, in which the shared data is accessed

 Problem (ensure that):
– Two processes are not allowed to execute in their critical sections at the same time
– Access to the critical section must be an atomic action
Critical Section
[Figure: timeline of two threads. Thread A enters and later leaves its critical section; Thread B attempts to enter while A is inside, is blocked, enters only after A leaves, and later leaves itself (times T1-T4).]
Mutual Exclusion: at any given time, only one thread is in the critical section.
…back to threads counting
int sharedData = 0;
pthread_mutex_t mutexIncrement;

void* incrementData(void* arg)
{
    pthread_mutex_lock(&mutexIncrement);
    sharedData++;
    pthread_mutex_unlock(&mutexIncrement);
    pthread_exit(NULL);
}

int main()
{
pthread_mutex_init(&mutexIncrement, NULL);

pthread_t threadID[NUM_THREADS];
for (int counter=0; counter<NUM_THREADS;counter++) {
pthread_create(&threadID[counter], NULL, incrementData, NULL);
}
//waiting for all threads
int statusReturned;
for (int counter=0; counter<NUM_THREADS;counter++) {
pthread_join(threadID[counter], NULL);
}
cout << "ThreadCount:" << sharedData <<endl;
pthread_exit(NULL);
}
OpenMP - Synchronization Constructs
• The CRITICAL directive specifies a region of code that must be executed by only one thread at a time

• If a thread is currently executing inside a CRITICAL region and another thread attempts to execute it, it will block until the first thread exits that CRITICAL region

#pragma omp critical [ name ]



… back to threadCount
int main(int argc, char* argv[])
{
    int threadCount = 0;
    #pragma omp parallel num_threads(5)
    {
        #pragma omp critical
        {
            int myLocalCount = threadCount;
            sleep(1);
            myLocalCount++;
            threadCount = myLocalCount;
        }
    }
    printf("Total Number of Threads: %d\n", threadCount);
}

Demo: TCount2.c
OpenMP - Synchronization Constructs
• The MASTER directive specifies a region that is to
be executed only by the master thread of the
team

• All other threads on the team skip this section of code

#pragma omp master


Demo: MasterOnly.c
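
A minimal sketch (not the MasterOnly.c demo) of the MASTER directive:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        printf("thread %d is in the parallel region\n", omp_get_thread_num());  // all threads run this

        #pragma omp master
        printf("only the master (thread 0) prints this line\n");  // other threads skip it; no implied barrier
    }
    return 0;
}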
OpenMP - Synchronization Constructs
• When a BARRIER directive is reached, a thread will wait
at that point until all other threads have reached that
barrier

• All threads then resume executing, in parallel, the code that follows the barrier

#pragma omp barrier



Barrier Synchronization

[Figure: threads arriving at a barrier and waiting until all are present ("all here?").]
Demo: Barrier.c
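
A minimal sketch (not the Barrier.c demo) of an explicit barrier:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        printf("before the barrier: thread %d\n", omp_get_thread_num());

        #pragma omp barrier   // no thread continues until all four have printed the line above

        printf("after the barrier: thread %d\n", omp_get_thread_num());
    }
    return 0;
}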
Reduction (Data-sharing Attribute Clause)
• The REDUCTION clause performs a reduction operation on
the variables that appear in the list
• A private copy for each list variable is created and initialized
for each thread
• At the end of the region, all private copies are combined using the reduction operator and the final result is written to the shared variable

#pragma omp parallel reduction(operator : list)

operator can be +, -, *, &&, ||, max, min, ...
Reduction (Data-sharing Attribute Clause)
int main(int argc, char* argv[])
{
    srand(time(NULL));
    int winner = 0;
    #pragma omp parallel reduction(max: winner) num_threads(10)
    {
        winner = (rand() % 1000) + omp_get_thread_num();
        printf("Thread: %d has Chosen: %d\n", omp_get_thread_num(), winner);
    }
    printf("Winner: %d\n", winner);
}
Demo: Reduction.c
Practice
• The task construct: each encountering thread creates a task
  – Packages code and data environment
– Can be nested
• Inside parallel regions
• Inside other tasks
• Inside worksharing
• An OpenMP barrier (implicit or explicit):
– All tasks created by any thread of the current team are guaranteed to be completed
at barrier exit.
• Data Scope Clauses
– Shared, private, default, firstprivate, lastprivate
• Task Synchronization
– Barrier, atomic
• Task barrier (taskwait):
– Encountering thread suspends until all child tasks it has generated are complete.
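
As a sketch of how these pieces fit together (this is the standard task-based Fibonacci formulation, not necessarily the code shown on the following slides), tasks plus taskwait can parallelize the recursive computation:

#include <omp.h>
#include <stdio.h>

long fib(int n)
{
    long x, y;
    if (n < 2)
        return n;

    #pragma omp task shared(x)    // child task computes fib(n-1); x must be shared so the parent sees it
    x = fib(n - 1);

    #pragma omp task shared(y)    // child task computes fib(n-2)
    y = fib(n - 2);

    #pragma omp taskwait          // suspend until both child tasks are complete
    return x + y;
}

int main(void)
{
    long result = 0;

    #pragma omp parallel
    {
        #pragma omp single        // one thread builds the task tree; the whole team executes the tasks
        result = fib(10);
    }

    printf("fib(10) = %ld\n", result);
    return 0;
}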
Serial PI program
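
The slide's code did not survive extraction; a common serial formulation of this exercise (numerical integration of 4/(1+x^2) over [0,1]; the step count is illustrative and not necessarily the slide's) looks like:

#include <stdio.h>

#define NUM_STEPS 100000

int main(void)
{
    double step = 1.0 / (double) NUM_STEPS;
    double sum = 0.0;

    for (int i = 0; i < NUM_STEPS; i++) {
        double x = (i + 0.5) * step;    // midpoint of the i-th interval
        sum += 4.0 / (1.0 + x * x);     // integrand value at the midpoint
    }

    double pi = step * sum;             // midpoint-rule approximation of pi
    printf("pi = %.10f\n", pi);
    return 0;
}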
Practice
Serial Computation of Fibonacci
Any Questions?
