CS-3006 Lecture 5: Using OpenMP (Shared Memory Programming)
OpenMP
• Programmer-directed parallelization
– For example, OpenMP
Shared Memory Programming
• Pthreads
• C++ Threads
• OpenMP
Compiling
• Intel (icc, ifort, icpc): -openmp
• PGI (pgcc, pgf90, …): -mp
• GNU (gcc, gfortran, g++): -fopenmp
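For example, compiling and running an OpenMP program with the GNU compiler:
$ gcc -fopenmp hello.c -o hello
$ ./hello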
OpenMP - User Interface Model
• Shared memory with thread-based parallelism
Valid clauses for the parallel directive:
if (logical expression)
num_threads (integer)
private (list of variables)
firstprivate (list of variables)
shared (list of variables)
default (none|shared|private *fortran only*)
copyin (list of variables)
reduction (operator: list)
…
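A sketch combining several of these clauses (variable names are illustrative, not from a demo):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int n = 1000;
    int a[1000];
    int i;
    // parallel only if n > 100; 4 threads requested; because of
    // default(none), every variable's sharing must be stated explicitly
    #pragma omp parallel num_threads(4) if(n > 100) default(none) shared(n, a) private(i)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            a[i] = i * i;
    }
    printf("a[10] = %d\n", a[10]);
    return 0;
}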
OpenMP Constructs
• OpenMP constructs can be divided into 5
categories:
1. Parallel Regions
2. Work-sharing
3. Data Environment
4. Synchronization
5. Runtime functions/environment variables
OpenMP: Parallel Regions
• You create threads in OpenMP with the “omp parallel” pragma
• For example, a parallel region executed by 4 threads:
int A[10];
omp_set_num_threads(4);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    fun1(ID, A);
}
Demo: helloFun.c
$ export OMP_NUM_THREADS=4
$ echo $OMP_NUM_THREADS
int omp_in_parallel() – returns non-zero if called from within an active parallel region
Demo: PRegion.c
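A minimal, self-contained parallel region (a sketch in the spirit of the helloFun.c and PRegion.c demos, whose exact code is not reproduced here):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);            // request 4 threads for the next region
    #pragma omp parallel
    {
        // each thread of the team executes this block once
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}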
Shared and Private Data
• Shared data are accessible by all threads; private data are accessible only by the thread that owns them
Demo: SPData.c
Shared and Private Data
#pragma omp parallel shared(list)
• Shared is the default behavior
• The variables in list will be shared
• Every thread accesses the same memory location
• The initial value is the value the variable had before entering the region
• The final value is whatever the last thread to update the variable wrote
• Problem: data races
Shared and Private Data – Example (1/4 to 4/4)
Demo: SPDE1.c
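A minimal sketch of the shared-vs-private distinction (illustrative only; SPDE1.c itself is not reproduced here):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int s = 10, p = 10;
    #pragma omp parallel num_threads(4) shared(s) private(p)
    {
        // p is a fresh, uninitialized copy in each thread;
        // s is one shared variable, so these writes race
        p = omp_get_thread_num();
        s = omp_get_thread_num();
        printf("thread %d: p=%d s=%d\n", omp_get_thread_num(), p, s);
    }
    // s holds whichever thread wrote last; the original p is untouched
    printf("after region: s=%d, p=%d\n", s, p);
    return 0;
}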
Getting ID of Current Thread
#include <stdio.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int iam, nthreads;
    #pragma omp parallel private(iam, nthreads) num_threads(2)
    {
        iam = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        printf("ThreadID %d, out of %d threads\n", iam, nthreads);
        if (iam == 0)
            printf("Here is the Master Thread.\n");
        else
            printf("Here is another thread.\n");
    }
    return 0;
}
Demo: CTID.c
Work-Sharing Constructs
• If all the threads are doing the same thing, what is the advantage? Work-sharing constructs divide the work (e.g., loop iterations) among the threads.
Demo: ForConst.c
Do/For Work-Sharing Construct
#include <stdio.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int sum = 0, counter, inputList[6] = {11, 45, 3, 5, 12, -3};
    #pragma omp parallel num_threads(2)
    {
        // iterations are split between the 2 threads in chunks of 3;
        // the loop variable is made private automatically
        #pragma omp for schedule(static, 3)
        for (counter = 0; counter < 6; counter++) {
            printf("%d adding %d to the sum\n",
                   omp_get_thread_num(), inputList[counter]);
            sum += inputList[counter];   // unsynchronized update: data race
        } // end of for
    } // end of parallel section
    return 0;
}
Demo: LoopSched.c
Guided Schedule
• Each thread executes a chunk, and when a thread finishes its chunk, it requests another one.
• However, in a guided schedule, as chunks are completed
the size of the new chunks decreases.
• If no chunksize is specified, the size of the chunks
decreases down to 1.
• If chunksize is specified, it decreases down to chunksize,
with the exception that the very last chunk can be smaller
than chunksize.
Demo: Guided.c
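A sketch of a guided schedule in action (illustrative; Guided.c itself is not reproduced here):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int owner[32];
    #pragma omp parallel num_threads(4)
    {
        // no chunksize given, so chunk sizes shrink down to 1
        #pragma omp for schedule(guided)
        for (int i = 0; i < 32; i++)
            owner[i] = omp_get_thread_num();   // record who ran iteration i
    }
    for (int i = 0; i < 32; i++)
        printf("iteration %2d -> thread %d\n", i, owner[i]);
    return 0;
}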
Credits: Copyright © 2010, Elsevier Inc.
Threads share Global variables!
#include <pthread.h>
#include <iostream>
#include <unistd.h>
using namespace std;
#define NUM_THREADS 10

int sharedData = 0;

void* incrementData(void* arg) {
    sharedData++;                  // unsynchronized update to a global
    pthread_exit(NULL);
}

int main()
{
    pthread_t threadID;            // only the last thread's ID is kept
    for (int counter = 0; counter < NUM_THREADS; counter++) {
        pthread_create(&threadID, NULL, incrementData, NULL);
    }
    // no pthread_join: main may print before all threads have run
    cout << "ThreadCount:" << sharedData << endl;
    pthread_exit(NULL);            // keeps the process alive for the remaining threads
}
The output for the pthread version?
>./globalData
ThreadCount:10
>./globalData
ThreadCount:8
ThreadCount: A better implementation
#include <pthread.h>
#include <iostream>
#include <unistd.h>
using namespace std;
#define NUM_THREADS 100

int sharedData = 0;

void* incrementData(void* arg)
{
    sharedData++;                  // still an unsynchronized read-modify-write
    pthread_exit(NULL);
}

int main()
{
    pthread_t threadID[NUM_THREADS];   // one handle per thread
    for (int counter = 0; counter < NUM_THREADS; counter++) {
        pthread_create(&threadID[counter], NULL, incrementData, NULL);
    }
    // waiting for all threads before reading the counter
    for (int counter = 0; counter < NUM_THREADS; counter++) {
        pthread_join(threadID[counter], NULL);
    }
    cout << "ThreadCount:" << sharedData << endl;
    pthread_exit(NULL);
}
Is the problem solved?
• Unfortunately, not yet :(
• The output from running it with 1000 threads is as below:
>./6join
ThreadCount:990
>./6join
ThreadCount:978
>./6join
ThreadCount:1000
>
• Reasons?
• What can be done?
ThreadCount: OpenMP Implementation
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int threadCount = 0;
    #pragma omp parallel num_threads(100)
    {
        int myLocalCount = threadCount;   // unsynchronized read of the shared counter
        threadCount++;                    // racy read-modify-write
        sleep(1);                         // widen the window for interleaving
        myLocalCount++;
        threadCount = myLocalCount;       // racy write-back: updates get lost
    }
    printf("Total Number of Threads: %d\n", threadCount);
    return 0;
}
Demo: TCount1.c
Critical-Section (CS) Problem
• n processes all competing to use some shared data
• Each process has a code segment, called the critical section, in which the shared data is accessed
[Timeline figure (T1–T4): while Thread A is in the critical section, Thread B attempts to enter and must wait; B enters only after A leaves, and later leaves the critical section itself.]
Mutual Exclusion
• At any given time, only one thread is inside the critical section
…back to counting threads
#include <pthread.h>
#include <iostream>
using namespace std;
#define NUM_THREADS 1000

int sharedData = 0;
pthread_mutex_t mutexIncrement;

void* incrementData(void* arg)
{
    pthread_mutex_lock(&mutexIncrement);    // enter the critical section
    sharedData++;
    pthread_mutex_unlock(&mutexIncrement);  // leave the critical section
    pthread_exit(NULL);
}

int main()
{
    pthread_mutex_init(&mutexIncrement, NULL);
    pthread_t threadID[NUM_THREADS];
    for (int counter = 0; counter < NUM_THREADS; counter++) {
        pthread_create(&threadID[counter], NULL, incrementData, NULL);
    }
    // waiting for all threads
    for (int counter = 0; counter < NUM_THREADS; counter++) {
        pthread_join(threadID[counter], NULL);
    }
    cout << "ThreadCount:" << sharedData << endl;
    pthread_exit(NULL);
}
OpenMP - Synchronization Constructs
• The CRITICAL directive specifies a region of code that
must be executed by only one thread at a time
Demo: TCount2.c
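A sketch of the thread-counting example protected with CRITICAL (the actual TCount2.c may differ):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int threadCount = 0;
    #pragma omp parallel num_threads(100)
    {
        // only one thread at a time executes the critical region,
        // so the increment is no longer a lost-update race
        #pragma omp critical
        threadCount++;
    }
    printf("Total Number of Threads: %d\n", threadCount);
    return 0;
}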
OpenMP - Synchronization Constructs
• The MASTER directive specifies a region that is to
be executed only by the master thread of the
team
Demo: MasterOnly.c
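A minimal sketch (illustrative; MasterOnly.c may differ):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        printf("thread %d working\n", omp_get_thread_num());
        // executed by the master thread only; note that MASTER
        // implies no barrier on entry or exit
        #pragma omp master
        printf("only the master (thread 0) prints this\n");
    }
    return 0;
}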
OpenMP - Synchronization Constructs
• When a BARRIER directive is reached, a thread will wait
at that point until all other threads have reached that
barrier
Demo: Barrier.c
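A minimal sketch (illustrative; Barrier.c may differ):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        printf("thread %d finished phase one\n", omp_get_thread_num());
        // no thread starts phase two until all threads reach this point
        #pragma omp barrier
        printf("thread %d starting phase two\n", omp_get_thread_num());
    }
    return 0;
}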
Reduction (Data-sharing Attribute Clause)
• The REDUCTION clause performs a reduction operation on the variables that appear in its list
• A private copy of each list variable is created and initialized for each thread
• At the end of the region, the private copies are combined with the reduction operator and the final result is written to the shared variable
Demo: Reduction.c
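A sketch using the earlier inputList (illustrative; Reduction.c may differ):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int inputList[6] = {11, 45, 3, 5, 12, -3};
    int sum = 0;
    // each thread accumulates into its own private copy of sum;
    // the copies are combined with + at the end of the region
    #pragma omp parallel for reduction(+:sum) num_threads(2)
    for (int i = 0; i < 6; i++)
        sum += inputList[i];
    printf("sum = %d\n", sum);
    return 0;
}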
OpenMP Tasks
• Each encountering thread creates a task
– Package code and data environment
– Can be nested
• Inside parallel regions
• Inside other tasks
• Inside worksharing
• An OpenMP barrier (implicit or explicit):
– All tasks created by any thread of the current team are guaranteed to be completed
at barrier exit.
• Data Scope Clauses
– Shared, private, default, firstprivate, lastprivate
• Task Synchronization
– Barrier, atomic
• Task barrier (taskwait):
– Encountering thread suspends until all child tasks it has generated are complete.
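A minimal sketch of the task constructs listed above (illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single   // one thread creates the tasks; any thread may run them
        {
            #pragma omp task
            printf("task A run by thread %d\n", omp_get_thread_num());
            #pragma omp task
            printf("task B run by thread %d\n", omp_get_thread_num());
            #pragma omp taskwait   // suspend until both child tasks complete
            printf("both tasks done\n");
        }
    }
    return 0;
}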
Practice: Serial PI Program
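The deck does not reproduce the code; this is the widely used serial version of the exercise (midpoint-rule integration of 4/(1+x^2) over [0,1], which equals pi):

#include <stdio.h>

int main(void)
{
    const long num_steps = 100000000;            // number of rectangles
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;             // midpoint of rectangle i
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi ~= %.10f\n", step * sum);
    return 0;
}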
Practice: Serial Computation of Fibonacci
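A sketch of the usual task-based parallelization (an assumption about the intended solution; the serial version simply computes fib(n-1) + fib(n-2) recursively):

#include <stdio.h>
#include <omp.h>

long fib(int n)
{
    long x, y;
    if (n < 2) return n;
    #pragma omp task shared(x)   // child task computes fib(n-1)
    x = fib(n - 1);
    #pragma omp task shared(y)   // child task computes fib(n-2)
    y = fib(n - 2);
    #pragma omp taskwait         // wait for both children before combining
    return x + y;
}

int main(void)
{
    long result;
    #pragma omp parallel
    {
        #pragma omp single       // one thread creates the root task tree
        result = fib(20);
    }
    printf("fib(20) = %ld\n", result);
    return 0;
}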
Any Questions?