
Parallel and Distributed Computing

CS3006

Lecture 10
OpenMP-III
6th April 2022

Dr. Rana Asif Rehman



Review of OpenMP Clause List

- private
- firstprivate, lastprivate
- shared
- default
  - private, shared, none
- reduction
- if clause
- schedule
  - static, dynamic, guided, runtime
- nowait



Synchronization in OpenMP

Barrier Directive

- On encountering this directive, all threads in a team wait until the others have caught up, and then they are released.

#pragma omp barrier
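
A minimal two-phase sketch (the results array, assumed large enough for the team, and the helper functions produce and consume are assumptions for illustration): each thread writes its own slot, and the barrier guarantees every slot is filled before any thread reads a neighbour's slot.

#pragma omp parallel
{
    int id = omp_get_thread_num();
    results[id] = produce(id);                  // phase 1: each thread fills its own slot
    #pragma omp barrier                         // no thread proceeds until all have arrived
    consume(results[(id + 1) % omp_get_num_threads()]);   // phase 2: safe to read other slots
}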



Single Directive

- A single directive specifies a structured block that is executed by a single (arbitrary) thread in the parallel region.
- There is an implicit barrier at the end of the construct.

#pragma omp single [clause list]
    structured block
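
For instance, a minimal sketch (buf, init_buffer, and process are assumptions): one arbitrary thread performs the initialization, and the implicit barrier ensures no thread uses the buffer before it is ready.

#pragma omp parallel
{
    #pragma omp single
    init_buffer(buf);                      // executed by exactly one (arbitrary) thread
    // implicit barrier here: the team waits until buf is initialized
    process(buf, omp_get_thread_num());    // executed by every thread
}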



Master Directive

- The master directive is a specialization of the single directive in which only the master thread executes the structured block.
- There is no implicit barrier.

#pragma omp master
    structured block
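
A minimal sketch for comparison (uses only stdio and the OpenMP runtime): thread 0 reports the team size while the other threads continue past the construct without waiting.

#pragma omp parallel
{
    #pragma omp master
    printf("team of %d threads\n", omp_get_num_threads());   // thread 0 only
    // no implicit barrier: the other threads do not wait here
}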



Critical Sections
(#pragma omp critical)

- A critical section is a code segment that accesses a shared variable and needs to be executed as an atomic action.
- This means that in a group of cooperating processes/threads, at any given point in time only one may be executing its critical section.
- It forces threads to be mutually exclusive (mutex): only one thread at a time executes the given code section.

double area, pi, x;
int i, n;
...
area = 0.0;
for (i = 0; i < n; i++) {
    x = (i + 0.5) / n;             // can be calculated independently
    area += 4.0 / (1.0 + x * x);   // requires a mutex lock when parallelized
}
pi = area / n;



Critical Sections
(#pragma omp critical)

- If we simply parallelize the loop, a race condition may occur:

double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
    x = (i + 0.5) / n;
    area += 4.0 / (1.0 + x * x);   // not atomic
}
pi = area / n;



Critical Sections
(#pragma omp critical)

- Race condition:

Value of area    Thread A           Thread B
11.667           reads 11.667
                 + 3.765            reads 11.667
15.432           writes 15.432      + 3.563
15.230                              writes 15.230

- Thread A reads the value of area first
- Thread B reads the value of area before A can update it
- Thread A updates the value of area
- Thread B ignores A's update and writes its incorrect value to area
Critical Sections
(#pragma omp critical)

Race Condition
- A race condition is created when one process may "race ahead" of another and overwrite the change made by the first process to the shared variable.

[Figure: both threads execute area += 4.0/(1.0 + x*x); Thread A computes 15.432 and Thread B computes 15.230; area ends up as 15.230, but the answer should be 18.995.]


Critical Sections
(#pragma omp critical)

- Critical section: a portion of code that only one thread at a time may execute.
- We denote a critical section by putting the pragma
  #pragma omp critical [(name)]
  around it. The optional identifier name can be used to identify a critical region (an example follows below).
- This solves the problem but, since only one thread at a time may execute the guarded statement, that statement effectively becomes sequential code.

double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
    x = (i + 0.5) / n;
    #pragma omp critical
    area += 4.0 / (1.0 + x * x);
}
pi = area / n;
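
The optional name matters when a program has several unrelated shared updates; a hedged sketch (sum_a, sum_b, f, and g are assumptions for illustration): differently named critical sections use different locks and can run concurrently, whereas all unnamed critical sections share one global lock and would serialize against each other.

#pragma omp parallel for
for (i = 0; i < n; i++) {
    #pragma omp critical (update_a)
    sum_a += f(i);     // serializes only against other update_a sections
    #pragma omp critical (update_b)
    sum_b += g(i);     // separate lock: does not block update_a
}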
Atomic Directive

- The atomic directive specifies that a single memory location update should be performed as an atomic operation.

#pragma omp atomic
    update instruction, e.g., x++;
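
Applied to the running pi example (reusing area, x, i, and n from the critical-section slides), a sketch of atomic as an alternative to the critical section: the update fits the x binop= expr form that atomic accepts, and it is typically cheaper than a critical section because the compiler can map it to a hardware atomic instruction.

#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
    x = (i + 0.5) / n;
    #pragma omp atomic
    area += 4.0 / (1.0 + x * x);   // single memory update, performed atomically
}
pi = area / n;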



Environment Variables in OpenMP

- OpenMP provides additional environment variables that help control the execution of parallel programs:
  - OMP_NUM_THREADS
  - OMP_DYNAMIC
  - OMP_SCHEDULE
  - OMP_NESTED



Environment Variables in OpenMP

OMP_NUM_THREADS
- Specifies the default number of threads created upon entering a parallel region.
- The number of threads can be changed at run time using:
  - the omp_set_num_threads(int threads) routine, or
  - the num_threads clause, e.g., num_threads(4)

- Setting OMP_NUM_THREADS to 4 using bash (see the sketch below):
  export OMP_NUM_THREADS=4
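
A minimal self-contained sketch of the two run-time overrides (the thread counts are chosen arbitrarily): the routine sets the default for subsequent parallel regions, while the clause overrides it for one specific region.

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4);               // overrides OMP_NUM_THREADS for later regions
    #pragma omp parallel
    {
        #pragma omp single
        printf("region 1: %d threads\n", omp_get_num_threads());   // typically 4
    }
    #pragma omp parallel num_threads(2)   // clause overrides the routine for this region
    {
        #pragma omp single
        printf("region 2: %d threads\n", omp_get_num_threads());   // typically 2
    }
    return 0;
}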



Environment Variables in OpenMP

OMP_DYNAMIC
- When set to TRUE, allows the number of threads to be adjusted at runtime: OpenMP will use its dynamic adjustment algorithm to create the number of threads that may optimize system performance.
  - In case of TRUE, the total number of threads generated may not equal the number requested via the omp_set_num_threads() function or the num_threads clause.
  - In case of FALSE, the total number of threads generated in a parallel region is usually exactly as requested by the num_threads clause.
- OpenMP routines for setting/getting the dynamic status:
  - void omp_set_dynamic(int flag);   // disables dynamic adjustment if flag == 0
    - Should be called from outside a parallel region
  - int omp_get_dynamic();   // returns the current dynamic status
Environment Variables in OpenMP
OMP_DYNAMIC [dynamic.c]

workers = omp_get_max_threads();   // could also use omp_get_num_procs()
printf("%d maximum allowed threads\n", workers);
printf("total number of allocated cores: %d\n", omp_get_num_procs());
omp_set_dynamic(1);                // enable dynamic adjustment
omp_set_num_threads(8);            // request 8 threads
printf("total number requested when dynamic is true: %d\n", 8);
#pragma omp parallel
{
    #pragma omp single nowait
    printf("total threads in parallel region1 = %d\n", omp_get_num_threads());
    #pragma omp for
    for (i = 0; i < mult; i++)
    { a = complex_func(); }
}



Environment Variables in OpenMP
OMP_DYNAMIC [dynamic.c]

omp_set_dynamic(0);                // disable dynamic adjustment
omp_set_num_threads(8);            // request 8 threads
printf("total number requested when dynamic is false: %d\n", 8);
#pragma omp parallel
{
    #pragma omp single nowait
    printf("total threads in parallel region2 = %d\n", omp_get_num_threads());
    #pragma omp for
    for (i = 0; i < mult; i++)
    { a = complex_func(); }
}



Environment Variables in OpenMP

OMP_SCHEDULE
- Controls the assignment of iteration spaces associated with for directives that use the runtime scheduling class.
- Possible values: static, dynamic, and guided.
- An optional chunk size can also be given.
  - If the chunk size is not specified, a default chunk size of 1 is used.

- Setting OMP_SCHEDULE to guided with a minimum chunk size of 4 in an Ubuntu-based terminal:
  export OMP_SCHEDULE="guided,4"
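
On the code side, a sketch of the matching loop (process and n are assumptions): schedule(runtime) defers the scheduling decision to OMP_SCHEDULE, so the loop can be re-tuned without recompiling.

// run with: export OMP_SCHEDULE="guided,4"
#pragma omp parallel for schedule(runtime)
for (i = 0; i < n; i++)
    process(i);   // iterations distributed according to OMP_SCHEDULE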
Environment Variables in OpenMP

OMP_NESTED
- Default value is FALSE:
  - A parallel pragma nested inside another parallel region is then executed without creating a new thread team, so the nested region is effectively serialized.
- When TRUE:
  - Enables nested parallelism.
  - A parallel pragma nested inside another makes a new team of threads to execute the nested region.
- Use omp_set_nested(int val) with a non-zero value to set this variable to TRUE.
  - When called with 0 as the argument, it sets the variable to FALSE.
Environment Variables in OpenMP
OMP_NESTED [nested.c]

omp_set_nested(0);   // nested parallelism disabled
#pragma omp parallel num_threads(2)
{
    #pragma omp single
    printf("Level 1: number of threads in the team: %d\n", omp_get_num_threads());

    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        printf("Level 2: number of threads in the team: %d\n", omp_get_num_threads());
    }
}



Environment Variables in OpenMP
OMP_NESTED [nested.c]

omp_set_nested(1);   // nested parallelism enabled
#pragma omp parallel num_threads(2)
{
    #pragma omp single
    printf("Level 1: number of threads in the team: %d\n", omp_get_num_threads());

    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        printf("Level 2: number of threads in the team: %d\n", omp_get_num_threads());
    }
}
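
Expected behaviour, assuming the implementation honours the requests: with omp_set_nested(0), each of the two outer threads serializes its inner region, so "Level 2" reports a team of 1 (printed once per outer thread); with omp_set_nested(1), each outer thread forms its own inner team of 4, so "Level 2" reports 4 threads and eight threads run in total.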



Example
Computing Pi using the Monte Carlo method

Preliminary Idea:

    Pi = 4 x (points in circle / points in square)

- Here the circle has center (a, b) = (0.5, 0.5) and radius r = 0.5, i.e., it is inscribed in the unit square. Its area is pi*r^2 = pi/4 of the square's area, so the fraction of random points that land in the circle, multiplied by 4, approximates Pi.


Computing Pi using the Monte Carlo method

Steps
For all the random points:
1. Count the total points that fall inside the circle.
2. Divide the number of points inside the circle by the number of points inside the square.
   - The total number of random points is also the total number of points inside the square.
3. Multiply this fraction by 4.

- As the number of random points increases, the value of Pi approaches the real value (i.e., 3.14159...).



Computing Pi using the Monte Carlo method
Sequential Implementation

int niter = 100000000;   // 100 million points
count = 0;
srandom(time(0));        // seed the random number generator
for (i = 0; i < niter; ++i)
{
    // get a random point in the unit square
    x = (double)random() / RAND_MAX;
    y = (double)random() / RAND_MAX;
    z = ((x - 0.5) * (x - 0.5)) + ((y - 0.5) * (y - 0.5));
    // check whether the point is inside the circle (r^2 = 0.25)
    if (z < 0.25)
    {
        ++count;
    }
}
pi = ((double)count / (double)niter) * 4.0;   // pi = 4(m/n)
printf("Seq_Pi: %f\n", pi);



Computing Pi using the Monte Carlo method
(Parallel construct [parallel_pi.c])

#pragma omp parallel shared(niter) private(i, x, y, z, chunk_size, seed) reduction(+ : count)
{
    num_threads = omp_get_num_threads();
    chunk_size = niter / num_threads;
    seed = omp_get_thread_num();   // per-thread seed (unsigned int) for rand_r
    #pragma omp master
    { printf("chunk_size=%ld\n", chunk_size); }

    count = 0;
    for (i = 0; i < chunk_size; i++)
    {
        // get a random point in the unit square
        x = (double)rand_r(&seed) / (double)RAND_MAX;
        y = (double)rand_r(&seed) / (double)RAND_MAX;
        z = ((x - 0.5) * (x - 0.5)) + ((y - 0.5) * (y - 0.5));
        // check whether the point is inside the circle
        if (z < 0.25)
        {
            ++count;
        }
    }
}
pi = ((double)count / (double)niter) * 4.0;
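
Note that chunk_size = niter / num_threads truncates, so when niter is not divisible by the number of threads the loops generate slightly fewer than niter points while the final division still uses niter. For the values used here the error is negligible, but a robust version would divide by num_threads * chunk_size or distribute the remainder iterations.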
Parallelizing linked lists

Consider the following code:

current = head;
while (current->next != NULL) {
    complex_func(current->key);   // complex consumer function
    current = current->next;
}

- Assume that the complex function can be computed for each key value independently.
- The code can't be parallelized directly because:
  - We don't have OpenMP constructs to parallelize while loops, and the equivalent for loop doesn't have the canonical form, since we don't know the number of iterations in advance.
  - If we simply put an 'omp parallel' pragma before the while loop, the program semantics will not be preserved.
Parallelizing linked lists
[Naïve idea 1, with a logical error]

Consider the following code:

current = head;
#pragma omp parallel firstprivate(current)
{
    while (current->next != NULL) {
        complex_func(current->key);   // complex consumer function
        current = current->next;
    }
}

- Creates a team of threads, each with its own private 'current' variable.
- Each thread therefore executes the loop for all the nodes in the list.
- This means every thread performs work equal to the sequential code.
- No speedup is achieved; this will rather increase execution time.


Parallelizing linked lists
[Naïve idea 2, with a logical error]

Consider the following code:

current = head;
#pragma omp parallel shared(current)
{
    while (current->next != NULL) {     // line 1
        complex_func(current->key);     // complex consumer function
        current = current->next;        // line 3
    }
}

- Creates a team of threads sharing the same 'current' variable.
- In the first while iteration, complex_func may be called by each thread with the same key value.
- Semantics/atomicity will not be ensured (i.e., multiple threads executing line 3 can change the line 1 result for other threads).
- So the output may not be as assumed.
Parallelizing linked lists
[Naïve but correct parallelization]

Observations:
1. We don't know in advance the number of nodes in the list.
2. We also don't know how to access all the nodes in parallel, because a linked list can only be traversed sequentially.
3. We can parallelize it using the following steps:
   1. Count the number of nodes in the list; call it 'C'.
   2. Allocate a dynamic array of pointers-to-list of size 'C'. Using a loop, copy the address of the i-th node to the i-th element of the pointer array.
   3. Use a for loop that iterates over this array of pointers. Furthermore, this for loop can be parallelized.


Parallelizing linked lists
[Naïve but correct parallelization]

1. Count the number of nodes in the list; call it 'C'.

// struct LIST { int key; LIST* next; };
// Here assume head is a pointer to the start of the list.
int C = 0;
LIST *p = head;
while (p != NULL) {
    p = p->next;
    C++;
}



Parallelizing linked lists
[Naïve but correct parallelization]

2. Allocate a dynamic array of pointers-to-list of size 'C' and, using a loop, copy the address of the i-th node to the i-th element of the pointer array.

LIST **Parray = new LIST*[C];
int i = 0;
p = head;
while (p != NULL) {
    Parray[i++] = p;   // copy the address of the i-th node
    p = p->next;
}



Parallelizing linked lists
[Naïve but correct parallelization]

3. Now we can use a for loop that iterates over this array of pointers. Furthermore, this for loop can be parallelized.

#pragma omp parallel for schedule(static, 1)
for (i = 0; i < C; i++) {
    complex_func(Parray[i]->key);
}

- This method can yield speedups only if the tasks are complex enough to overcome the data-movement costs.
- Usually, data movements are more costly than the computations.
- So we need to devise another solution.



Parallelizing linked lists
[A relatively better implementation]

// omptask.c and tasktime.c   // compile with g++ (add -fopenmp)
#pragma omp parallel
{
    #pragma omp single   // a single thread traverses the list
    {
        current = head;
        while (current->next != NULL) {
            // the following pragma creates a task and adds it to the logical task pool;
            // the other threads in the team pick up and execute the queued tasks
            #pragma omp task firstprivate(current)
            complex_func(current->key);

            current = current->next;
        }
    }
}

Test run: total threads = 4, 100 million iterations per complex_func call, list size = 10 nodes.
Parallelizing linked lists
[omp task illustration]

[Figure: illustration of omp task creation and execution by the thread team was here in the original slides.]


Questions



