
Parallel Programming

with OPENMP
OPENMP: Motivation
A sequential program uses a single core/processor while all other processors sit idle.
Adding OpenMP pragmas lets a program use all available processors in parallel.
OPENMP: Motivation
https://www.openmp.org/about/whos-using-openmp/
Parallel Gaussian Elimination using OpenMP on 4 processors.
Matlab – TSA & NaN toolbox:
OpenMP is used in two core functions (sumskipnan_mex & covm_mex), which compute the sum and the covariance matrix and count the samples that are not NaN. A speedup of 11 has been observed on multicore machines with 12 cores.
OPENMP: Overview
• A collection of compiler directives and library functions for creating parallel programs for shared-memory computers.
• The "MP" in OpenMP stands for "multi-processing" (shared-memory parallel computing).
• Combined with C, C++, or Fortran to create a multithreading programming language in which all threads are assumed to share a single address space.
• Based on the fork/join programming model: all programs start as a single (master) thread, fork additional threads where parallelism is desired (the parallel region), then join back together.
• Version 1.0 was released for Fortran in 1997, with C and C++ supported thereafter; the current version is 5.0 (2018).
OpenMP: Goals
• Standardization: Provide a standard among a variety of shared-memory architectures/platforms.
• Lean and Mean: Establish a simple and limited set of directives for programming shared-memory machines. Significant parallelism can be implemented by using just 3 or 4 directives.
• Ease of Use: Provide the capability to incrementally parallelize a serial program, and to implement both coarse-grained and fine-grained parallelism.
• Portability: Supports Fortran (77, 90, 95…), C, and C++. Public forum for the API and membership.
OpenMP: Core Elements
OPENMP #pragma
Pragmas are special preprocessor instructions, typically added to allow behaviors that aren't part of the basic C specification. Compilers that don't support a given pragma simply ignore it.
#pragma omp parallel
How many threads? (OMP_NUM_THREADS)

#pragma omp parallel
{
    Parallel Region
}

Code within the parallel region is executed in parallel on all processors/threads. Each thread (P0 – Thread 0, P1 – Thread 1, P2 – Thread 2, P3 – Thread 3) has its own private stack and private variables, while all threads read and write the shared variables.
OpenMP - #pragma
Hello World - OpenMP
The master thread executes sequentially until the first parallel region is encountered.
Fork: the master thread creates a team of parallel threads, numbered from 0 to N-1, which execute the structured block of code in the parallel region.
Join: the team of threads complete the statements in the parallel region, synchronize and terminate; there is an implicit barrier at the end of a parallel section.
Parallelism is added incrementally until performance goals are met.
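A minimal sketch of the Hello World program this slide describes (the exact output format is illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Fork: the master thread creates a team; each thread runs the block below. */
        #pragma omp parallel
        {
            int rank = omp_get_thread_num();    /* this thread's number, 0..N-1 */
            int size = omp_get_num_threads();   /* number of threads in the team */
            printf("Hello World from thread %d of %d\n", rank, size);
        }   /* Join: implicit barrier, then only the master thread continues. */
        return 0;
    }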
OPENMP: Basic functions
Each thread has its own stack, so it will have its own private (local) variables.
Each thread gets its own rank - omp_get_thread_num().
The number of threads in the team - omp_get_num_threads().
In OpenMP, stdout is shared among the threads, so each thread can execute the printf statement.
There is no scheduling of access to stdout, so the output is non-deterministic.
OPENMP: Run Time Functions
Create a 4-thread parallel region:
Statements in the program that are enclosed by the parallel region construct are executed in parallel among the various team threads.
Each thread calls pooh(ID, A) for ID = 0 to 3.
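A minimal sketch of the code this slide describes; pooh() is assumed to be a user-defined work function and A some shared array, so the body below is only a stand-in:

    #include <stdio.h>
    #include <omp.h>

    void pooh(int ID, double *A) { printf("pooh called by thread %d\n", ID); }  /* stand-in body */

    double A[1000];

    int main(void)
    {
        omp_set_num_threads(4);      /* request a team of 4 threads */
        #pragma omp parallel
        {
            int ID = omp_get_thread_num();
            pooh(ID, A);             /* each thread calls pooh with its own ID (0..3) */
        }
        return 0;
    }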


OpenMP Run Time Functions
Modify/check/get info about the number of threads:
• omp_get_num_threads() // number of threads in use
• omp_get_thread_num() // tells which thread you are
• omp_get_max_threads() // max threads that can be used
Are we in a parallel region? omp_in_parallel()
How many processors in the system? omp_get_num_procs()
Set explicit locks, and several more...
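A short sketch querying these functions (the printed labels are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("procs=%d  max threads=%d  in parallel?=%d\n",
               omp_get_num_procs(), omp_get_max_threads(), omp_in_parallel());

        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)    /* report once, from thread 0 */
                printf("inside region: %d threads, in parallel?=%d\n",
                       omp_get_num_threads(), omp_in_parallel());
        }
        return 0;
    }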

OpenMP Environment Variables
Use the SET command in Windows or the printenv command in Linux to see the current environment variables.
OMP_NUM_THREADS: Sets the maximum number of threads in the parallel region, unless overridden by omp_set_num_threads or a num_threads clause.
OpenMP parallel regions

Branching in or out of a structured block is not allowed!


OpenMP parallel regions
Serial code – variable declarations, functions, etc.

int a, b, c = 0;
float x = 1.0;

#pragma omp parallel num_threads(8) private(a) …
{
    /* My parallel region (piece of code) */
    int i = 5;
    int j = 10;
    int a = threadNumber;
}

Clauses on the parallel directive answer:
• When should I execute this code in parallel? – the if clause
• Which variables are local to each thread? – the private clause
• Which variables are shared across all threads? – the shared clause
• How many threads, i.e., copies of the parallel region, should execute? – the num_threads clause
These clauses are combined in the sketch below.
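A short sketch combining these clauses (the threshold n > 1000 and the function name work are illustrative):

    #include <omp.h>

    void work(int n)
    {
        int a = 0;
        /* Execute in parallel only when n is large enough (if clause),
           with 8 threads (num_threads) and a per-thread copy of a (private). */
        #pragma omp parallel if(n > 1000) num_threads(8) private(a)
        {
            a = omp_get_thread_num();   /* each thread writes its own private a */
        }
    }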
OPENMP: Variable Scope
• In OpenMP, scope refers to the set of threads that can see
a variable in a parallel block.
• A general rule is that any variable declared outside of a
parallel region has a shared scope. In some sense, the
“default” variable scope is shared.
• When a variable can be seen/read/written by all threads
in a team, it is said to have shared scope;
• A variable that can be seen by only one thread is said to
have private scope. Each thread has a copy of the private
variable.
• Loop variables in an omp for are private
• Local variables in the parallel region are private
• Change default behavior by using the clause
default(shared) or default(private)
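A small sketch of these default scoping rules (the variable names are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int n = 100;                            /* declared outside: shared by default */
        #pragma omp parallel
        {
            int local = omp_get_thread_num();   /* declared inside: private to each thread */
            printf("thread %d sees shared n = %d\n", local, n);
        }
        return 0;
    }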
OpenMP: Data Scoping
Challenge in shared-memory parallelization => managing the data environment (scoping).
OpenMP shared variable: can be read/written by all threads in the team.
OpenMP private variable: each thread has its own local copy of this variable.
Loop variables in an omp for are private; local variables in the parallel region are private.

int i;
int j;
#pragma omp parallel private(j)
{
    int k;
    i = …   /* shared  */
    j = …   /* private */
    k = …   /* private */
}

Alter the default behaviour with the default clause:
#pragma omp parallel default(shared) private(x)
{ ... }
#pragma omp parallel default(private) shared(matrix)
{ ... }
OpenMP: private Clause
• Reproduces the private variable for each thread.
• Private variables are not initialized.
• The value that Thread 1 stores in x is different from the value Thread 2 stores in x.
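A minimal sketch of the private clause (the variable x and the values are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int x = 42;                       /* value of the shared x outside the region */
        #pragma omp parallel private(x)
        {
            /* x here is a new, uninitialized private copy, not 42 */
            x = omp_get_thread_num();     /* each thread stores its own value */
            printf("thread %d: x = %d\n", omp_get_thread_num(), x);
        }
        printf("after region: x = %d\n", x);   /* still 42 */
        return 0;
    }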
OpenMP: firstprivate Clause
• Creates a private memory location of iper for each thread.
• Copies the value from the master thread into each private copy.
• While the initial value is the same, it can be changed by the threads, so Thread 0, Thread 1, Thread 2, … may subsequently hold different values of the firstprivate variable.
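A short sketch of firstprivate, keeping the variable name iper from the slide:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int iper = 10;                         /* set by the master thread */
        #pragma omp parallel firstprivate(iper)
        {
            /* every thread starts with its private iper == 10 ... */
            iper += omp_get_thread_num();      /* ...but may change it independently */
            printf("thread %d: iper = %d\n", omp_get_thread_num(), iper);
        }
        return 0;
    }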
OpenMP: Clauses & Data Scoping
Schedule clause
Data sharing/scope

Matrix Vector Multiplication

#pragma omp parallel num_threads(4)
for (i = 0; i < SIZE; i++)
{
    y[i] = 0.0;
    for (j = 0; j < SIZE; j++)
        y[i] += (A[i][j] * x[j]);
}

Is this reasonable? (Note: the directive is plain parallel, not parallel for, so every thread executes the entire loop.)
Matrix Vector Multiplication

#pragma omp parallel shared(A,x,y,SIZE) \
        private(tid,i,j,istart,iend)
{
    tid = omp_get_thread_num();
    int nid = omp_get_num_threads();
    istart = tid * SIZE / nid;
    iend   = (tid + 1) * SIZE / nid;

    for (i = istart; i < iend; i++)
    {
        for (j = 0; j < SIZE; j++)
            y[i] += (A[i][j] * x[j]);

        printf(" thread %d did row %d\t y[%d]=%.2f\t", tid, i, i, y[i]);
    }
} /* end of parallel construct */

Matrix rows = N (= 8)
Number of threads = T (= 4)
Number of rows processed by each thread = N/T
Thread 0 => rows 0, 1, 2, 3, … (N/T – 1)
Thread 1 => rows N/T, N/T + 1, … 2*N/T – 1
Thread t => rows t*N/T, t*N/T + 1, … ((t+1)*N/T – 1)
Matrix Vector Multiplication

omp_set_num_threads(4);
#pragma omp parallel shared(A,x,y,SIZE)
{
    #pragma omp for
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
            y[i] += (A[i][j] * x[j]);
    }
} /* end of parallel construct */

#pragma omp for must be inside a parallel region (#pragma omp parallel).
No new threads are created; the threads already created in the enclosing parallel region are used.
The system automatically parallelizes the for loop by dividing the iterations of the loop among the threads.
The user can control how the loop iterations are divided among the threads with the schedule clause, as sketched below.

Matrix rows = N (= 8)
Number of threads = T (= 4)
Number of rows processed by each thread = N/T
Thread 0 => rows 0, 1, 2, 3, … (N/T – 1)
Thread 1 => rows N/T, N/T + 1, … 2*N/T – 1
Thread t => rows t*N/T, t*N/T + 1, … ((t+1)*N/T – 1)
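A minimal sketch of the schedule clause (the chunk size of 2 and the function name matvec_scheduled are illustrative):

    #include <omp.h>
    #define SIZE 8

    double A[SIZE][SIZE], x[SIZE], y[SIZE];

    void matvec_scheduled(void)
    {
        #pragma omp parallel shared(A, x, y)
        {
            /* hand out iterations in round-robin chunks of 2 rows per thread */
            #pragma omp for schedule(static, 2)
            for (int i = 0; i < SIZE; i++)
                for (int j = 0; j < SIZE; j++)
                    y[i] += A[i][j] * x[j];
        }
        /* schedule(dynamic, 2) would instead let idle threads grab the next
           chunk at run time, useful when iterations have uneven cost. */
    }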

#pragma omp for  /  #pragma omp parallel for
User-controlled variable scope.
OpenMP takes care of partitioning the iteration space for you.
Threads are assigned independent sets of iterations.
There is no implied barrier upon entry to a work-sharing construct.
There is an implied barrier at the end of a work-sharing construct, as illustrated below.
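A sketch of the implied barrier and of the nowait clause that removes it when two loops are independent (the array and function names are illustrative):

    #include <omp.h>
    #define N 1000

    double a[N], b[N];

    void two_loops(void)
    {
        #pragma omp parallel
        {
            #pragma omp for nowait     /* no barrier here: threads move on immediately */
            for (int i = 0; i < N; i++)
                a[i] = i * 0.5;

            #pragma omp for            /* implied barrier at the end of this loop */
            for (int i = 0; i < N; i++)
                b[i] = 2.0 * i;
        }   /* implicit barrier at the end of the parallel region */
    }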
OpenMP: Work Sharing
Data parallelism
• A large number of data elements, where each data element (or possibly a subset of elements) needs to be processed to produce a result. When this processing can be done in parallel, we have data parallelism (for loops).
Task parallelism
• A collection of tasks that need to be completed. If these tasks can be performed in parallel, you are faced with a task-parallel job.
Work Sharing: omp for
Computing π by the method of numerical integration
Since the integral of 4/(1+x^2) over [0,1] equals π, divide the interval [0,1] on the x axis into N parts.
The area of each rectangle is Δx * y, where Δx = 1/N and y = 4/(1+x^2), i.e. [1/N] * 4/(1+x^2).
x is approximated as the midpoint of the interval, (x_i + x_{i+1})/2, before computing y.
Serial Code

static long num_steps = 100000;
double step;
void main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (i = 0; i < num_steps; i++)
    {
        x = (i + 0.5) * step;             /* task 1: area of an individual rectangle */
        sum = sum + 4.0 / (1.0 + x*x);    /* task 2: add the areas of the rectangles */
    }
    pi = step * sum;
}

1. Computation of the areas of individual rectangles.
2. Adding the areas of the rectangles.
There is no communication among the tasks in the first collection, but each task in the first collection communicates with task 2.
Computing π by the method of Numerical Integration

Serial Code

static long num_steps = 100000;
double step;
void main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
}

Parallel Code

#include <omp.h>
#define NUM_THREADS 4
static long num_steps = 100000;
double step;
void main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for shared(sum) private(x)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
}


Race Condition

#pragma omp parallel for shared(global_result) private(x, myresult)
for (i = 0; i < num_steps; i++) {
    x = (i + 0.5) * step;
    myresult = 4.0 / (1.0 + x*x);
    global_result += myresult;
}

Unpredictable results occur when two (or more) threads attempt to execute simultaneously:
    global_result += myresult;
Handling Race Conditions
Mutual exclusion: only one thread at a time executes the statement.
Thread 0:  global_result += 2
Thread 1:  global_result += 3
Thread 2:  global_result += 4
Pretty much sequential.
Handling Race Conditions

omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for shared(sum) private(x)
for (i = 0; i < num_steps; i++) {
    x = (i + 0.5) * step;
    #pragma omp critical
    sum = sum + 4.0 / (1.0 + x*x);
}

Mutual exclusion: only one thread at a time executes the statement
    sum = sum + 4.0 / (1.0 + x*x);

Use synchronization to protect data conflicts:
    Mutual exclusion (#pragma omp critical)
    Mutual exclusion (#pragma omp atomic)
Synchronization can be expensive, so change how data is accessed to minimize the need for synchronization, for example as in the atomic sketch below.
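A minimal sketch of #pragma omp atomic, which protects a single memory update at lower cost than a critical section; accumulating into a thread-private partial sum first is one common way to reduce synchronization (the function name compute_pi is illustrative):

    #include <omp.h>

    static long num_steps = 100000;

    double compute_pi(void)
    {
        double step = 1.0 / (double) num_steps;
        double sum = 0.0;

        #pragma omp parallel
        {
            double local = 0.0;                  /* per-thread partial sum */
            #pragma omp for
            for (long i = 0; i < num_steps; i++) {
                double x = (i + 0.5) * step;
                local += 4.0 / (1.0 + x * x);
            }
            #pragma omp atomic                   /* one protected update per thread */
            sum += local;
        }
        return step * sum;
    }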
OpenMP: Reduction

sum = 0;
omp_set_num_threads(8);
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 16; i++)
{
    sum += a[i];
}

Thread 0 => iterations 0 & 1
Thread 1 => iterations 2 & 3
………
Each thread accumulates into a thread-local/private copy of sum.
One or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region.
#pragma omp for reduction(operator : var)
Operator: + , * , - , & , | , && , || , ^
Combines the multiple local copies of var from the threads into a single copy at the master.
Computing π by the method of Numerical Integration

Serial Code

static long num_steps = 100000;
double step;
void main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
}

Parallel Code

#include <omp.h>
#define NUM_THREADS 4
static long num_steps = 100000;
double step;
void main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
}


omp for Parallelization

Can all loops be parallelized? Loop iterations have to be independent.

Simple test: if the results differ when the loop is executed backwards, the loop cannot be parallelized!

for (int i = 2; i < 10; i++)
{
    x[i] = a * x[i-1] + b;    /* loop-carried dependence on x[i-1] */
}

Between 2 synchronization points, if at least 1 thread writes to a memory location that at least 1 other thread reads from, the result is non-deterministic.
Recap
What is OpenMP?
Fork/join programming model
OpenMP core elements
    #pragma omp parallel (the parallel construct)
    run time functions
    environment variables
    data scoping (private, shared, …)
    work-sharing constructs
        #pragma omp for
        sections
        tasks
        schedule clause
    synchronization
Compile and run an OpenMP program in C++ and Fortran (see the commands below).
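A sketch of how such a program is typically built and run with the GNU compilers (the file names are illustrative):

    gcc -fopenmp hello.c -o hello           (C)
    g++ -fopenmp hello.cpp -o hello         (C++)
    gfortran -fopenmp hello.f90 -o hello    (Fortran)
    OMP_NUM_THREADS=4 ./hello               (run with 4 threads)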
