
Parallel and Distributed Programming
Dr. Muhammad Naveed Akhtar
Lecture – 06
Shared Memory Programming with OpenMP

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 2


Roadmap
• Writing programs that use OpenMP.
• Using OpenMP to parallelize many serial for loops with only small changes to the source code.
• Task parallelism.
• Explicit thread synchronization.
• Standard problems in shared-memory programming.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 3


Distributed and Shared Memory Systems

Shared Memory System

Distributed Memory System

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 4


OpenMP
• An API for shared-memory parallel programming.
• MP = multiprocessing
• Designed for systems in which each thread or process can potentially have access to all available
memory.
• The system is viewed as a collection of cores or CPUs, all of which have access to main memory.

#pragma
• Special preprocessor instructions.
• Typically added to a system to allow
behaviors that aren’t part of the basic C
specification.
• Compilers that don’t support the
pragmas ignore them.
Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 5
Hello World!

The #include <omp.h> header declares the various OpenMP functions, constants, types, etc.
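A minimal sketch of such a Hello World program (omp_hello.c), assuming the thread count is passed on the command line as in the compile/run example on the next slide:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>                 /* declares the OpenMP functions, constants, types */

    void Hello(void);                /* function executed by every thread */

    int main(int argc, char* argv[]) {
       int thread_count = strtol(argv[1], NULL, 10);

       #pragma omp parallel num_threads(thread_count)
       Hello();

       return 0;
    }

    void Hello(void) {
       int my_rank = omp_get_thread_num();
       int thread_count = omp_get_num_threads();
       printf("Hello from thread %d of %d\n", my_rank, thread_count);
    }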

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 6


Compiling and Running an OpenMP Program

Compile (gcc compiler):

    gcc -g -Wall -fopenmp -o omp_hello omp_hello.c

    -g             produce debugging information
    -Wall          turn on all warnings
    -fopenmp       enable OpenMP support in the compiler
    -o omp_hello   create this executable file name
    omp_hello.c    source file

Execute (specify the thread count):

    ./omp_hello <thread count>

Execute with 4 threads:

    ./omp_hello 4
    Hello from thread 2 of 4
    Hello from thread 3 of 4
    Hello from thread 0 of 4
    Hello from thread 1 of 4

Execute with only 1 thread:

    ./omp_hello 1
    Hello from thread 0 of 1

Execute without specifying threads:

    ./omp_hello
    Segmentation fault (core dumped)
Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 7
OpenMP fork/join Model

# pragma omp parallel

• A fundamental concept in OpenMP is the parallel region


• An OpenMP program starts executing using one master thread
• When it hits a parallel region directive, it spawns a team of slave threads which execute the code in parallel

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 8


OpenMP parallel pragma and clauses
• OpenMP Parallel Pragma
# pragma omp parallel
• Most basic parallel directive.
• The number of threads that run the following structured block of code is determined by the run-time
system.

• Clause
# pragma omp parallel num_threads ( thread_count )
• Text that modifies a directive.
• The num_threads clause can be added to a parallel directive.
• It allows the programmer to specify the number of threads that should execute the following block.
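For example (a small sketch; thread_count is assumed to be an int set earlier, e.g. from the command line):

    # pragma omp parallel num_threads(thread_count)
    {
       /* this structured block is executed by a team of (at most) thread_count threads */
       printf("Hello from thread %d of %d\n",
              omp_get_thread_num(), omp_get_num_threads());
    }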

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 9


Key Points
• There may be system-defined limitations on the number of threads that a program can start.
• The OpenMP standard doesn’t guarantee that this will actually start thread_count threads.
• Most current systems can start hundreds or even thousands of threads.
• Unless we’re trying to start a lot of threads, we will almost always get the desired number of
threads.
• In OpenMP parlance the collection of threads executing the parallel block — the original thread
and the new threads — is called a team, the original thread is called the master, and the
additional threads are called slaves.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 10


In case the compiler doesn’t support OpenMP
Unguarded code:

    # include <omp.h>
    ...
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();

Guarded with the _OPENMP macro:

    # ifdef _OPENMP
    #    include <omp.h>
    # endif
    ...
    # ifdef _OPENMP
       int my_rank = omp_get_thread_num();
       int thread_count = omp_get_num_threads();
    # else
       int my_rank = 0;
       int thread_count = 1;
    # endif

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 11


The Trapezoidal Rule

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 12


The Trapezoidal Rule

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 13


Pseudo-Code for a serial program
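A sketch of the serial computation, using h = (b - a)/n and the fact that each trapezoid contributes h*[f(x_i) + f(x_(i+1))]/2 to the sum:

    /* Serial trapezoidal rule: approximate the integral of f from a to b with n trapezoids */
    h = (b - a) / n;
    approx = (f(a) + f(b)) / 2.0;
    for (i = 1; i <= n - 1; i++) {
       x_i = a + i * h;
       approx += f(x_i);
    }
    approx = h * approx;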

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 14


A First OpenMP Version
1. We identified two types of tasks:
   a. computation of the areas of individual trapezoids, and
   b. adding the areas of the trapezoids.
2. There is no communication among the tasks in the first collection, but each task in the first collection communicates with task 1b.
3. We assumed that there would be many more trapezoids than cores.
   • So we aggregated tasks by assigning a contiguous block of trapezoids to each thread (and a single thread to each core).

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 15


Unpredictable Results (Race Condition)

Unpredictable results occur when two (or more) threads attempt to execute this statement simultaneously:

    global_result += my_result;

We need mutual exclusion:

    # pragma omp critical
    global_result += my_result;

The critical directive ensures that only one thread can execute the following structured block at a time.
Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 16
First version of OpenMP Program
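A sketch of this first version, in the style of the textbook's omp_trap program: each thread computes its share of the sum in a Trap function and adds it to *global_result_p inside a critical section (error checking and the function f are omitted):

    void Trap(double a, double b, int n, double* global_result_p) {
       double h, x, my_result;
       double local_a, local_b;
       int i, local_n;
       int my_rank = omp_get_thread_num();
       int thread_count = omp_get_num_threads();

       h = (b - a) / n;                      /* width of each trapezoid               */
       local_n = n / thread_count;           /* trapezoids handled by this thread     */
       local_a = a + my_rank * local_n * h;  /* left endpoint for this thread         */
       local_b = local_a + local_n * h;      /* right endpoint for this thread        */
       my_result = (f(local_a) + f(local_b)) / 2.0;
       for (i = 1; i <= local_n - 1; i++) {
          x = local_a + i * h;
          my_result += f(x);
       }
       my_result = my_result * h;

       # pragma omp critical                 /* only one thread updates at a time     */
       *global_result_p += my_result;
    }

    /* called from main as: */
    double global_result = 0.0;
    # pragma omp parallel num_threads(thread_count)
    Trap(a, b, n, &global_result);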

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 17


First version of OpenMP Program (cont.)

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 18


Scope of Variables

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 19


Scope
• In serial programming, the scope of a variable consists of those parts of a program in which the
variable can be used.
• In OpenMP, the scope of a variable refers to the set of threads that can access the variable in a
parallel block.
• A variable that can be accessed by all the threads in the team has shared scope.
• A variable that can only be accessed by a single thread has private scope.
• The default scope for variables declared before a parallel block is shared.
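A small illustrative sketch (the variable names are just for illustration):

    int x = 5;                                  /* declared before the block: shared  */
    # pragma omp parallel num_threads(thread_count)
    {
       int my_rank = omp_get_thread_num();      /* declared inside the block: private */
       printf("Thread %d sees x = %d\n", my_rank, x);
    }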

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 20


The Reduction Clause

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 21


How to update the global trapezoidal sum?

• We are using this complex version (passing global_result by pointer into the function) so that each thread's local calculation is added to global_result inside the function's critical section.
• Although we'd prefer the simpler version, in which the function just returns the thread's local result.
• But if each thread then adds its return value to global_result on its own, there's no critical section, and we are back to a race condition.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 22


If we fix it by simply wrapping the function call in a critical directive, we force the threads to execute the function calls sequentially.

We can avoid this problem by declaring a private variable inside the parallel block and moving the critical section to after the function call:
• the call that does the real work updates a variable private to each thread, so it executes in parallel;
• only the final update of global_result is a critical section (one thread at a time).
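A sketch of the fixed version, assuming Local_trap(a, b, n) returns this thread's share of the sum:

    global_result = 0.0;
    # pragma omp parallel num_threads(thread_count)
    {
       double my_result = 0.0;             /* private to each thread  */
       my_result += Local_trap(a, b, n);   /* executes in parallel    */
       # pragma omp critical
       global_result += my_result;         /* one thread at a time    */
    }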
Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 23
Opinion of our Genius Students

I don’t like it.

Neither do I.

I think we can do better.
Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 24


Reduction operators (A Clean Approach)
• A reduction operator is a binary operation (such as addition or multiplication).
• A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in
order to get a single result.
• All of the intermediate results of the operation should be stored in the same variable: the reduction variable.

Reduction clause inside parallel directive.


+, *, -, &, |, ^, &&, ||

OLD

NEW
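The clause has the form reduction(<operator>: <variable list>) and is placed on the parallel directive. A sketch of the OLD version (private variable plus an explicit critical section) versus the NEW version (reduction clause), again assuming a Local_trap function:

    /* OLD: private variable plus an explicit critical section */
    global_result = 0.0;
    # pragma omp parallel num_threads(thread_count)
    {
       double my_result = Local_trap(a, b, n);
       # pragma omp critical
       global_result += my_result;
    }

    /* NEW: reduction clause on the parallel directive */
    global_result = 0.0;
    # pragma omp parallel num_threads(thread_count) reduction(+: global_result)
    global_result += Local_trap(a, b, n);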

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 25


The “Parallel For” Directive

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 26


Parallel for
• Forks a team of threads to execute the following structured block.
• However, the structured block following the parallel for directive must be a for loop.
• Furthermore, with the parallel for directive the system parallelizes the for loop by dividing the
iterations of the loop among the threads.

OpenMP Exclusive (Parallel for)

Serial Approach
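As a sketch, the trapezoidal rule again, first serial and then with a parallel for directive (the reduction clause handles the shared variable approx):

    /* Serial approach */
    h = (b - a) / n;
    approx = (f(a) + f(b)) / 2.0;
    for (i = 1; i <= n - 1; i++)
       approx += f(a + i * h);
    approx = h * approx;

    /* OpenMP parallel for */
    h = (b - a) / n;
    approx = (f(a) + f(b)) / 2.0;
    # pragma omp parallel for num_threads(thread_count) reduction(+: approx)
    for (i = 1; i <= n - 1; i++)
       approx += f(a + i * h);
    approx = h * approx;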

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 27


How Does OpenMP Do it?
1. The reduction variable is shared.
2. OpenMP creates a private local variable for each thread.
3. The local variables are initialized to the identity value for the given operator (e.g., 0 for +, 1 for *).
4. After the parallel block ends, the values in the private variables are combined into the shared variable.

Legal forms for parallelizable for statements
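In outline, a parallelizable for statement must have the canonical form:

    for (index = start;
         index < end;      /* or <=, >, >=                               */
         index++)          /* or ++index, index--, --index,              */
                           /*    index += incr, index -= incr,           */
                           /*    index = index + incr, index = index - incr */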

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 28


Caveats
• The variable index must have integer or pointer type (e.g., it can’t be a float).
• The expressions start, end, and increment must have a compatible type. For example, if index is a
pointer, then increment must have integer type.
• The expressions start, end, and increment must not change during execution of the loop.
• During execution of the loop, the variable index can only be modified by the “increment
expression” in the for statement.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 29


Data dependencies
First 10 Fibonacci Numbers

What happened?
• OpenMP compilers don’t check for dependences among iterations in
a loop that’s being parallelized with a parallel for directive.
• A loop in which the results of one or more iterations depend on other
iterations cannot, in general, be correctly parallelized by OpenMP.
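A sketch of the loop in question: fibo[i] depends on the two previous iterations, so dividing the iterations among threads can give wrong values for the later entries.

    fibo[0] = fibo[1] = 1;
    # pragma omp parallel for num_threads(thread_count)
    for (i = 2; i < n; i++)
       fibo[i] = fibo[i - 1] + fibo[i - 2];   /* loop-carried dependence */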

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 30


Estimating value of π

Serial Solution

OpenMP solution #1

loop dependency

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 31


Estimating value of π (cont.)
OpenMP solution #1

Ensures factor has private scope.

OpenMP solution #2
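A sketch of the two attempts. The first carries factor from one iteration to the next (a loop dependency); the corrected version computes factor directly from k and gives it private scope:

    double factor = 1.0, sum = 0.0, pi_approx;

    /* First attempt: has a loop dependency, results are wrong */
    # pragma omp parallel for num_threads(thread_count) reduction(+: sum)
    for (k = 0; k < n; k++) {
       sum += factor / (2 * k + 1);
       factor = -factor;                       /* depends on the previous iteration */
    }

    /* Corrected version: factor computed from k and made private */
    sum = 0.0;
    # pragma omp parallel for num_threads(thread_count) \
          reduction(+: sum) private(factor)
    for (k = 0; k < n; k++) {
       factor = (k % 2 == 0) ? 1.0 : -1.0;     /* ensures factor has private scope  */
       sum += factor / (2 * k + 1);
    }
    pi_approx = 4.0 * sum;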
Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 32
The default clause
• Lets the programmer specify the scope of each variable in a block.
default (none)
• With this clause the compiler will require that we specify the scope of each variable we use in the block
and that has been declared outside the block.
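For example, a sketch of the π loop with every variable's scope made explicit:

    int k;
    double factor, sum = 0.0;
    # pragma omp parallel for num_threads(thread_count) \
          default(none) reduction(+: sum) private(k, factor) shared(n)
    for (k = 0; k < n; k++) {
       factor = (k % 2 == 0) ? 1.0 : -1.0;
       sum += factor / (2 * k + 1);
    }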

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 33


More About Loops in OpenMP: Sorting

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 34


Serial Bubble & Odd-Even Transposition Sorts

Serial Bubble Sort

Serial Odd-Even Transposition Sort

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 35


First OpenMP Odd-Even Sort

Even phase ……
Odd phase ……

Serial code for the even phase and the odd phase of each outer iteration.
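A sketch of the first OpenMP version: a new team of threads is forked for every phase (a is the array being sorted, n its length).

    int phase, i, tmp;
    for (phase = 0; phase < n; phase++) {
       if (phase % 2 == 0) {                       /* even phase */
          # pragma omp parallel for num_threads(thread_count) \
                default(none) shared(a, n) private(i, tmp)
          for (i = 1; i < n; i += 2) {
             if (a[i - 1] > a[i]) {
                tmp = a[i - 1];  a[i - 1] = a[i];  a[i] = tmp;
             }
          }
       } else {                                    /* odd phase */
          # pragma omp parallel for num_threads(thread_count) \
                default(none) shared(a, n) private(i, tmp)
          for (i = 1; i < n - 1; i += 2) {
             if (a[i] > a[i + 1]) {
                tmp = a[i];  a[i] = a[i + 1];  a[i + 1] = tmp;
             }
          }
       }
    }

The second version forks the team only once, outside the phase loop, and uses two # pragma omp for directives inside it, avoiding the repeated fork/join overhead; that is what the timing comparison on the next slide refers to.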

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 36


1st vs 2nd OpenMP Odd-Even Sort

Times are in seconds.


Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 37
Scheduling Loops

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 38


Scheduling Loops

We want to parallelize this loop. Our definition of function f.

Assignment of work using cyclic partitioning (round robin).


t = thread count; n = item count (loop max)
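A sketch in the style of the textbook's example (needs #include <math.h>); with cyclic partitioning, thread q handles the iterations i = q, q + t, q + 2t, … :

    /* Loop we want to parallelize */
    sum = 0.0;
    for (i = 0; i <= n; i++)
       sum += f(i);

    /* Our definition of f: calls sin roughly i times, so f(i) gets more expensive as i grows */
    double f(int i) {
       int j, start = i * (i + 1) / 2, finish = start + i;
       double return_val = 0.0;
       for (j = start; j <= finish; j++)
          return_val += sin(j);
       return return_val;
    }

    /* Cyclic (round-robin) assignment for the thread with rank my_rank */
    sum = 0.0;
    for (i = my_rank; i <= n; i += thread_count)
       sum += f(i);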
Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 39
Results (Why Scheduling is Important? )
• f(i) calls the sin function i times.
• Assume the time to execute f(2i) requires approximately twice as much time as the time to
execute f(i).

                         Serial        Default assignment    Cyclic assignment
  Run-time (seconds)     3.67          2.76                  1.84
  n                      10,000        10,000                10,000
  Threads                1             2                     2
  Speedup                (baseline)    1.33                  1.99

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 40


The Schedule Clause
• Default schedule:

• Cyclic schedule:

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 41


The Schedule Clause (cont.)
schedule(<type> [, <chunksize>])
• Type can be:
• static: the iterations can be assigned to the threads before the loop is executed.
• dynamic or guided: the iterations are assigned to the threads while the loop is executing.
• auto: the compiler and/or the run-time system determine the schedule.
• runtime: the schedule is determined at run-time.

• The chunksize is a positive integer (only static, dynamic, guided).

twelve iterations, 0, 1, . . . , 11, and three threads
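For example, with twelve iterations 0, 1, …, 11 and three threads, a static schedule assigns:

    schedule(static, 1):   Thread 0: 0, 3, 6, 9      Thread 1: 1, 4, 7, 10     Thread 2: 2, 5, 8, 11
    schedule(static, 2):   Thread 0: 0, 1, 6, 7      Thread 1: 2, 3, 8, 9      Thread 2: 4, 5, 10, 11
    schedule(static, 4):   Thread 0: 0, 1, 2, 3      Thread 1: 4, 5, 6, 7      Thread 2: 8, 9, 10, 11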


Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 42
The Dynamic Schedule Type
• The iterations are also broken up into chunks of chunksize consecutive iterations.
• Each thread executes a chunk, and when a thread finishes a chunk, it requests another one from
the run-time system.
• This continues until all the iterations are completed.
• The chunksize can be omitted. When it is omitted, a chunksize of 1 is used.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 43


The Guided Schedule Type
Trap. Rule; 9999 Iterations; guided schedule; two threads.
• Each thread also executes a chunk, and
when a thread finishes a chunk, it requests
another one.
• However, in a guided schedule, as chunks are completed, the size of the new chunks decreases; each new chunk is approximately the number of remaining iterations divided by the number of threads (so with two threads it roughly halves each time).
• If no chunksize is specified, the size of the
chunks decreases down to 1.
• If chunksize is specified, it decreases down
to chunksize, with the exception that the very
last chunk can be smaller than chunksize.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 44


The Runtime Schedule Type
• The system uses the environment variable OMP_SCHEDULE to determine at run-time how to
schedule the loop.
• The OMP_SCHEDULE environment variable can take on any of the values that can be used for a
static, dynamic, or guided schedule.

setenv OMP_SCHEDULE "guided,4"


setenv OMP_SCHEDULE "dynamic"
setenv OMP_SCHEDULE "nonmonotonic:dynamic,4"

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 45


Producers and Consumers

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 46


Queues
• Can be viewed as an abstraction of a line of customers waiting to pay for their groceries in a
supermarket.
• A natural data structure to use in many multithreaded applications.
• For example, suppose we have several “producer” threads and several “consumer” threads.
• Producer threads might “produce” requests for data.
• Consumer threads might “consume” the request by finding or generating the requested data.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 47


Message-Passing
• Each thread could have a shared message queue, and when one thread wants to “send a
message” to another thread, it could enqueue the message in the destination thread’s queue.
• A thread could receive a message by dequeuing the message at the head of its message queue.
Message Loop
(Each Thread)

Sending Message (Send_Msg)
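A sketch of the message loop and of Send_msg, assuming a shared array of message queues (msg_queues) and hypothetical Enqueue/Dequeue/Try_receive helpers; the enqueue must be protected because the sender and the queue's owner may touch the queue at the same time:

    /* Message loop executed by each thread (sketch) */
    for (sent_msgs = 0; sent_msgs < send_max; sent_msgs++) {
       Send_msg(msg_queues, my_rank, thread_count, sent_msgs);
       Try_receive(msg_queues[my_rank], my_rank);
    }

    /* Send_msg: enqueue a message in the destination thread's queue */
    void Send_msg(queue_t* msg_queues[], int my_rank, int thread_count, int mesg) {
       int dest = random() % thread_count;    /* pick some destination thread        */
       # pragma omp critical                  /* the queue is shared with its owner  */
       Enqueue(msg_queues[dest], my_rank, mesg);
    }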


Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 48
Receiving Messages and Termination Detections
Receiving Message (Try_Receive)

each thread increments this after


completing its for loop

Termination Detection (Done)
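A sketch of Try_receive and Done, using the same hypothetical queue type; it assumes the queue keeps running counts enqueued and dequeued of all messages ever enqueued and dequeued, and a shared counter done_sending that each thread increments after completing its for loop:

    /* Try_receive: the dequeue only needs protection when the queue might be
       emptied out from under us, i.e. when enqueued - dequeued <= 1          */
    void Try_receive(queue_t* q, int my_rank) {
       int src, mesg;
       int queue_size = q->enqueued - q->dequeued;
       if (queue_size == 0) return;              /* nothing to receive              */
       else if (queue_size == 1) {
          # pragma omp critical                  /* a sender may be enqueueing now  */
          Dequeue(q, &src, &mesg);
       } else
          Dequeue(q, &src, &mesg);               /* safe: queue cannot become empty */
       Print_message(my_rank, src, mesg);
    }

    /* Done: a thread may terminate when its queue is empty and every thread
       has finished sending                                                    */
    int Done(queue_t* q, int done_sending, int thread_count) {
       int queue_size = q->enqueued - q->dequeued;
       return (queue_size == 0 && done_sending == thread_count);
    }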

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 49


Messaging Startup
• When the program begins execution, a single thread, the master thread, will get command line
arguments and allocate an array of message queues: one for each thread.
• This array needs to be shared among the threads, since any thread can send to any other thread,
and hence any thread can enqueue a message in any of the queues.
• One or more threads may finish allocating their queues before some other threads.
• We need an explicit barrier so that when a thread encounters the barrier, it blocks until all the
threads in the team have reached the barrier.
• After all the threads have reached the barrier all the threads in the team can proceed.
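The explicit barrier directive, sketched (Allocate_queue is a hypothetical helper that builds one thread's queue):

    # pragma omp parallel num_threads(thread_count)
    {
       int my_rank = omp_get_thread_num();
       msg_queues[my_rank] = Allocate_queue();

       # pragma omp barrier   /* no thread proceeds until every queue has been allocated */

       /* send and receive messages */
    }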

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 50


The Atomic Directive

• Unlike the critical directive, it can only protect critical sections that consist of a single C assignment statement.
• Further, the statement must have one of the following forms:
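These forms (as in the OpenMP specification, with x a scalar lvalue and <expression> not referencing x) are:

    x <op>= <expression>;
    x++;
    ++x;
    x--;
    --x;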

• Here <op> can be one of the binary operators (+, *, -, /, &, ^, |, <<, >>)
• Many processors provide a special load-modify-store instruction.
• A critical section that only does a load-modify-store can be protected much more efficiently by using this
special instruction rather than the constructs that are used to protect more general critical sections.
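For example, a sketch of protecting a single update with atomic rather than critical:

    # pragma omp atomic
    x += f(y);        /* only the load-modify-store of x is protected;    */
                      /* the evaluation of f(y) itself is not serialized  */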

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 51


Critical Sections by Name

• OpenMP provides the option of adding a name to a critical directive:


• When we do this, two blocks protected with critical directives with different names can be
executed simultaneously.
• However, the names are set during compilation, and we want a different critical section for each
thread’s queue.
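For example (a sketch; Enqueue is the hypothetical helper from the message-passing example, and the names are fixed when the program is compiled):

    # pragma omp critical(queue_update)     /* protects updates to a queue            */
    Enqueue(q, my_rank, mesg);

    # pragma omp critical(sum_update)       /* a different name, so this block can    */
    global_result += my_result;             /* execute at the same time as the above  */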

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 52


Locks
• A lock consists of a data structure and functions that allow the programmer to explicitly enforce
mutual exclusion in a critical section.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 53


Using Locks in the Message-Passing Program
void omp_init_lock(omp_lock_t* lock_p);
void omp_set_lock(omp_lock_t* lock_p);
void omp_unset_lock(omp_lock_t* lock_p);
void omp_destroy_lock(omp_lock_t* lock_p);
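A sketch of using a simple lock to protect one message queue (storing a lock inside each queue structure is a design assumption, not part of the OpenMP API):

    omp_lock_t lock;             /* one lock per queue, e.g. stored in the queue struct */
    omp_init_lock(&lock);        /* initialize before any thread uses the queue         */

    omp_set_lock(&lock);         /* blocks until the lock is available                  */
    Enqueue(q, my_rank, mesg);   /* critical section for this particular queue          */
    omp_unset_lock(&lock);       /* release so other threads can enter                  */

    omp_destroy_lock(&lock);     /* after all threads are done with the queue           */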

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 54


Some Caveats for Locks
• You shouldn’t mix the different types of mutual exclusion for a single critical section.

• There is no guarantee of fairness in mutual exclusion constructs.


it’s possible that a thread can be blocked forever in waiting

• It can be dangerous to “nest” mutual exclusion constructs.

(for example, entering the same unnamed critical section again, or setting a lock the thread already holds, is a guaranteed deadlock)


Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 55
Matrix-vector multiplication

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 56


Matrix-Vector Multiplication (OpenMP)
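A sketch of the OpenMP version, assuming A is an m x n matrix stored as a one-dimensional array in row-major order, x a vector of length n, and y a vector of length m:

    # pragma omp parallel for num_threads(thread_count) \
          default(none) private(i, j) shared(A, x, y, m, n)
    for (i = 0; i < m; i++) {
       y[i] = 0.0;
       for (j = 0; j < n; j++)
          y[i] += A[i * n + j] * x[j];     /* each thread owns a block of rows */
    }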

Run-times and efficiencies of matrix-vector multiplication


(times are in seconds)

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 57


Thread-Safety

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 58


Repeated Slide

Thread-Safety
• A block of code is thread-safe if it can be simultaneously executed by multiple threads without
causing problems.
Example
• Suppose we want to use multiple threads to “tokenize” a file that consists of ordinary English text.
• The tokens are just contiguous sequences of characters separated from the rest of the text by
white-space — a space, a tab, or a newline.
• Divide the input file into lines of text and assign the lines to the threads in a round-robin fashion.
• The first line goes to thread 0, the second goes to thread 1, . . . , the tth goes to thread t − 1, the (t+1)st goes back to thread 0, etc.
Solution (Simple Approach)
• We can serialize access to the lines of input using semaphores.
• After a thread has read a single line of input, it can tokenize the line using the strtok function.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 59


Repeated Slide

The strtok function


• The first time it’s called the string argument should be the text to be tokenized. (The line of input)
• For subsequent calls, the first argument should be NULL.
• The idea is that in the first call, strtok caches a pointer to string, and for subsequent calls it returns
successive tokens taken from the cached copy.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 60


Multi-threaded Tokenizer

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 61


Repeated Slide

Running the Tokenizer

Using 1 Thread (Works Correctly)

The tokenizer prints every token of every input line, one per line:
  Pease, porridge, hot., Pease, porridge, in, the, pot, Pease, porridge, cold., Nine, days, old.

Using 2 Threads

  Oops! Some tokens are missing and lines get mixed up between the threads.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 62


Repeated Slide

What happened?
• strtok caches the input line by declaring a variable to have static storage class.
• This causes the value stored in this variable to persist from one call to the next.
• Unfortunately for us, this cached string is shared, not private.
• Thus, thread 0’s call to strtok with the third line of the input has apparently
overwritten the contents of thread 1’s call with the second line.
• So the strtok function is not thread-safe. If multiple threads call it
simultaneously, the output may not be correct.
• Thread-safety of routines in the C library
  The random number generator random in stdlib.h, the time conversion function localtime in time.h, and many others also keep state in static storage and are not thread-safe.
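For strtok specifically, a common fix is the re-entrant variant strtok_r (POSIX, declared in string.h), which keeps its state in a caller-supplied pointer instead of a shared static variable. A sketch of tokenizing one line:

    char *saveptr;                                   /* per-call state, private to this thread */
    char *token = strtok_r(line, " \t\n", &saveptr);
    while (token != NULL) {
       printf("Thread %d > %s\n", my_rank, token);
       token = strtok_r(NULL, " \t\n", &saveptr);
    }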
Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 63
Concluding Remarks
• OpenMP is a standard for programming shared-memory systems.
• It uses both special functions and preprocessor directives called pragmas.
• OpenMP programs start multiple threads rather than multiple processes.
• Many OpenMP directives can be modified by clauses.
• A major problem of shared-memory programs is the possibility of race conditions.
• OpenMP provides several mechanisms for ensuring mutual exclusion in critical sections: critical directives, named critical directives, atomic directives, and simple locks.
• By default, most systems use a block partitioning of the iterations in a parallelized for loop.
• OpenMP offers a variety of scheduling options.
• In OpenMP, the scope of a variable is the collection of threads to which the variable is accessible.
• A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in order to get a single result.

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 64


Questions and comments?

Parallel and Distributed Programming (Dr. M. Naveed Akhtar) 65
