Unit III

The document provides an overview of OpenMP, an API for writing portable, multithreaded applications, detailing its components, challenges in threading, and performance optimization techniques. It discusses issues like data races, deadlocks, and memory management, along with practical examples of loop threading and scheduling strategies. Additionally, it highlights the importance of proper synchronization and the use of OpenMP library functions to enhance parallel programming efficiency.


OpenMP PROGRAMMING

Syllabus Content
• OpenMP
• Threading a loop
• Thread overheads
• Performance issues
• Library functions
• Solutions to parallel programming problems
‒ Data races, deadlocks and live locks
‒ Non-blocking algorithms
‒ Memory and cache related issues
OpenMP (Open Multi-Processing)
• Formulated in 1997 as an API for writing portable, multithreaded applications.
• Began as a Fortran-based standard, but later grew to include C and C++.
• The current version is OpenMP Version 2.5, which supports FORTRAN, C, and C++.
• Intel C++ and Fortran compilers support the OpenMP Version 2.5 standard.
• Provides a platform-independent set of compiler pragmas, directives, function calls, and environment
variables that explicitly instruct the compiler how and where to use parallelism in the application.
• Many loops can be threaded by inserting only one pragma right before the loop, as demonstrated by
examples in this chapter.
• By leaving the nitty-gritty details to the compiler and OpenMP runtime library, you can spend more time
determining which loops should be threaded and how to best restructure the algorithms for performance on
multi-core processors.
• The full potential of OpenMP is realized when it is used to thread the most time-consuming loops, that is, the hot spots.
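For example, a minimal sketch of threading a hot-spot loop with a single pragma (the function, array names, and loop body are illustrative assumptions, not from the slides):

#include <omp.h>

void scale(double *a, const double *b, int n)
{
    int k;
    /* One pragma before the loop asks the compiler and OpenMP runtime
       to split the iterations across the available threads. */
    #pragma omp parallel for
    for ( k = 0; k < n; k++ )
        a[k] = 2.0 * b[k];
}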
OpenMP Components
• Compiler Directives and Clauses
• Runtime Libraries
• Environment Variables
Compiler Directives and Clauses
Runtime Libraries
Environment Variables
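A small sketch showing one example of each component (the thread count and printed output are illustrative assumptions):

#include <stdio.h>
#include <omp.h>                         /* header for the runtime library */

int main(void)
{
    /* Compiler directive with a clause: fork a team of four threads. */
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();  /* runtime library call */
        printf("Hello from thread %d\n", tid);
    }
    return 0;
}

/* Environment variable, set before running the program:
       export OMP_NUM_THREADS=4 */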
Challenges in Threading a Loop
• Loop-carried Dependence
• Data Race Condition
• Managing Shared and Private Data
• Loop Scheduling and Partitioning
1. Loop-carried Dependence
• True (flow) dependence ‒ RAW
• Anti-dependence ‒ WAR
• Output dependence ‒ WAW
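Minimal sketches of the three dependence types (the arrays, bounds, and constants are invented for illustration):

/* True (flow) dependence - RAW: iteration i reads a[i-1], written by iteration i-1. */
for ( i = 1; i < n; i++ )
    a[i] = a[i-1] + 1;

/* Anti-dependence - WAR: iteration i reads a[i+1] before iteration i+1 overwrites it. */
for ( i = 0; i < n-1; i++ )
    a[i] = a[i+1] + b[i];

/* Output dependence - WAW: every iteration writes the same location x. */
for ( i = 0; i < n; i++ )
    x = a[i] * 2.0;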
Example 1:
for( i=1; i<100; i++ ) {
a[i] = …;
…;
… = a[i-1];
}

Loop-carried dependence, not parallelizable


for (i=0; i<n; i++){
a[i] = b[i] + 1; //S1
c[i] = a[i] + 2; //S2
}

Unrolling the first three iterations shows that S2 only reads the a[i] written by S1 in the same iteration, so there is no loop-carried dependence and the loop can be parallelized:

i = 0:  S1: a[0] = b[0] + 1;   S2: c[0] = a[0] + 2;
i = 1:  S1: a[1] = b[1] + 1;   S2: c[1] = a[1] + 2;
i = 2:  S1: a[2] = b[2] + 1;   S2: c[2] = a[2] + 2;
Loop-carried Dependence
Example
x[0] = 0;
y[0] = 1;
#pragma omp parallel for private(k)
for ( k = 1; k < 100; k++ ) {
    x[k] = y[k-1] + 1;    // loop-carried dependence on y[k-1]
    y[k] = x[k-1] + 2;    // loop-carried dependence on x[k-1]
}

This loop has loop-carried dependences, so the pragma above produces wrong results. A workaround is to predetermine the initial values of x[49] and y[49] and split the iteration space in half.
Loop strip mining technique
x[0] = 0;
y[0] = 1;
x[49] = 74;   // derived from the recurrence x(k) = x(k-2) + 3
y[49] = 74;   // derived from the recurrence y(k) = y(k-2) + 3
#pragma omp parallel for private(m, k)
for ( m = 0; m < 2; m++ ) {
    for ( k = m*49 + 1; k < m*50 + 50; k++ ) {
        x[k] = y[k-1] + 1;   // S1
        y[k] = x[k-1] + 2;   // S2
    }
}
2. Data Race Condition
• Data-race conditions are due to
‒ output dependences, in which multiple threads attempt to update
the same memory location, or variable, after threading.
• OpenMP does not detect data-race conditions.
• The code must be modified via privatization or synchronized using mechanisms such as mutexes.
When multiple threads compete to update the same resource, it causes
inconsistent data (called a race condition).
Data Race Condition (cont.)
• Data-race conditions are difficult to spot.
• Using the full thread-synchronization tools in the Windows API or in Pthreads, developers are more likely to avoid these issues.
• Using OpenMP, it is easier to overlook data-race conditions.
• A helpful tool is the Intel Thread Checker, an add-on to the Intel VTune™ Performance Analyzer.
3. Managing Shared and Private Data
Understanding which data is shared and which is private becomes extremely important.
• OpenMP makes this distinction
‒ Shared - all threads access the exact same memory
location
‒ Private - a separate copy of the variable is made for
each thread to access in private
• By default, all variables in a parallel region are shared, with three exceptions:
‒ in parallel for loops, the loop index is private;
‒ variables that are local to the block of the parallel region are private;
‒ variables listed in the private, firstprivate, lastprivate, or reduction clauses are private.
• It is the developer's responsibility to indicate to the compiler which pieces of memory should be shared and which should be private.
Memory can be declared as private in the following three ways:
• Use the private, firstprivate, or lastprivate clause.
‒ A firstprivate(temp) clause copies the global value of temp into each thread's private (stack) copy on entry.
‒ A lastprivate(temp) clause copies the last loop (stack) value of temp back to the (global) temp storage when the parallel loop is complete.
• Use the threadprivate pragma to specify the global variables that need to be private for each thread.
• Declare the variable inside the loop, without the static keyword.
The following loop fails to function correctly because the variable x is shared; every thread reads and writes the same x:

#pragma omp parallel for
for ( k = 0; k < 100; k++ ) {
    x = array[k];
    array[k] = do_work(x);
}
Corrected version, where x is specified as private so each thread works on its own copy:

#pragma omp parallel for private(x)
for ( k = 0; k < 100; k++ ) {
    x = array[k];
    array[k] = do_work(x);
}
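A small sketch of firstprivate and lastprivate together (the variable name, initial value, and loop body are invented for illustration):

int temp = 10;                        /* shared initial value */
#pragma omp parallel for firstprivate(temp) lastprivate(temp)
for ( k = 0; k < 100; k++ ) {
    temp = temp + k;                  /* each thread starts from temp == 10 */
}
/* After the loop, the shared temp holds the value computed in the
   sequentially last iteration (k == 99), as required by lastprivate. */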
4. Loop Scheduling and Partitioning
• Good load balancing
‒ achieve optimal performance
‒ have effective loop scheduling and partitioning
‒ to ensure that the execution cores are kept busy most of the time,
‒ with minimum overhead from scheduling, context switching, and synchronization.
• With a poorly balanced workload, some threads may finish significantly earlier than others and then sit idle.
static-even scheduling.
• A parallel for or work-sharing for loop uses static-even scheduling by default.
• Each thread is given an equal number of iterations:
‒ 100 iterations and 10 threads: 100/10 = 10 iterations per thread
‒ m iterations and N threads: m/N iterations per thread
• Static-even scheduling helps minimize the chances of memory conflicts.
static-even scheduling.
#pragma omp parallel for
for ( k = 0; k < 1000; k++ ) do_work(k);
• Assume 1000 iteration and 2 threads
• iterations 0 to 499 on one thread and 500 to 999 on the other thread
• Loop-scheduling and partitioning information is conveyed to the compiler with the schedule clause:
#pragma omp for schedule(kind [, chunk-size])
#pragma omp for schedule(dynamic, 16)
The Four Schedule Schemes in OpenMP
• static (the default when no chunk size is given) ‒ partitions the loop iterations into equal-sized chunks, or as nearly equal as possible.
• dynamic ‒ uses an internal work queue to give a chunk-sized block of loop iterations to each thread as it becomes available.
• guided ‒ similar to dynamic scheduling, but the chunk size starts off large and shrinks as the work is consumed.
• runtime ‒ uses the OMP_SCHEDULE environment variable at run time to specify which one of the three loop-scheduling types should be used.
Dynamic Scheduling
• For dynamic scheduling:
‒ chunks are handed out on a first-come, first-served basis;
‒ the default chunk size is 1;
‒ each chunk contains the number of iterations specified in the schedule clause, except possibly the last chunk;
‒ the last set of iterations may be smaller than the chunk size.
• For example, with schedule(dynamic, 16) and a total of 100 iterations, the chunk sizes are 16, 16, 16, 16, 16, 16, 4.
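A minimal sketch of that schedule clause in use (the iteration count matches the example above; the loop body is illustrative):

#pragma omp parallel for schedule(dynamic, 16)
for ( k = 0; k < 100; k++ )
    do_work(k);      /* threads grab 16-iteration chunks from the work queue */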
Guided scheduling
• For guided scheduling, the size of each chunk is computed as
πk = βk / (2N)
where
‒ N is the number of threads,
‒ πk denotes the size of the k-th chunk,
‒ βk denotes the number of remaining unscheduled loop iterations.
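As a worked illustration (assuming a 100-iteration loop, N = 2 threads, and rounding up): the first chunk is 100/(2·2) = 25 iterations, leaving 75; the next chunk is 75/4 ≈ 19, leaving 56; then 56/4 = 14, and so on, so the chunk size shrinks as the remaining work shrinks.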
Runtime Scheduling Scheme
• Not really a separate scheduling scheme; it gives the end user the flexibility to select the scheduling type at run time through the OMP_SCHEDULE environment variable, for example:
export OMP_SCHEDULE=dynamic,16
Effective Use of Reductions
Example: a serial accumulation loop,

for ( k = 0; k < 100; k++ ) {
    sum = sum + func(k);
}

can be threaded with the reduction clause; each thread accumulates into its own private copy of sum, and OpenMP combines the private copies into the shared sum when the loop completes:

sum = 0;
#pragma omp parallel for reduction(+:sum)
for ( k = 0; k < 100; k++ ) {
    sum = sum + func(k);
}
Reduction Operators and Reduction Variable's Initial Value in OpenMP

Operator                       Initialization Value
+  (addition)                  0
-  (subtraction)               0
*  (multiplication)            1
&  (bitwise and)               ~0
|  (bitwise or)                0
^  (bitwise exclusive or)      0
&& (conditional and)           1
Minimizing Threading Overhead
• OpenMP uses a simple fork-join execution model; a compiler and runtime library with lower threading overhead can improve your application's performance.
Measured Cost of OpenMP Constructs and Clauses
Constructs            Cost (in microseconds)    Scalability
parallel              1.5                       Linear
barrier               1.0                       Linear or O(log(n))
schedule(static)      1.0                       Linear
schedule(guided)      6.0                       Depends on contention
schedule(dynamic)     50                        Depends on contention
ordered               0.5                       Depends on contention
single                1.0                       Depends on contention
reduction             2.5                       Linear or O(log(n))
atomic                0.5                       Depends on data type and hardware
critical              0.5                       Depends on contention
lock/unlock           0.5                       Depends on contention
To reduce this overhead, related independent work can be grouped into a single parallel region, for example with work-sharing sections.
Work-sharing Sections
• Directs the OpenMP compiler and runtime to distribute the identified sections among the threads in the team created for the parallel region.
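A minimal sketch of work-sharing sections (the three function names are invented for illustration):

#pragma omp parallel sections
{
    #pragma omp section
    phase1();     /* runs on one thread of the team */
    #pragma omp section
    phase2();     /* may run concurrently on another thread */
    #pragma omp section
    phase3();
}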
Performance-oriented Programming
• OpenMP provides
‒ a set of important pragmas, and
‒ runtime functions
that enable thread synchronization and related actions to facilitate correct parallel programming.
1. Using Barrier and Nowait
2. Interleaving Single-thread and Multi-thread Execution
3. Data Copy-in and Copy-out
4. Protecting Updates of Shared Variables
Using Barrier and Nowait
• Barriers are a synchronization mechanism: threads wait at a barrier until all the threads in the parallel region have reached the same point.
• The nowait clause removes the implicit barrier at the end of a work-sharing construct when that barrier is not needed.
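Minimal sketches of an explicit barrier and of nowait (the function names, arrays, and bounds are invented for illustration; the second loop does not depend on the first, so skipping the barrier is safe):

#pragma omp parallel
{
    phase_one();             /* every thread runs phase one */
    #pragma omp barrier      /* wait until all threads finish phase one */
    phase_two();             /* no thread starts phase two early */
}

#pragma omp parallel
{
    #pragma omp for nowait   /* skip the implicit barrier after this loop */
    for ( k = 0; k < n; k++ )
        a[k] = work_a(k);

    #pragma omp for          /* threads may arrive here without waiting */
    for ( k = 0; k < n; k++ )
        b[k] = work_b(k);
}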
Interleaving Single-thread and Multi-thread Execution
• A program may consist of both serial and parallel code segments.
• OpenMP provides a way to specify that a segment of code inside a parallel region be executed by a single thread instead of by every thread.
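A sketch of single-thread execution inside a parallel region using the single and master constructs (the function names are invented for illustration):

#pragma omp parallel
{
    do_parallel_work();       /* executed by every thread */

    #pragma omp single        /* executed by exactly one thread; */
    print_progress();         /* the others wait at the implicit barrier */

    #pragma omp master        /* executed only by the master thread; */
    update_globals();         /* no implied barrier for the other threads */

    do_more_parallel_work();
}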
Data Copy-in and Copy-out
• how to copy in the initial value of a private variable to initialize its private copy
• copy out the value of the private variable computed in the last iteration
• OpenMP standard provides four clauses
• firstprivate,
• lastprivate,
• copyin,
• copyprivate
• firstprivate provides a way to initialize the private copy of a variable with the value it had before the parallel region.
• lastprivate provides a way to copy out the value computed in the (sequentially) last iteration.
• copyin provides a way to copy the master thread's value of a threadprivate variable to the other team members.
• copyprivate may only be associated with the single construct; it provides a broadcast action, copying the value computed by the one executing thread to the private copies of the other threads so they can proceed sooner.
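A sketch of copyprivate on the single construct (the variable and function names are invented for illustration):

int config;
#pragma omp parallel private(config)
{
    #pragma omp single copyprivate(config)
    config = read_config_file();   /* one thread reads; the value is */
                                   /* broadcast to every thread's copy */
    use_config(config);            /* all threads now see the same value */
}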
Protecting Updates of Shared Variables

• Critical sections and the atomic pragma are used for avoiding data-race conditions when threads update shared variables.

The following update is unprotected; several threads may test and write max at the same time:

#pragma omp parallel
{
    if ( max < new_value )
        max = new_value;
}

With a named critical section, only one thread at a time performs the test and update:

#pragma omp parallel
{
    #pragma omp critical (max)
    {
        if ( max < new_value )
            max = new_value;
    }
}
Atomic pragma
• Directs the compiler to generate code to ensure that the specific memory storage
is updated atomically
• x++
• ++x
• x --
• --x
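A minimal sketch of the atomic pragma protecting one of these simple updates (the shared counter is invented for illustration):

#pragma omp parallel for
for ( k = 0; k < 100; k++ ) {
    #pragma omp atomic
    counter++;       /* the increment of the shared counter is performed atomically */
}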
Intel Taskqueuing Extension to OpenMP
• The taskqueuing extension to OpenMP allows a programmer to parallelize control structures such as
‒ recursive functions,
‒ dynamic-tree searches,
‒ pointer-chasing loops.
Taskqueuing Execution Model
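A sketch of the taskqueuing model for a pointer-chasing loop, based on the Intel-specific taskq pragmas (the list pointer p and do_work are assumptions for illustration, and the exact pragma spelling depends on the Intel compiler version):

/* One thread walks the list and enqueues tasks; the other threads in the
   team dequeue and execute them. */
#pragma intel omp parallel taskq shared(p)
{
    while ( p != NULL ) {
        #pragma intel omp task captureprivate(p)
        do_work(p);
        p = p->next;
    }
}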
OpenMP Library Functions
• OpenMP provides
‒ pragmas, which offer the highest degree of simplicity and portability and can easily be switched off;
‒ a set of function calls, which can be combined with conditional compilation in your programs;
‒ environment variables.
The Most Heavily Used OpenMP Library Functions
Most Commonly Used Environment Variables for OpenMP
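A sketch using some of the commonly listed library functions, guarded by conditional compilation (the requested thread count and printed strings are illustrative assumptions):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
    omp_set_num_threads(4);               /* request a team of 4 threads */
#endif
    #pragma omp parallel
    {
#ifdef _OPENMP
        printf("thread %d of %d\n",
               omp_get_thread_num(),      /* this thread's id */
               omp_get_num_threads());    /* size of the current team */
#else
        printf("compiled without OpenMP\n");
#endif
    }
    return 0;
}

/* Environment variables such as OMP_NUM_THREADS and OMP_SCHEDULE can
   supply or override these settings at run time. */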
Data Races, Deadlocks, and Live Locks
• Unsynchronized access to shared memory can introduce race conditions.
• The results then depend non-deterministically on the relative timing (interleaving) of the threads.
Example: two threads update the same shared variable x without synchronization.

Thread 1: t = x;        Thread 2: u = x;
          x = t + 1;              x = u + 2;

Depending on how the reads and writes interleave, x ends up larger by 1, by 2, or by 3, so the result is non-deterministic.
Tools
• Intel Thread Checker is a powerful tool for detecting potential race conditions.
• Race conditions can be masked by timing, which makes them hard to reproduce.
• Sometimes race conditions are intentional and useful, for example when a reader only needs the "latest current value" of a variable.
• Even accesses that are individually synchronized can combine into a higher-level data race, as in the following example.
if ( !list.contains(key) )     // another thread might insert key between the
    list.insert(key);          // contains() and insert() calls
Deadlock
• Example: Thread 1 acquires lock A and then requests lock B, while Thread 2 acquires lock B and then requests lock A; each waits forever for the lock the other holds.

Deadlock can occur only if all four of the following conditions hold:
1. Access to each resource is exclusive.
2. A thread is allowed to hold one resource while requesting
another.
3. No thread is willing to relinquish a resource that it has
acquired.
4. There is a cycle of threads trying to acquire resources, where
each resource is held by one thread and requested by another.
Deadlock can be avoided by breaking any one of these conditions
Ordering the locks
void AcquireTwoLocksViaOrdering( Lock& x, Lock& y ) {
    assert( &x != &y );
    if( &x < &y ) {        // always acquire the lock at the lower address first,
        acquire x          // so no cycle of waiting threads can form
        acquire y
    } else {
        acquire y
        acquire x
    }
}
