Unit III
Syllabus Content
• OpenMP
• Threading a loop
• Thread overheads
• Performance issues
• Library functions
• Solutions to parallel programming problems
‒ Data races, deadlocks and live locks
‒ Non-blocking algorithms
‒ Memory and cache related issues
OpenMP (Open Multi-Processing)
• Formulated in 1997 as an API for writing portable, multithreaded applications.
• Originally a Fortran-based standard, but it later grew to include C and C++.
• The current version is OpenMP Version 2.5, which supports FORTRAN, C, and C++.
• Intel C++ and Fortran compilers support the OpenMP Version 2.5 standard.
• Provides a platform-independent set of compiler pragmas, directives, function calls, and environment
variables that explicitly instruct the compiler how and where to use parallelism in the application.
• Many loops can be threaded by inserting only one pragma right before the loop, as demonstrated by
examples in this chapter.
• By leaving the nitty-gritty details to the compiler and OpenMP runtime library, you can spend more time
determining which loops should be threaded and how to best restructure the algorithms for performance on
multi-core processors.
• The full potential of OpenMP is realized when it is used to thread the most time-consuming loops, that is,
the hot spots.
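For instance, a loop with independent iterations can often be threaded with a single pragma. A minimal sketch (the function and array names here are illustrative, not from the original slides):

/* One pragma in front of the loop asks the compiler and OpenMP runtime
   to divide the iterations among the available threads. */
void scale(float *a, const float *b, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];
}

Compile with OpenMP support enabled (for example -fopenmp with gcc, or the equivalent Intel compiler switch).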
OpenMP Components
• Compiler Directives and Clauses
• Runtime Libraries
• Environment Variables
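A small sketch showing all three kinds of components together (the printed message is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Environment variable: OMP_NUM_THREADS (set before running)
       controls how many threads the runtime creates. */
    #pragma omp parallel              /* compiler directive */
    {
        /* Runtime library calls: query thread number and team size. */
        int id = omp_get_thread_num();
        int n  = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", id, n);
    }
    return 0;
}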
Challenges in Threading a Loop
• Loop-carried Dependence
• Data Race Condition
• Managing Shared and Private Data
• Loop Scheduling and Partitioning
1. Loop-carried Dependence
• True (flow) dependence ‒ RAW
• Anti-dependence ‒ WAR
• Output dependence ‒ WAW
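A small illustration of the three dependence types between statements (the variables are placeholders):

void dependence_kinds(int b, int c, int e, int f)
{
    int a, d;
    a = b + c;    // S1
    d = a * 2;    // S2: true (flow) dependence on S1  -- read-after-write (RAW) on a
    b = e + 1;    // S3: anti-dependence with S1       -- write-after-read (WAR) on b
    d = f - 3;    // S4: output dependence with S2     -- write-after-write (WAW) on d
    (void)a; (void)b; (void)d;
}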
Loop-carried Dependence
• Each iteration of the loop below reads values written by the previous iteration, so the iterations cannot simply be divided among threads:

x[0] = 0;
y[0] = 1;
for (k = 1; k < 100; k++) {
    x[k] = y[k-1] + 1;   // S1
    y[k] = x[k-1] + 2;   // S2
}

• Fix: predetermine the initial values of x[49] and y[49], then apply the loop strip-mining technique so that the two halves of the iteration space can run in parallel (see the sketch below).
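A sketch of the strip-mined version, assuming x, y, k, and m are declared as in the listing above. The midpoint values follow from the recurrences x[k] = x[k-2] + 3 and y[k] = y[k-2] + 3; the loop bounds shown are one possible way to split the iteration space:

x[0]  = 0;
y[0]  = 1;
x[49] = 74;    // follows from x[k] = x[k-2] + 3
y[49] = 74;    // follows from y[k] = y[k-2] + 3

#pragma omp parallel for private(m, k)
for (m = 0; m < 2; m++) {                      // two independent strips
    for (k = m * 49 + 1; k < m * 50 + 50; k++) {
        x[k] = y[k-1] + 1;   // S1
        y[k] = x[k-1] + 2;   // S2
    }
}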
2. Data Race Condition
• Data-race conditions are due to output dependences, in which multiple threads attempt to update the same memory location, or variable, after threading.
• OpenMP itself does not detect data-race conditions.
• The code must be modified via privatization, or synchronized using mechanisms such as mutexes.
When multiple threads compete to update the same resource, the result is inconsistent data ‒ a race condition (see the sketch below).
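A minimal sketch of a data race and two possible fixes (the variable names are illustrative):

void race_and_fixes(void)
{
    int total = 0;

    /* Data race: all threads update the shared variable total. */
    #pragma omp parallel for
    for (int k = 0; k < 100; k++)
        total += k;                       // unsynchronized update

    /* Fix 1: privatization via a reduction clause. */
    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < 100; k++)
        sum += k;

    /* Fix 2: synchronize the update (correct, but serializes it). */
    int sum2 = 0;
    #pragma omp parallel for
    for (int k = 0; k < 100; k++) {
        #pragma omp critical
        sum2 += k;
    }
    (void)total; (void)sum; (void)sum2;
}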
Data Race Condition (cont.)
• Data-race conditions are difficult to spot.
• Developers who use the full thread-synchronization tools of the Windows API or Pthreads are more likely to avoid these issues.
• Because OpenMP makes threading so easy, it is also easier to overlook data-race conditions.
• Tools such as the Intel Thread Checker, an add-on to the Intel VTune™ Performance Analyzer, can help detect them.
3. Managing Shared and Private Data
• When threading a loop, understanding which data is shared and which is private becomes extremely important.
• OpenMP makes this distinction
‒ Shared - all threads access the exact same memory
location
‒ Private - a separate copy of the variable is made for
each thread to access in private
• By default, all variables in a parallel region are shared, with the following exceptions:
‒ in parallel for loops, the loop index is private
‒ variables that are local to the block of the parallel region are private
‒ any variables listed in the private, firstprivate, lastprivate, or reduction clauses are private
• It is the developer’s responsibility to indicate to the compiler which pieces of memory should be shared and which should be private.
Memory can be declared as private in the following three ways:
• Use the private, firstprivate, lastprivate, or reduction clause to specify variables that need to be private for each thread.
‒ A lastprivate(temp) clause copies the value of temp from the final loop iteration (the thread’s stack copy) back to the global temp storage when the parallel loop is complete.
‒ A firstprivate(temp) clause copies the global temp value into each thread’s private (stack) copy.
• Use the threadprivate pragma to specify the global variables that need to be private for each thread.
• Declare the variable inside the loop without the static keyword.
The following loop fails to function correctly because the variable x is shared.
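The original listing is not reproduced in these notes; a minimal sketch of that kind of loop, assuming x holds a per-iteration temporary (data, results, and work() are illustrative names), together with the fix using a private clause:

int x;                                   // shared by default

#pragma omp parallel for
for (int k = 0; k < 100; k++) {
    x = data[k] + 1;                     // every thread writes the same x
    results[k] = work(x);                // may read another thread's value
}

/* Fix: give each thread its own copy of x. */
#pragma omp parallel for private(x)
for (int k = 0; k < 100; k++) {
    x = data[k] + 1;
    results[k] = work(x);
}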
4. Loop Scheduling and Partitioning
The schedule clause supports the following loop-scheduling types:
• static (default with no chunk size) ‒ Partitions the loop iterations into equal-sized chunks, or as nearly equal as possible.
• dynamic ‒ Uses an internal work queue to give a chunk-sized block of loop iterations to each thread as it becomes available.
• guided ‒ Similar to dynamic scheduling, but the chunk size starts off large and shrinks as the work proceeds.
• runtime ‒ Uses the OMP_SCHEDULE environment variable at runtime to specify which one of the three loop-scheduling types should be used.
Dynamic Scheduling
• For dynamic scheduling:
‒ chunks are handed out on a first-come, first-served basis,
‒ the default chunk size is 1,
‒ each chunk a thread receives is equal to the chunk size specified in the schedule clause, except possibly the last chunk,
‒ the last set of iterations may be smaller than the chunk size.
• Example (see the sketch below): schedule(dynamic, 16) with a total of 100 iterations produces chunks of 16, 16, 16, 16, 16, 16, and 4.
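A sketch of that example; process() is a placeholder for the real loop body:

#pragma omp parallel for schedule(dynamic, 16)
for (int k = 0; k < 100; k++) {
    process(k);    // 100 iterations handed out in chunks of 16; the last chunk has 4
}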
Guided Scheduling
• For guided scheduling, the size of each chunk is computed from the remaining work:

  πk = βk / (2N)

  where
  ‒ N is the number of threads,
  ‒ πk denotes the size of the k-th chunk,
  ‒ βk denotes the number of remaining unscheduled loop iterations.
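A rough worked example, assuming N = 2 threads and 100 iterations remaining (exact chunk sizes depend on how the runtime rounds the quotient and on any minimum chunk size):
• first chunk: π1 ≈ 100 / (2·2) = 25, leaving β2 = 75 iterations
• second chunk: π2 ≈ 75 / 4 ≈ 19, leaving about 56 iterations
• and so on ‒ the chunks shrink as the loop proceeds, which improves load balance near the end.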
Runtime Scheduling Scheme
• Not a scheduling scheme in itself.
• Gives the end user some flexibility in selecting the type of scheduling dynamically, via the OMP_SCHEDULE environment variable, for example:
  export OMP_SCHEDULE=dynamic,16
Example
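The original example listing is not reproduced in these notes; a minimal sketch of runtime scheduling, with process() standing in for the real loop body:

/* The schedule type and chunk size are read from OMP_SCHEDULE at run time,
   e.g. after: export OMP_SCHEDULE=dynamic,16 */
#pragma omp parallel for schedule(runtime)
for (int k = 0; k < 100; k++) {
    process(k);
}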
Effective Use of Reductions
Serial loop:

sum = 0;
for (k = 0; k < 100; k++) {
    sum = sum + func(k);
}

Threaded with a reduction clause ‒ each thread accumulates into its own private copy of sum, and the copies are combined when the loop completes:

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < 100; k++) {
    sum = sum + func(k);
}
Reduction Operators and Reduction Variable’s Initial Value in
OpenMP
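The original table is not reproduced in these notes; for reference, the C/C++ reduction operators and the initial value OpenMP gives each thread’s private copy are (standard OpenMP behavior, not taken from the slides):
• +  → 0
• *  → 1
• -  → 0
• &  → ~0 (all bits set)
• |  → 0
• ^  → 0
• && → 1
• || → 0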