Unit 5: OpenMP - Summary
The most significant challenge in parallelizing loops is handling data dependencies. A data dependency,
or loop-carried dependency, exists when an iteration of a loop relies on the result of a previous iteration.
Forcing such a loop to execute in parallel will lead to incorrect results, as the order of execution is no
longer guaranteed.
C
for (int i = 1; i < n; i++) {
data[i] = data[i-1] + input[i];
}
In this case, each iteration depends directly on the value computed in the i-1 iteration. If multiple threads
execute different iterations simultaneously, they will be reading data[i-1] before it has been correctly updated
by the thread handling the preceding iteration, leading to a race condition.
Solution: OpenMP itself doesn't automatically resolve loop-carried dependencies. The programmer must
refactor the algorithm to remove the dependency. In some cases, this is not possible, and the loop cannot
be parallelized with a simple omp for directive. For specific patterns like reductions (see below), OpenMP
provides dedicated clauses.
Even without explicit loop-carried dependencies, incorrect management of shared data can lead to race
conditions. This occurs when multiple threads attempt to read and write to the same memory location
without proper synchronization, resulting in unpredictable outcomes.
Challenges:
• Shared vs. Private Variables: Deciding which variables should be accessible by all threads
(shared) and which should have a unique copy for each thread (private) is crucial. A common
mistake is to implicitly share a temporary variable used within the loop, leading to data corruption.
• Reductions: A frequent pattern in loops is the accumulation of a value, such as summing up all
elements of an array. A naive parallelization would cause multiple threads to update the shared
sum variable concurrently, leading to lost updates.
Solutions:
• Data Scoping Clauses: OpenMP provides clauses to control the data environment:
o private(list): Each thread gets its own private copy of the variables in the list.
o firstprivate(list): Similar to private, but the private copy is initialized with the value of the master
thread's variable before the parallel region.
o lastprivate(list):
The value of the private variable from the sequentially last iteration is copied
back to the master thread's variable after the loop.
o shared(list): Variables in the list are shared among all threads. This is often the default.
• reduction Clause: For operations like summation, multiplication, or logical operations, the reduction
clause is the ideal solution. It creates a private copy of the reduction variable for each thread,
performs the operation locally, and then combines the results from all threads into the final shared
variable after the loop.
Not all loop iterations take the same amount of time to execute. If the workload is unevenly distributed,
some threads may finish their assigned iterations long before others, leading to idle cores and inefficient
parallelization. This is known as load imbalance.
Challenges:
• Static Work Distribution: By default, OpenMP often divides the loop iterations statically among
threads. If the workload is concentrated in the early or late iterations, this static division is
suboptimal.
Solutions:
• The schedule Clause: OpenMP provides the schedule clause to control how loop iterations are
distributed among threads:
o schedule(static, chunk_size): Divides iterations into fixed-size chunks and assigns them to
threads in a round-robin fashion. This has low overhead but can be inefficient for imbalanced
loads.
o schedule(dynamic, chunk_size): Threads request a chunk of iterations as they become available.
This provides better load balancing but has higher overhead due to the dynamic assignment.
o schedule(guided, chunk_size): Starts with large chunks and progressively reduces the chunk
size. This is a compromise between static and dynamic scheduling, often providing good
performance.
o schedule(auto): The compiler and runtime system choose the most appropriate schedule.
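For instance, a loop whose iterations vary widely in cost can be scheduled dynamically. The sketch below is only illustrative; process_row() and n are placeholders, not names from this unit.
C
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < n; i++) {
    process_row(i); // an idle thread grabs the next chunk of 4 iterations as soon as it finishes
}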
Even when data is correctly declared as private, performance can be unexpectedly poor due to a
hardware-level issue called false sharing. This occurs when private variables of different threads happen
to reside on the same cache line. A cache line is the smallest unit of memory that can be transferred
between the main memory and the CPU cache.
When one thread writes to its private variable, the entire cache line is invalidated for all other threads that
have a copy of it, even if they are accessing different variables within that same cache line. This forces
the other threads to fetch the updated cache line from a higher level of memory, a much slower operation.
Solution: The primary way to mitigate false sharing is to ensure that private variables used by different
threads are located on different cache lines. This can be achieved through padding, by adding unused
variables to increase the memory separation between the critical variables. Some compilers and libraries
may also provide alignment directives.
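As an illustration, one common padding pattern gives each thread its own cache-line-sized slot in an array of per-thread counters. This is only a sketch: the 64-byte line size and the cap of 8 threads are assumptions, not values taken from this unit.
C
#include <omp.h>
// Each per-thread counter is padded out to an assumed 64-byte cache line.
typedef struct {
    long long value;
    char pad[64 - sizeof(long long)]; // unused buffer space separating the counters
} padded_counter_t;

int main() {
    padded_counter_t counts[8] = {{0}}; // assumes at most 8 threads
    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();
        for (int i = 0; i < 1000000; i++)
            counts[tid].value += 1; // each thread now writes to its own cache line
    }
    return 0;
}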
Overhead: The Cost of Parallelism
Creating and managing threads is not free. There is an inherent overhead associated with starting a
parallel region, synchronizing threads, and distributing work. For loops with very little work per iteration,
the overhead of parallelization can outweigh the benefits, leading to a slowdown compared to the
sequential version.
Solution: As a rule of thumb, only parallelize loops that have a significant amount of work. For nested
loops, it is generally more efficient to parallelize the outer loop, as this minimizes the number of times a
parallel region is entered and exited. The collapse clause can be used to parallelize multiple nested loops
together, increasing the total number of iterations to be distributed.
A loop-carried dependence occurs when an iteration of a loop depends on the result of a previous iteration.
If iteration i must wait for iteration i-1, then you cannot parallelize these iterations without introducing incorrect
behavior or data races.
Example:
int A[100];
for (int i = 1; i < 100; i++) {
A[i] = A[i-1] + 1; // i-th value depends on (i-1)-th value
}
In contrast, the following loop has no loop-carried dependence:
// This is parallelizable
int A[100];
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
A[i] = i * 2;
}
Here, each A[i] is independent. So OpenMP can safely divide iterations across threads.
Sometimes, all threads contribute to a single shared variable, such as sum or max.
OpenMP Solution:
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 100; i++) {
sum += i;
}
OpenMP handles this with the reduction clause: it creates a private copy of sum for each thread, lets each thread accumulate locally, and safely combines the partial sums into the shared sum after the loop.
Correct use of shared and private clauses is crucial to avoid unexpected side effects.
Common mistake:
int temp;
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
temp = i * 2; // All threads write to same 'temp'!
// ... do something with temp ...
}
Fix:
int temp;
#pragma omp parallel for private(temp)
for (int i = 0; i < 100; i++) {
temp = i * 2; // Each thread now has its own private 'temp'
}
(Equivalently, declare temp inside the loop body; variables declared inside the parallelized loop are automatically private to each thread.)
5. Load Imbalance
What it is:
Some loop iterations may take longer than others, causing some threads to finish earlier and sit idle.
Problem:
With the default static division of iterations, a thread stuck with the expensive iterations finishes late while the other threads sit idle. The schedule(dynamic) or schedule(guided) clause balances the load.
6. Complex Loop Control
#pragma omp parallel for requires a canonical loop form: the loop index must change by a fixed amount each iteration.
Supported:
for (int i = 0; i < N; i++) // OK
for (int i = N; i > 0; i--) // OK
Not supported:
for (int i = 0; i < N; i *= 2) // BAD
Such complex loop control expressions cannot be parallelized using #pragma omp parallel for.
7. Nested Loops
Problem:
A plain #pragma omp parallel for parallelizes only the outer loop; the inner loop runs sequentially inside each thread, and the outer loop alone may not provide enough iterations to keep all threads busy.
Solution:
#pragma omp parallel for collapse(2)
for (int i = 0; i < 10; i++) {
for (int j = 0; j < 10; j++) {
A[i][j] = i + j;
}
}
8. False Sharing
What it is:
Occurs when threads update variables close in memory, causing cache invalidation.
Example:
int A[4];
#pragma omp parallel for
for (int i = 0; i < 4; i++) {
A[i] = i; // May reside on the same cache line
}
Fix:
Pad the array, or let each thread accumulate into a local variable, so that elements updated by different threads fall on different cache lines.
Challenge | Cause | Solution
Loop-Carried Dependency | Iteration i depends on i-1 | Refactor loop, prefix sum
Reduction | Multiple threads updating same variable | Use reduction(+:sum)
Shared Variables | Unintended variable sharing | Use private(var)
Load Imbalance | Uneven iteration times | Use schedule(dynamic)
Complex Loops | Non-linear index update | Refactor loop
Nested Loops | Only outer loop parallelized | Use collapse(n)
False Sharing | Cache conflicts | Add padding or use local vars
#define SIZE 10
int main() {
int A[SIZE], sum = 0;
return 0;
}
Line-by-Line Explanation
Header and Setup
#include <stdio.h>
#include <omp.h>
#define SIZE 10
int temp;
#pragma omp parallel for
for (int i = 0; i < SIZE; i++) {
temp = i * 2;
printf("Thread %d: temp = %d\n", omp_get_thread_num(), temp);
}
• All threads write to the same memory address, causing a race condition
• Output will be interleaved and inconsistent
A loop-carried dependence exists when an iteration of a loop relies on the outcome of a previous iteration.
The "dependence" is "carried" from one iteration to another, imposing a strict sequential order of execution.
OpenMP's parallel for construct works by breaking a loop's iterations into chunks and assigning them to
different threads to be executed simultaneously. If a dependence exists, this simultaneous execution
violates the required sequential order, causing threads to read variables before they are correctly updated
or to overwrite values needed by other threads.
Let's break down the different types of loop-carried dependencies with clear examples.
1. Flow Dependence (True Dependence)
This is the most common and intuitive type of dependence. It occurs when a statement in a later iteration
reads a value written by a statement in an earlier iteration.
C
// data_in = {3, 5, 2, 8, ...}
int data_out[N];
data_out[0] = data_in[0];
for (int i = 1; i < N; i++) {
data_out[i] = data_out[i-1] + data_in[i]; // READS the value WRITTEN in iteration i-1
}
A reduction is a specific type of flow dependence where a single variable accumulates a result.
C
double sum = 0.0;
for (int i = 0; i < N; i++) {
sum = sum + array[i]; // READS and then WRITES to 'sum'
}
Here, each iteration reads the value of sum from the previous iteration, adds to it, and writes it back.
Parallelizing this naively would cause multiple threads to read and write to sum simultaneously, overwriting
each other's work.
OpenMP Solution for Reductions: This pattern is so common that OpenMP provides a specific clause
to handle it safely: reduction.
C
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum = sum + array[i];
}
The reduction(+:sum) clause tells OpenMP to create a private, thread-local copy of sum for each thread
(initialized to 0). Each thread calculates its partial sum. At the end of the loop, OpenMP automatically and
safely combines all the partial sums into the final global sum variable.
2. Anti-Dependence
An anti-dependence occurs when a statement in a later iteration writes to a memory location that is read
by a statement in an earlier iteration. The order is critical: the read must happen before the write.
Example:
C
for (int i = 0; i < N - 1; i++) {
// Iteration 'i' READS a[i+1]
a[i] = b[i] + a[i+1];
// Iteration 'i+1' will WRITE to a[i+1]
}
• Dependence: Iteration i needs to read the original value of a[i+1] before iteration i+1 overwrites it.
• Why OpenMP Fails: Imagine Thread 0 gets i = 0 and Thread 1 gets i = 1.
o Thread 0 needs to read the original a[1].
o Thread 1 needs to calculate and write a new value to a[1].
o If Thread 1 executes first, it will overwrite a[1], and Thread 0 will then read the new, incorrect
value, violating the program's logic.
Solution: Often, anti-dependencies can be resolved by restructuring the code. In this case, creating a
temporary copy of the array a could break the dependence, though at the cost of memory.
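A minimal sketch of that restructuring, assuming a and b are arrays of double of length N and that a temporary copy is affordable:
C
#include <string.h>
double a_old[N];
memcpy(a_old, a, sizeof(a_old)); // snapshot the original values of 'a'

#pragma omp parallel for
for (int i = 0; i < N - 1; i++) {
    a[i] = b[i] + a_old[i + 1]; // every read now comes from the unchanging copy
}
The anti-dependence disappears because no iteration writes to memory that another iteration still needs to read.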
3. Output Dependence
An output dependence occurs when two different iterations write to the same memory location. The order
of the writes is crucial for the final state of the variable.
Example:
C
for (int i = 0; i < N; i++) {
// Every iteration writes to the same variable 'x'
x = compute_something(i);
a[i] = x * 2;
}
• Dependence: The final value of x after the loop should be the one written by the very last iteration
(i = N-1).
• Why OpenMP Fails: With parallel execution, threads handling different iterations will be writing to
x in an unpredictable order. The final value of x after the parallel loop is indeterminate—it could be
the result from any of the threads, not necessarily the last one.
Solution: The use of the shared variable x is the problem. This can be resolved by making x local to the
loop's scope.
C
// Corrected Version
#pragma omp parallel for
for (int i = 0; i < N; i++) {
double x = compute_something(i); // 'x' is now private to each iteration
a[i] = x * 2;
}
By declaring x inside the loop, each thread gets its own private copy. OpenMP automatically handles this
for variables declared within the lexical scope of the loop. If x were declared outside the loop, you would
need to explicitly mark it with #pragma omp parallel for private(x).
How to Handle Loop-Carried Dependencies
Since OpenMP's parallel for directive does not automatically resolve these dependencies (except for the
reduction pattern), the responsibility falls to the programmer:
1. Identify the Dependence: Before adding any #pragma omp, you must analyze the loop to see if any
iteration depends on another. Ask yourself: "Does iteration i use a value calculated in iteration i-1?"
or "Do multiple iterations write to the same location where the final value matters?"
2. Choose a Strategy:
o Do Not Parallelize: If a true dependence is fundamental to the algorithm (like the prefix
sum), the loop cannot be parallelized as-is.
o Use reduction: If the dependence is a reduction pattern, use the reduction clause. This is the
simplest and most common solution for its specific pattern.
o Restructure the Algorithm: The most powerful technique is to rethink and rewrite the
algorithm to remove the dependence entirely. This might involve using temporary arrays,
changing the order of calculations, or adopting a completely different parallel-friendly
algorithm (e.g., a parallel prefix sum algorithm instead of a sequential one).
o Use Explicit Synchronization: In rare and complex cases, you might use #pragma omp critical
or #pragma omp atomic inside a loop. However, this introduces significant overhead and can
serialize the execution, defeating the purpose of parallelism. It is generally a last resort and
an indicator that a parallel for might not be the right tool for the job.
Key Feature:
• Iteration i reads or writes data that iteration i-1 has written or read.
• This creates a sequential dependency, blocking parallel execution.
If each iteration depends on the previous one’s result, you cannot safely distribute those iterations across multiple
threads.
Workaround Techniques
1. Transform the algorithm
Summary Table
Feature | Explanation
Loop-Carried Dependence | When an iteration depends on a previous one
Problem | Prevents safe parallelization in OpenMP
Example | A[i] = A[i - 1] + 1
Solution | Redesign algorithm (prefix sum, recursion, or pipeline)
Parallel-Safe? | No, if a true data dependency exists
A[0] = 1;
for (int i = 1; i < N; i++) {
A[i] = A[i-1] + input[i];
}
#define N 8
int main() {
int input[N] = {1, 2, 3, 4, 5, 6, 7, 8};
int prefix[N];
int i, j;
return 0;
}
prefix[i] += input[i];
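One possible way to "transform the algorithm" for the prefix sum above is a block-wise, two-pass scan: each thread scans its own block, the per-block totals are combined, and each block is then shifted by the total of everything before it. The sketch below is illustrative only; the function name prefix_sum and the 256-thread cap on the offsets array are assumptions.
C
#include <omp.h>

void prefix_sum(const int *input, int *prefix, int n) {
    int nthreads = 1;
    int offsets[256 + 1] = {0}; // offsets[t+1] holds the total of thread t's block

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();

        int chunk = (n + nthreads - 1) / nthreads;
        int lo = tid * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;

        // Pass 1: each thread computes a local prefix sum over its own block.
        int running = 0;
        for (int i = lo; i < hi; i++) {
            running += input[i];
            prefix[i] = running;
        }
        offsets[tid + 1] = running;

        #pragma omp barrier
        // One thread accumulates the per-block totals (cheap: one value per thread).
        #pragma omp single
        for (int t = 1; t <= nthreads; t++)
            offsets[t] += offsets[t - 1];

        // Pass 2: shift each block by the sum of all preceding blocks.
        for (int i = lo; i < hi; i++)
            prefix[i] += offsets[tid];
    }
}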
A data race occurs when three conditions are met:
1. Shared Data: A variable or memory location is accessible by multiple threads.
2. Concurrent Access: Multiple threads try to read from or write to this shared data at the same time.
3. At Least One Write: At a minimum, one of the threads is attempting to modify the data. If all threads
were only reading, there would be no data race.
The outcome is that the program's result can change each time it is run, depending on which thread "wins"
the race. This non-deterministic behavior makes data races notoriously difficult to debug.
Consider a simple loop designed to count the number of elements in an array that meet a certain condition.
C
int count = 0;
for (int i = 0; i < 100000; i++) {
if (array[i] > some_value) {
count++; // This is the critical operation
}
}
// 'count' will hold the correct total
Now, let's try to parallelize this with OpenMP in a naive way:
C
int count = 0;
#pragma omp parallel for
for (int i = 0; i < 100000; i++) {
if (array[i] > some_value) {
count++; // DATA RACE!
}
}
// 'count' will almost certainly be wrong!
Why does this fail?
By default, the count variable declared outside the parallel region is shared among all threads. The operation
count++ is not a single, instantaneous machine instruction. It typically involves three separate steps:
1. Read: Read the current value of count from memory into a CPU register.
2. Increment: Add one to the value in the register.
3. Write: Write the new value from the register back to count in memory.
Now, imagine two threads, Thread A and Thread B, executing this concurrently: both read the same value of count (say, 10), both increment it in their registers to 11, and both write 11 back. Two increments were performed, but count only grew by one; an update has been lost.
OpenMP provides a suite of solutions to prevent data races, centered around two main strategies:
avoiding sharing and controlling access.
The best way to prevent a data race is to eliminate shared data altogether. If each thread has its own
private copy of a variable, there is no possibility of conflict.
• private(list):
This is the most fundamental clause. It declares that each thread should have its own
uninitialized, private copy of the variables in the list.
Sometimes, data must be shared. The classic counter example needs to update a single, final total. In
these cases, OpenMP provides mechanisms to ensure that only one thread can access the shared data
at a time.
• reduction(operator:list):
This is the preferred solution for the common data race pattern involving
accumulation (sum, product, max, min, etc.). It elegantly solves the problem by creating private
copies for the operation and then safely combining them at the end.
int count = 0;
// OpenMP creates a private 'count' for each thread, initialized to 0.
// At the end, it safely adds all private counts to the global 'count'.
#pragma omp parallel for reduction(+:count)
for (int i = 0; i < 100000; i++) {
if (array[i] > some_value) {
count++; // Each thread increments its own private copy
}
}
// 'count' is now correct!
• critical: This directive defines a block of code (a "critical section") that can only be executed by one
thread at a time. It's a more general, but also more heavyweight, solution than reduction.
int count = 0;
#pragma omp parallel for
for (int i = 0; i < 100000; i++) {
if (array[i] > some_value) {
#pragma omp critical
{
count++;
}
}
}
// This is correct, but slower than reduction due to higher overhead.
• atomic: This directive applies only to the single C/C++ statement that immediately follows it. It tells
OpenMP to ensure this specific update to a memory location happens atomically (as an
uninterruptible operation). It is more lightweight than critical but is limited to simple update statements
(like x++, x--, x += val).
int count = 0;
#pragma omp parallel for
for (int i = 0; i < 100000; i++) {
if (array[i] > some_value) {
#pragma omp atomic
count++;
}
}
// Correct and generally more efficient than a critical section for this use case.
Summary: A Hierarchy of Solutions
When faced with potential data races in OpenMP, follow this general thought process:
1. Identify Shared Data: First, determine which variables are shared among threads.
2. Privatize When Possible: Can the variable be made private? If it's a temporary variable used only
within a loop iteration, declare it inside the loop or use the private clause. This is the best solution as
it involves no synchronization overhead.
3. Use reduction for Accumulators: If the operation is a sum, product, or other reduction, use the
reduction clause. It is optimized for this exact pattern and is far more efficient than manual locking.
4. Use atomic for Simple Updates: If you need to protect a simple, single update to shared memory
that isn't a reduction, atomic is the next best choice due to its lower overhead.
5. Use critical for Complex Updates: If you need to protect a larger, more complex block of code
involving multiple statements that must be executed together without interruption, use a critical
section. Use this sparingly, as it can significantly reduce parallelism.
int main() {
int sum = 0;
Explanation:
The final sum should be 10 + 5 + 7 = 22, but because the threads' updates to the shared sum overlap, one addition is lost and the result is only 17.
Problem:
Without synchronization, the read-increment-write sequences of different threads interleave, so some updates are silently overwritten.
}
• Slower than reduction, but good for complex code inside the block
Summary Table
Aspect | Explanation
What is a Data Race? | Uncoordinated access to shared memory
Cause | Multiple threads writing to the same variable
Symptoms | Inconsistent outputs, crashes
Detection | Use compiler flags (-fopenmp -fsanitize=thread)
Solutions | reduction, private, critical, atomic
When you initiate a parallel region in OpenMP (e.g., with #pragma omp parallel), you create a team of threads.
These threads exist in a shared memory space, but they each have their own execution stack. This leads
to two fundamental data scopes:
• Shared Data: A single instance of a variable exists in memory, and all threads can read from and
write to it. This is powerful for collaboration but is the main source of data races if not handled
carefully. Any variable declared outside the parallel region is, by default, shared.
• Private Data: Each thread gets its very own copy of the variable. One thread's private copy is
completely invisible and inaccessible to another thread. This is the safest way to work with
temporary or loop-specific variables. Variables declared inside the parallel region (or inside a loop
that is parallelized) are, by default, private.
OpenMP allows you to define the default behavior for variables that you don't explicitly scope. The default
clause on a parallel directive can take one of three main arguments:
• default(shared): This is the most common default behavior in many OpenMP implementations. Any
variable not explicitly marked otherwise will be shared among all threads.
• default(private): Historically not a valid option for C/C++ (it exists in Fortran, and only recent OpenMP
revisions add it for C/C++), so you cannot generally rely on making all variables private by default.
• default(none): This is the recommended practice for serious OpenMP development. It forces the
programmer to explicitly specify the data-sharing attribute for every single variable used within the
parallel region that is declared outside of it. While it requires more typing, it eliminates ambiguity
and forces you to think about data scope, preventing countless bugs.
C
int a, b, c;
#pragma omp parallel for default(none) shared(a) private(b, c)
for (int i = 0; i < N; i++) {
b = a + i; // OK: 'b' is private, 'a' is shared (read-only is safe)
c = b * 2; // OK: 'c' is private
// ...
}
If you forgot to scope a, b, or c, the compiler would throw an error, forcing you to fix it.
OpenMP provides a powerful set of clauses to precisely control the scope of each variable. These are
typically used with directives like omp parallel or omp for.
1. shared(list)
This clause explicitly declares that the variables in the list are to be shared among all threads. Since this
is often the default, it's most useful when you've set default(none).
• Use Case: When multiple threads need to access or update the same piece of data, such as a
large input array that is only being read, or a result array where each thread works on a different
section.
• Warning: Any shared variable that is written to concurrently by multiple threads must be protected
by synchronization (atomic, critical, reduction, or locks) to prevent data races.
2. private(list)
This is the most crucial clause for preventing data races with temporary variables. It declares that each
thread should get its own private, uninitialized copy of the variables in the list.
• Use Case: For temporary variables or counters used only within a single loop iteration.
• Key Point: The private copy is uninitialized. Its value is undefined at the start of the parallel region.
Likewise, any value assigned to the private copy inside the parallel region is discarded when the
thread finishes; it does not affect the original variable outside the region.
Example:
C
int temp = 5;
#pragma omp parallel for private(temp)
for (int i = 0; i < 4; i++) {
// 'temp' inside this loop is a new, uninitialized variable for each thread.
// It is NOT 5.
temp = i * 10;
printf("Thread %d, temp = %d\n", omp_get_thread_num(), temp);
}
// After the loop, the original 'temp' is still 5.
printf("Outside loop, temp = %d\n", temp);
3. firstprivate(list)
This clause behaves exactly like private but with one key difference: the private copy for each thread is
initialized with the value of the original variable from the master thread before the parallel region begins.
• Use Case: When threads need a starting value for a calculation that is based on a value computed
before the parallel section.
Example:
C
int start_value = 100;
#pragma omp parallel for firstprivate(start_value)
for (int i = 0; i < 4; i++) {
// Each thread's 'start_value' begins at 100.
printf("Thread %d, start_value = %d\n", omp_get_thread_num(), start_value);
start_value += i; // This only modifies the private copy
}
// After the loop, the original 'start_value' is still 100.
printf("Outside loop, start_value = %d\n", start_value);
4. lastprivate(list)
This clause is the counterpart to firstprivate. It creates an uninitialized private copy for each thread, but after
the parallel region, it copies the value from the thread that executed the sequentially last iteration back to
the original variable.
• Use Case: When you need to retrieve the result of a calculation that happened in the final iteration
of a loop.
Example:
C
int final_val;
#pragma omp parallel for lastprivate(final_val)
for (int i = 0; i < 1000; i++) {
final_val = i * i; // Each thread has its own 'final_val'
}
// After the loop, the 'final_val' from the thread that ran i=999
// is copied to the original 'final_val'.
// So, final_val will be 999 * 999 = 998001.
5. reduction(operator:list)
The reduction clause is a specialized, highly optimized solution for a common pattern of data sharing: when
multiple threads need to update a single variable using an associative mathematical operator (like +, *, -,
&, |).
• How it Works:
1. A private copy of the reduction variable is created for each thread.
2. This private copy is initialized to the identity value of the operator (0 for +, 1 for *, etc.).
3. Each thread performs its calculations on its private copy.
4. At the end of the region, OpenMP automatically and safely combines the values from all
private copies into the original global variable.
• Use Case: Summing an array, finding a maximum/minimum value, performing bitwise operations.
This should always be preferred over manual protection with atomic or critical for these operations
as it is much more efficient.
Example:
C
long long sum = 0;
#pragma omp parallel for reduction(+:sum)
for (long long i = 0; i < 1000000; i++) {
sum += data[i]; // No data race here! Each thread adds to its private sum.
}
// After the loop, all private sums are combined into the global 'sum'.
The threadprivate Directive
Distinct from the clauses above, threadprivate is a directive used to create a global variable that persists
across multiple parallel regions but remains private to each thread.
C
int counter;
#pragma omp threadprivate(counter)
int main() {
#pragma omp parallel
{
// Each thread gets its own 'counter' which is initialized once (usually to 0)
counter = omp_get_thread_num();
} // End of first parallel region
#pragma omp parallel
{
// The second region sees each thread's 'counter' value from the first region.
printf("Thread %d: counter = %d\n", omp_get_thread_num(), counter);
} // End of second parallel region
return 0;
}
Even with perfect data scoping, performance can be crippled by false sharing. This occurs when private
variables used by different threads happen to be located on the same cache line. When one thread writes
to its private variable, the hardware invalidates the entire cache line for all other threads, forcing them to
re-fetch it from main memory, even though they aren't using the same variable. This creates massive,
hidden synchronization overhead.
Solution: Manually pad your data structures to ensure that variables that will be accessed by different
threads are on different cache lines. This often involves adding unused buffer space.
By mastering these data-sharing concepts and clauses, you gain full control over your OpenMP programs,
enabling you to write parallel code that is not only correct but also scalable and efficient.
I. In OpenMP, controlling whether a variable is shared or
private across threads is critical for both correctness and
performance.
Incorrect scoping can cause:
• Race conditions
• Unexpected results
• Performance degradation
int main() {
int temp; // Shared by default
#pragma omp parallel for
for (int i = 0; i < 4; i++) {
temp = i * 2;
printf("Thread %d: temp = %d\n", omp_get_thread_num(), temp);
}
return 0;
}
Problem:
All threads write to the same shared temp, so the printed values are unpredictable (a race condition).
Fix:
int main() {
int temp;
#pragma omp parallel for private(temp)
for (int i = 0; i < 4; i++) {
temp = i * 2;
printf("Thread %d: temp = %d\n", omp_get_thread_num(), temp);
}
return 0;
}
• Now each thread has its own temp, avoiding race conditions.
Firstprivate Example
int x = 5;
#pragma omp parallel for firstprivate(x)
for (int i = 0; i < 4; i++) { x += i; } // each thread's private x starts at 5
Lastprivate Example
int x = 0;
#pragma omp parallel for lastprivate(x)
for (int i = 0; i < 4; i++) { x = i; } // after the loop, x holds the value from the last iteration (3)
Common Mistakes
Mistake | Explanation
Not declaring temp private | Leads to race conditions
Using private but expecting initial value | Use firstprivate instead
Using shared where values are updated | Use reduction or atomic instead
Best Practices
1. Declare loop counters as private explicitly, even though the parallelized loop's counter is private by default.
2. Use reduction for shared counters.
3. Use firstprivate if you need an initialized value.
4. Avoid sharing temporary variables unless they’re read-only.
OpenMP provides fine-grained control over this process through the schedule clause, which is applied to
for or parallel for directives.
Imagine a parallelized loop where some iterations are computationally expensive while others are very
quick.
A Problematic Workload:
C
#pragma omp parallel for
for (int i = 0; i < 16; i++) {
// Imagine work_function(i) takes 'i' seconds to complete.
// Early iterations are fast, later iterations are very slow.
work_function(i);
}
If you have 4 threads, the default static scheduling might divide the work like this:
• Thread 0: iterations 0-3 (roughly 0+1+2+3 = 6 seconds of work)
• Thread 1: iterations 4-7 (22 seconds)
• Thread 2: iterations 8-11 (38 seconds)
• Thread 3: iterations 12-15 (54 seconds)
The result? Thread 0 finishes in 6 seconds and then sits idle for the remaining 48 seconds while it waits
for Thread 3 to complete its massive workload. This is a classic example of load imbalance, and it
severely limits the speedup gained from parallelism. The schedule clause is the tool to solve this problem.
1. schedule(static[, chunk_size])
This is the simplest and often the default scheduling kind. The iterations are divided among threads before
the loop starts execution.
• How it works (without chunk_size): The iteration space is divided into roughly equal-sized blocks,
one for each thread. This is known as block scheduling, e.g. schedule(static).
• How it works (with chunk_size): The iteration space is divided into contiguous chunks of size
chunk_size. These chunks are then distributed to the threads in a round-robin fashion. This is known
as interleaved or cyclic scheduling, e.g. schedule(static, 1).
Characteristics:
• Overhead: Very low. The work distribution is calculated once at the beginning of the loop. There
is no further communication or synchronization needed for scheduling.
• Load Balancing: Can be poor if the work per iteration is unpredictable or uneven.
• Data Locality: Generally good, as threads work on contiguous blocks of iterations, which can lead
to better CPU cache performance.
• When all loop iterations have a uniform, predictable workload. For example, simple array addition:
a[i] = b[i] + c[i].
• When minimizing scheduling overhead is the absolute top priority.
• Use static, 1 when there's a dependency risk with adjacent iterations, for example, to mitigate false
sharing.
2. schedule(dynamic[, chunk_size])
This schedule is designed to address load imbalance by distributing work as threads become available.
• How it works: The iteration space is divided into chunks. By default, chunk_size is 1. When a thread
finishes its current chunk, it requests the next available chunk from the OpenMP runtime. This
continues until all chunks are processed.
Characteristics:
• Overhead: High. Each time a thread requests a new chunk, it requires synchronization, which adds
significant overhead compared to static.
• Load Balancing: Excellent. Since threads that finish quickly can immediately grab new work, it
naturally balances uneven loads. No thread sits idle if there is still work to be done.
• Data Locality: Can be poor. A thread may process iteration 5, then 95, then 150, jumping all over
the data and leading to poor cache performance.
• When the workload per iteration is highly variable, unpredictable, or unknown. For example,
processing items in a work queue where each item has a different complexity.
• When load balancing is more important than the overhead of scheduling.
• A larger chunk_size (e.g., dynamic, 16) is a good compromise, reducing the scheduling overhead
(fewer requests) while still providing good load balancing.
3. schedule(guided[, chunk_size])
This is an adaptive schedule that attempts to combine the best of static and dynamic. It starts with large
chunks to reduce overhead and moves to smaller chunks to handle load balancing at the end of the loop.
• How it works: Threads dynamically grab chunks of work, but the chunk size decreases over time.
It starts with a large chunk (proportional to the number of remaining iterations divided by the number
of threads) and exponentially shrinks down to the optional chunk_size parameter (or 1 if not
specified).
Characteristics:
• Overhead: Medium. It's a compromise between the low overhead of static and the high overhead
of dynamic.
• Load Balancing: Very good. It provides the load balancing benefits of dynamic scheduling,
especially towards the end of the loop where imbalance is most likely to cause threads to go idle.
• Data Locality: Better than dynamic since threads initially work on larger, more contiguous blocks of
data.
• A very strong general-purpose choice for loops with some, but not extreme, load imbalance.
• When you want good load balancing without the full overhead penalty of schedule(dynamic, 1).
4. schedule(auto)
This gives control to the OpenMP compiler and runtime system, allowing them to choose the "best"
schedule based on their own internal heuristics.
• How it works: The decision is made by the implementation. It might analyze the loop structure or
use runtime information to choose between static, dynamic, or guided.
• When to use it: When you trust the compiler to make a better decision than you can, or when you
want to write portable code that allows the system to optimize for the specific hardware it's running
on. The actual behavior can vary between compilers.
5. schedule(runtime)
This defers the scheduling decision from compile-time to runtime. The actual schedule to be used is
determined by the value of the OMP_SCHEDULE environment variable.
• How it works: You compile the code with schedule(runtime). Before running the executable, you set
the environment variable. For example: export OMP_SCHEDULE="dynamic,4".
• When to use it: This is extremely useful for tuning application performance without
recompiling. You can experiment with different scheduling strategies and chunk sizes on your
target machine to find the optimal settings for a specific problem.
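For example, the loop can be written once with schedule(runtime) and then tuned from the shell. work_function() is the placeholder from the earlier example, and ./my_program stands for whatever executable you built.
C
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < 16; i++) {
    work_function(i);
}
Bash
# Try different schedules without recompiling
export OMP_SCHEDULE="guided,8"
./my_program
export OMP_SCHEDULE="dynamic,2"
./my_program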
Summary Table
Schedule Kind | Overhead | Load Balancing | Best For...
static | Very Low | Poor (for uneven loads) | Perfectly uniform workloads.
dynamic | High | Excellent | Highly irregular, unpredictable workloads.
guided | Medium | Very Good | A great general-purpose choice for many non-uniform loops.
auto | Varies | Varies | Letting the compiler/runtime decide.
runtime | Varies | Varies | Performance tuning and experimentation without recompiling.
Goal
To efficiently divide loop iterations among threads in a way that:
• Maximizes parallelism
• Minimizes load imbalance
• Avoids idle threads
The schedule clause determines how loop iterations are split across threads.
1. static
#pragma omp parallel for schedule(static)
Example:
N = 12, Threads = 3, chunks of 4 each:
T0: 0, 1, 2, 3
T1: 4, 5, 6, 7
T2: 8, 9, 10, 11
2. static, chunk_size
#pragma omp parallel for schedule(static, 2)
Threads = 3, Chunk = 2:
T0: i = 0–1, 6–7
T1: i = 2–3, 8–9
T2: i = 4–5, 10–11
3. dynamic
#pragma omp parallel for schedule(dynamic, 2)
4. guided
#pragma omp parallel for schedule(guided)
• Like dynamic, but chunk size starts big and shrinks exponentially.
• Reduces scheduling overhead compared to dynamic.
• Often used when N is large.
5. runtime
export OMP_SCHEDULE="dynamic,4"
Example:
int main() {
#pragma omp parallel for schedule(dynamic, 2)
for (int i = 0; i < 12; i++) {
printf("Thread %d is handling iteration %d\n", omp_get_thread_num(), i);
}
return 0;
}
Explanation:
• schedule(dynamic, 2) means each thread picks 2 iterations at a time as soon as it’s free.
• Output will vary per run due to runtime decisions.
Best Practices
1. Use static for predictable iteration times.
2. Use dynamic or guided for unpredictable work per iteration.
3. Tune chunk_size — small = better balance, large = less overhead.
4. Use runtime + environment variable for experimentation without code recompilation.
A reduction operation (or "folding") reduces a set of values down to a single result using a specific
mathematical or logical operator. Common examples include:
• Finding the maximum or minimum value in a dataset.
• Calculating a logical AND/OR across a series of boolean flags.
• Multiplying elements to find a factorial or total product.
The problem in a parallel context is that these operations create a loop-carried flow dependence, leading
to a data race.
C
long long sum = 0;
#pragma omp parallel for
for (int i = 0; i < 1000000; i++) {
// DATA RACE! Multiple threads read and write 'sum' concurrently.
sum += array[i];
}
// The final 'sum' will be incorrect and unpredictable.
Here, multiple threads simultaneously execute sum += array[i]. This operation is not atomic; it involves
reading sum, adding to it, and writing it back. Threads will constantly overwrite each other's work, leading
to lost updates and a wrong answer.
C
// Correct, but inefficient
#pragma omp parallel for
for (int i = 0; i < 1000000; i++) {
#pragma omp atomic
sum += array[i];
}
While this is correct, it's often inefficient. It forces threads to serialize their access to the sum variable,
creating a bottleneck and negating much of the benefit of parallelism, especially if the loop body is small.
Syntax: reduction(operator:list_of_variables)
1. Create Private Copies: For each thread in the team, OpenMP creates a new, private copy of the
reduction variable (e.g., sum).
2. Initialize Private Copies: Each private copy is initialized to the identity value for the specified
operator (e.g., 0 for +, 1 for *, the largest negative number for max).
3. Perform Local Computation: Each thread executes its portion of the loop iterations, but all
updates are made to its private copy only. Since no data is shared at this stage, there is no data
race and no need for synchronization.
4. Combine Results: After all threads have finished their loop iterations, OpenMP performs a final,
safe, and synchronized operation to combine the results from all private copies into the original,
global variable.
C
long long sum = 0;
// OpenMP handles everything: privatization, initialization, and final combination.
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 1000000; i++) {
// Each thread adds to its own private 'sum'
sum += array[i];
}
// After the loop, the global 'sum' is correct.
Key Operators and Use Cases
A common task is to find not just the max value, but also its index. A reduction is perfect for the value, but
the index requires a bit more care.
C
#include <omp.h>
#include <stdio.h>
#include <float.h> // For DBL_MAX
int main() {
double data[] = {1.5, 9.2, 0.5, -1.0, 9.8, 4.3, 7.1};
int n = sizeof(data) / sizeof(data[0]);
double max_val = -DBL_MAX; // identity value for a max reduction
int max_loc = -1;
// Step 1: find the maximum value using a reduction.
#pragma omp parallel for reduction(max:max_val)
for (int i = 0; i < n; i++) {
if (data[i] > max_val)
max_val = data[i];
}
// Now that we know the true global maximum, we can find its location.
// This second loop can also be parallelized if 'n' is very large.
for (int i = 0; i < n; i++) {
if (data[i] == max_val) {
max_loc = i;
break; // Found the first occurrence
}
}
printf("Max value is %.2f at index %d\n", max_val, max_loc);
return 0;
}
Output: Max value is 9.80 at index 4
This two-step approach is often clearer and more efficient than trying to force a complex object into a
reduction.
1. Prefer reduction over atomic or critical: If your operation fits one of the standard reduction operators,
always use the reduction clause. It expresses your intent more clearly and gives the OpenMP runtime
the freedom to use highly optimized, often hardware-specific, combining trees, which is much faster
than serialized locking.
2. Use for Loop-Wide Aggregations: Reductions are designed for when you are aggregating a
single value across the entire iteration space of a loop.
3. Ensure the Operation is Associative: The operator must be associative (e.g., (a + b) + c is the
same as a + (b + c)). The order in which OpenMP combines the private results is not guaranteed, so
the operation must be independent of order. Floating-point addition is technically not perfectly
associative, but in most real-world scenarios, the tiny precision differences are acceptable.
4. Keep the Loop Body Lean: The performance benefit of reduction is most pronounced when the
work inside the loop (the part being parallelized) is significant enough to overcome the overhead of
thread creation and the final combination step. For trivial loops, a sequential version might still be
faster.
5. Handling More Complex Reductions (Arrays/Structs): The standard reduction clause does not
work directly on C-style arrays or structs. Newer versions of OpenMP (4.0 and later) allow for user-
defined reductions, but this is an advanced feature. A more common and portable approach for
reducing an array is to use a temporary private array and combine the results manually after the
parallel loop.
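For instance, a histogram is an array-valued reduction that can be handled exactly this way: each thread fills a private partial histogram and the partial results are merged once at the end. This is only a sketch; NBINS, the modulo binning rule, and the assumption of non-negative input values are illustrative.
C
#include <omp.h>
#define NBINS 16

void histogram(const int *data, int n, long long hist[NBINS]) {
    for (int b = 0; b < NBINS; b++) hist[b] = 0;

    #pragma omp parallel
    {
        long long local[NBINS] = {0}; // private partial histogram: no sharing, no race

        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            local[data[i] % NBINS]++; // assumes data[i] >= 0

        // Merge this thread's completed partial result; 'critical' keeps the merge safe.
        #pragma omp critical
        for (int b = 0; b < NBINS; b++)
            hist[b] += local[b];
    }
}
The nowait on the loop is safe here because each thread merges only its own partial histogram, which is already complete when that thread reaches the critical section.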
By effectively using the reduction clause, you can write parallel code that is not only safe from data races
but also highly scalable and efficient, cleanly expressing a fundamental parallel pattern to the Open-MP
runtime.
If multiple threads update a shared variable, it causes a race condition.
How It Works:
The reduction clause gives each thread a private copy of the variable, initialized to the operator's identity value. Each thread accumulates into its own copy, and OpenMP safely combines the partial results into the shared variable at the end of the loop.
Example: Finding Maximum Value
int max_val = INT_MIN;
#pragma omp parallel for reduction(max:max_val)
for (int i = 0; i < N; i++) {
if (A[i] > max_val)
max_val = A[i]; // Parallel-safe
}
Each thread keeps its own private max_val; when the loop ends, OpenMP combines the per-thread maxima into the global maximum.
Example: Multiplication
int product = 1;
#pragma omp parallel for reduction(*:product)
for (int i = 1; i <= N; i++) {
product *= i;
}
Example: Multiple Reductions in One Loop
int sum = 0, prod = 1;
#pragma omp parallel for reduction(+:sum) reduction(*:prod)
for (int i = 1; i <= N; i++) {
sum += i;
prod *= i;
}
Concept | Description
Purpose | Combine thread-local results safely
Syntax | reduction(op:var)
Benefits | No race conditions, fast, compiler-optimized
Use Cases | sum, product, max/min, logical ops, counters
Advanced | declare reduction for custom types
Crucially, a work-sharing construct does not create new threads. It assumes a team of threads already
exists and its sole purpose is to split a specific piece of work among them. These constructs are the heart
of most OpenMP applications, allowing for easy and powerful parallelization of common programming
patterns.
A key feature of these constructs is the implicit barrier at the end. By default, all threads will wait at the
end of a work-sharing construct until every thread in the team has finished its part of the work. This
synchronization is vital for correctness, ensuring that subsequent code is not executed until the shared
work is complete. This barrier can be disabled with the nowait clause if synchronization is not needed.
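As a small illustration of nowait (the array a, the bound n, and the helper functions are placeholders, not names from this unit):
C
#pragma omp parallel
{
    #pragma omp for nowait // threads do not wait at the end of this loop...
    for (int i = 0; i < n; i++)
        a[i] = process(i);

    do_independent_work(); // ...they move straight on to unrelated work

    #pragma omp barrier // synchronize explicitly before anything reads a[]
}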
This is, by far, the most frequently used work-sharing construct in OpenMP. It is designed to split the
iterations of a for loop (or do loop in Fortran) among the threads in the team. This is a classic example of
data parallelism, where the same operation is performed on different pieces of data.
How it Works:
The for directive is placed immediately before a for loop. The OpenMP runtime automatically divides the
loop's iteration space (e.g., from 0 to N-1) and assigns a portion of the iterations to each thread.
Syntax:
C
#pragma omp parallel
{
// Other parallel code can go here
#pragma omp for [clauses]
for (int i = 0; i < N; i++) {
// Code inside the loop
}
// Implicit barrier: All threads wait here until the loop is fully complete.
}
Combined Directive:
For convenience, OpenMP allows a combined directive that both creates the parallel region and applies
the work-sharing for construct in one line. This is the most common usage pattern.
C
#pragma omp parallel for [clauses]
for (int i = 0; i < N; i++) {
// Code inside the loop
}
Key Clauses:
• schedule(kind, chunk_size): Controls how iterations are divided (e.g., static, dynamic, guided). This is
crucial for load balancing.
• private, firstprivate, lastprivate: Manage the data environment for variables used within the loop.
• reduction(operator:list): Safely performs reduction operations (e.g., summing into a shared variable).
• nowait: Removes the implicit barrier at the end of the loop.
Use Case: Any loop where the iterations are independent of each other. A perfect example is processing
elements of an array.
Example:
C
double a[1000], b[1000], c[1000];
// Initialize a and b...
#pragma omp parallel for
for (int i = 0; i < 1000; i++) {
c[i] = a[i] + b[i]; // every iteration is independent of the others
}
The sections construct distributes a fixed set of different, independent code blocks among the threads (functional decomposition rather than data parallelism).
How it Works:
The sections construct encloses a set of #pragma omp section blocks. The OpenMP runtime assigns each
section block to a different thread in the team. If there are more threads than sections, the extra threads
will do nothing and skip to the barrier at the end. If there are more sections than threads, threads will
execute multiple sections until all are complete.
Syntax:
C
#pragma omp parallel
{
#pragma omp sections [clauses]
{
#pragma omp section
{
// Code block A (e.g., perform_task_A())
}
#pragma omp section
{
// Code block B (e.g., perform_task_B())
}
} // End of sections: implicit barrier here unless nowait is used
}
Key Clauses:
• private, firstprivate, lastprivate, reduction: Work just as they do for the for construct, managing data for the
enclosed sections.
• nowait: Removes the implicit barrier.
Use Case: Functional decomposition, such as in a pipeline or when performing unrelated calculations that
can happen at the same time. For example, calculating an average, a median, and a standard deviation
on the same dataset can be done in three parallel sections.
Example:
C
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
printf("Task A running on thread %d\n", omp_get_thread_num());
D r. R . G i r i s h a , P r o f e s s o r, D e p t . O f C S E , P E S C E , M a n d y a P a g e 42 | 59
Unit 5: OpenMP - Summary
Task B running on thread 1
Task A running on thread 0
The single construct is a specialization that ensures a block of code is executed by only one of the threads
in the team—whichever one reaches it first. The other threads skip the block and wait at the implicit barrier
at its end.
How it Works:
The first thread to encounter the single directive will execute the code block. All other threads will bypass
it.
Syntax:
C
#pragma omp parallel
{
// Code executed by all threads
#pragma omp single [clauses]
{
// This block is executed by only one thread.
printf("I am the one and only thread %d executing this.\n", omp_get_thread_num());
}
// Implicit barrier: All threads wait here until the single block is complete.
}
Key Clauses:
• private, firstprivate: Can be used to manage data within the single block.
• nowait: Allows other threads to proceed without waiting for the single thread to finish. This is very
commonly used.
Use Case:
• Initialization: Performing a setup task that only needs to happen once within a parallel region.
• Input/Output: Printing status messages or reading input in the middle of parallel work.
• Finalization: Committing results or finalizing a data structure after a parallel computation.
C
#pragma omp parallel
{
// Each thread does some work...
do_work();
// Now, have one thread print a progress update without stopping the others.
#pragma omp single nowait
{
printf("Work is 50%% complete.\n");
}
} // End of the parallel region; nowait lets the other threads continue without waiting for the printout
The #pragma omp master directive is very similar to single, but with two key differences:
1. Who Executes: The code is only ever executed by the master thread (thread ID 0).
2. No Implicit Barrier: The master directive has no implicit barrier at the end. Other threads do not
wait for the master to finish.
master is often used for the same tasks as single (like I/O), but it's less flexible as it hard-codes the execution
to thread 0. single is generally preferred unless there is a specific reason to tie an action to the master
thread.
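A small sketch contrasting the two (do_work() is a placeholder):
C
#pragma omp parallel
{
    do_work(); // all threads do their share of the work

    #pragma omp master
    {
        // Only thread 0 executes this block, and the other threads do NOT wait for it.
        printf("Progress report from the master thread\n");
    }

    #pragma omp barrier // add an explicit barrier only if the other threads must wait
}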
By combining these work-sharing constructs, a programmer can effectively orchestrate complex parallel
operations, matching the structure of the code (loops, independent tasks) to the appropriate OpenMP
directive.
Each section represents a different block of code. OpenMP assigns each section to a different thread for parallel
execution.
Syntax Overview
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
// Task 1
}
#pragma omp section
{
// Task 2
}
}
}
Simple Example
#include <stdio.h>
#include <omp.h>
int main() {
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
printf("Task A by thread %d\n", omp_get_thread_num());
}
#pragma omp section
{
printf("Task B by thread %d\n", omp_get_thread_num());
}
}
}
return 0;
}
Explanation:
Each section is picked up by a different thread (when enough threads are available), so Task A and Task B run concurrently and the print order can vary between runs.
Use nowait to let threads move ahead without waiting for others to finish their sections.
Common Mistakes
Mistake | Explanation
Missing #pragma omp section | Each task must be labeled as a section
Code inside sections but outside section | That part is executed by every thread
Too few threads | Unused sections are not executed
Using sections inside parallel for | Invalid: mixing parallel constructs wrongly
Each thread picks up one of the 3 different tasks: load, process, store.
Best Practices
1. Use sections for different tasks, not loops.
2. Make sure each task is properly marked with #pragma omp section.
3. Consider nowait if subsequent code doesn’t need to wait.
4. Use nested parallelism if each section itself can be parallelized.
Summary Table
Feature | Description
Purpose | Divide non-loop tasks across threads
Directive | #pragma omp sections + #pragma omp section
Inside parallel | Must be inside #pragma omp parallel
Synchronization | Implicit barrier unless nowait is used
Use for | Independent logic blocks, not loop iterations
By default, a standard C, C++, or Fortran compiler will ignore #pragma omp directives, treating them as
simple comments. To enable OpenMP, you must explicitly pass a specific flag during compilation.
For the most common open-source compilers, GCC (g++) and Clang (clang++), the flag is -fopenmp.
Bash
# To compile and create an executable named 'my_program'
g++ -fopenmp my_program.cpp -o my_program
# For C
gcc -fopenmp my_program.c -o my_program
Command Line Syntax (Clang/clang++):
Clang's OpenMP runtime is provided by the LLVM project. The compilation flag is the same.
Bash
# To compile with Clang
clang++ -fopenmp my_program.cpp -o my_program
When you use -fopenmp, the compiler does two main things:
1. Parses Pragmas: It recognizes and interprets the #pragma omp directives to generate threaded
code.
2. Links the Runtime Library: It automatically links the necessary OpenMP runtime library (like
libgomp for GCC), which manages thread creation, scheduling, and synchronization.
For the modern LLVM-based Intel compilers, the flag is -fiopenmp. For the classic Intel compilers (icc/ifort),
the flag was -qopenmp.
Bash
icx -fiopenmp my_program.cpp -o my_program
Once compiled, the executable is run like any other program. However, the OpenMP runtime's behavior
can be controlled through environment variables. The most important of these is OMP_NUM_THREADS.
OMP_NUM_THREADS
This variable tells the OpenMP runtime how many threads to create for the parallel regions.
Bash
# Set the number of threads to 4 for the next command
export OMP_NUM_THREADS=4
Best Practices:
• If OMP_NUM_THREADS is not set, the OpenMP runtime will typically default to using the number of
available hardware cores on the system.
• You can also set the number of threads within the code using the omp_set_num_threads() function, but
using the environment variable is often more flexible for testing and deployment.
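A minimal sketch of the programmatic alternative mentioned above:
C
#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_num_threads(4); // takes precedence over OMP_NUM_THREADS for this program
    #pragma omp parallel
    {
        #pragma omp single
        printf("Running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}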
Debugging OpenMP is challenging because bugs are often non-deterministic. A program might work
correctly ten times with 2 threads, but fail on the eleventh run with 4 threads due to a subtle timing-
dependent race condition.
• Data Race: The most common bug. Multiple threads access a shared variable without proper
synchronization, and at least one access is a write. (See the "Managing Shared and Private Data"
explanation for solutions).
• Deadlock: A situation where two or more threads are blocked forever, each waiting for the other to
release a resource. This can happen with improper use of locks or ordered critical sections.
• False Sharing: A performance bug, not a correctness bug. Private data of different threads
happens to reside on the same cache line, causing performance degradation due to cache
invalidations.
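A minimal sketch of the first kind of bug (the counter variable and the iteration count are illustrative):
C
#include <stdio.h>

int main(void) {
    int counter = 0;
    // Data race: every thread performs an unsynchronized
    // read-modify-write on the shared variable 'counter'.
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        counter++;
    }
    // Typically prints a value below 100000 that changes from run to run.
    printf("counter = %d\n", counter);
    return 0;
}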
Using printf can be a quick first step, but it has pitfalls in a parallel context. Standard output is a shared
resource, and having multiple threads print simultaneously can result in jumbled, unreadable output.
Thread-Safe Printing: To make printf useful, wrap it in a critical section to ensure only one thread can print
at a time.
C++
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    // ... do some work ...
    #pragma omp critical
    printf("Thread %d has finished its work\n", thread_id);
}
When debugging, always compile with the -g flag to include debugging symbols, which are essential for
debuggers like GDB.
Bash
g++ -fopenmp -g my_program.cpp -o my_program
3. Using a Standard Debugger (GDB)
GDB (the GNU Debugger) has support for debugging threaded programs, including OpenMP.
• info threads: Lists all threads currently running in your program, their ID, and what line of code they
are executing.
• thread <ID>: Switches the GDB context to a specific thread. You can then inspect its call stack ( bt)
and local variables (p var).
• break <file>:<line> thread <ID>: Sets a breakpoint that only triggers for a specific thread.
• set scheduler-locking on: Freezes all other threads when the current thread hits a breakpoint. This is
extremely useful for examining the state of the entire application at a specific moment without
other threads changing things under the hood.
Bash
# Start GDB
gdb ./my_program
For complex race conditions, GDB might not be enough. Specialized tools are designed to detect these
issues automatically.
a) Valgrind's Helgrind Tool
How to Use:
Bash
# Compile with -g, then run your program through Valgrind/Helgrind
valgrind --tool=helgrind ./my_program
Helgrind will monitor memory accesses and report any potential data races it finds, complete with stack
traces showing where the conflicting accesses occurred. This is one of the most powerful tools for finding
race conditions.
b) Intel® Inspector
This is a commercial graphical tool that is excellent at finding both correctness errors (races, deadlocks)
and performance issues (false sharing) in threaded applications. It provides a rich user interface to
visualize where and why a race condition is happening.
c) Compiler Sanitizers
Modern compilers like Clang and GCC have built-in "sanitizers" that can detect threading errors at runtime.
• ThreadSanitizer (-fsanitize=thread): Add this flag during compilation. When you run the program,
the sanitizer will halt execution and print a detailed report if it detects a data race. This is very
effective but does add significant runtime overhead.
Bash
# Compile with the sanitizer enabled
g++ -fopenmp -g -fsanitize=thread my_program.cpp -o my_program_sanitized
What Is OpenMP?
OpenMP (Open Multi-Processing) is a compiler-based API for shared-memory parallel programming in C, C++,
and Fortran.
It uses pragmas/directives (#pragma omp) that are processed by the compiler to generate multithreaded code.
In a Makefile, the same flag is typically added to the compiler flags. A minimal sketch (the CC and CFLAGS values shown are one reasonable choice, not taken from a specific project):
Makefile
CC = gcc
CFLAGS = -fopenmp -g

my_program: my_program.o
	$(CC) $(CFLAGS) -o my_program my_program.o

my_program.o: my_program.c
	$(CC) $(CFLAGS) -c my_program.c
Common debugging challenges:
• Non-deterministic behavior
• Race conditions
• Deadlocks
1. Print Thread ID
#include <omp.h>
#include <stdio.h>
#pragma omp parallel
printf("Thread %d out of %d\n", omp_get_thread_num(), omp_get_num_threads());
Variable | Description
OMP_NUM_THREADS | Sets the number of threads
OMP_SCHEDULE | Controls loop scheduling when schedule(runtime) is used
OMP_DYNAMIC | Allows the runtime to adjust the number of threads dynamically
OMP_STACKSIZE | Controls the per-thread stack size
export OMP_NUM_THREADS=4
export OMP_SCHEDULE="dynamic,2"
break main
run
bt # Backtrace
info threads
thread 2 # Switch to thread 2
export OMP_DISPLAY_ENV=TRUE
This shows OpenMP's runtime environment: the number of threads, stack size, affinity settings, and so on.
Summary Table
Task | How to Do It
Enable OpenMP | -fopenmp compiler flag
Set thread count | export OMP_NUM_THREADS=4
Debug with GDB | gdb ./my_program + info threads
Detect data races | -fsanitize=thread
Schedule tuning | export OMP_SCHEDULE="dynamic,2"
View thread layout | export OMP_DISPLAY_ENV=TRUE
This detailed explanation covers the critical factors that govern OpenMP performance and the techniques
to optimize them.
Before diving into code, it's crucial to understand the theoretical ceiling on performance improvement.
Amdahl's Law states that the maximum speedup of a program is limited by its sequential fraction.
S = 1 / ((1 − P) + P / N)
Where:
• P is the fraction of the program that can be executed in parallel.
• N is the number of processing cores (threads).
Implication: If only 80% of your program can be parallelized (P=0.8), even with an infinite number of cores
(N→∞), the maximum speedup you can achieve is 1/(1−0.8)=5 times.
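For example, keeping P = 0.8 but using a more realistic N = 4 cores gives S = 1 / (0.2 + 0.8/4) = 1 / 0.4 = 2.5, well short of the ideal 4× speedup.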
Practical lesson: Focus your efforts on parallelizing the most time-consuming parts of your application
(the "hotspots"). Profiling your code before parallelization is essential to identify where to spend your time.
The single most important practical factor in OpenMP performance is load balancing. If one thread is
assigned significantly more work than others, the remaining threads will finish early and sit idle, wasting
computational resources. The total execution time is dictated by the last thread to finish.
The Challenge: Uneven workloads in loop iterations. For example, processing a triangular matrix or a
loop where the work depends on the input data.
OpenMP's schedule clause on a for loop dictates how iterations are distributed among threads. Choosing
the right one is critical.
• schedule(static): Low overhead. The work is divided into contiguous chunks and assigned before the
loop starts.
o Best for: Loops where every iteration takes the same amount of time.
o Performance Trap: Can lead to terrible load imbalance if workloads are uneven.
• schedule(dynamic): High overhead. Threads request a chunk of iterations as they become free.
o Best for: Loops with unpredictable or highly variable iteration costs. It provides excellent
load balancing.
o Performance Trap: The overhead of threads constantly requesting work can be significant.
Using a larger chunk size (dynamic, 16) can mitigate this, as in the sketch after this list.
• schedule(guided): Adaptive. Starts with large chunks and progressively makes them smaller.
o Best for: A great general-purpose choice that balances low overhead at the start with fine-
grained load balancing at the end. Often a good first choice for non-uniform loops.
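A minimal sketch of the dynamic case from the list above (variable_work, the loop bound, and the chunk size of 16 are placeholders chosen for illustration):
C
#include <omp.h>

// Placeholder for work whose cost varies unpredictably from iteration to iteration
void variable_work(int i) {
    volatile double x = 0.0;
    for (int k = 0; k < (i % 1000) * 100; k++) x += k;
}

int main(void) {
    const int n = 10000;
    // Free threads grab the next chunk of 16 iterations, balancing the load
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) {
        variable_work(i);
    }
    return 0;
}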
How threads access data is as important as how they perform computations. Modern CPUs are orders of
magnitude faster than main memory, making efficient use of CPU caches paramount.
This is the false-sharing scenario: data that is logically private to different threads ends up on the same cache line. When one thread writes to its variable, the hardware's cache coherency protocol invalidates the entire cache line for all other threads. Even though the other threads are not using that specific variable, they are forced to discard the line and perform a slow re-fetch from main memory when they next access their own data on that same line.
Ensure that data that will be written to by different threads is separated by at least one cache line's worth
of memory.
C
// Potential false sharing
int results[NUM_THREADS];
#pragma omp parallel
{
int id = omp_get_thread_num();
results[id] = do_calculation(); // Writing to adjacent memory locations
}
Example: With padding
C
// Padded to avoid false sharing
struct padded_int {
int value;
char padding[60]; // 64-byte cache line - 4 bytes for int
};
struct padded_int results[NUM_THREADS];
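Continuing that snippet, each thread's write now lands on its own cache line (do_calculation is the same placeholder used in the earlier example):
C
#pragma omp parallel
{
    int id = omp_get_thread_num();
    results[id].value = do_calculation(); // no cache line is shared between writers
}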
Optimization Techniques:
• Parallelize at the Outermost Level: It is much more efficient to parallelize one outer loop than to repeatedly create and destroy threads inside an inner loop, as illustrated below.
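A minimal sketch of the contrast (the grid dimensions and the compute function are placeholders):
C
#include <omp.h>

#define N 1000
#define M 1000

double grid[N][M];

double compute(int i, int j) { return 0.5 * i + j; } // placeholder work

int main(void) {
    // Preferred: the thread team is created once and covers the whole outer loop
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            grid[i][j] = compute(i, j);
        }
    }
    // Anti-pattern: putting "#pragma omp parallel for" on the inner j loop instead
    // would create and tear down a thread team N times.
    return 0;
}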
5. Efficient Synchronization
Synchronization is necessary for correctness but is also a major source of performance degradation
because it forces threads to wait.
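A minimal sketch of the cost difference (the array and its contents are placeholders; the reduction clause is the one described earlier):
C
#include <omp.h>
#include <stdio.h>

#define N 1000000
double a[N];

int main(void) {
    for (int i = 0; i < N; i++) a[i] = 1.0;

    // Expensive: the critical section serializes every single addition
    double slow_sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        slow_sum += a[i];
    }

    // Cheap: each thread accumulates privately; partial sums are combined once at the end
    double fast_sum = 0.0;
    #pragma omp parallel for reduction(+:fast_sum)
    for (int i = 0; i < N; i++) {
        fast_sum += a[i];
    }

    printf("%.0f %.0f\n", slow_sum, fast_sum);
    return 0;
}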
6. Thread Affinity and Placement
How threads are mapped to physical CPU cores can significantly impact performance, especially on multi-socket NUMA (Non-Uniform Memory Access) systems.
• Thread Affinity: This is the practice of "binding" or "pinning" a thread to a specific CPU core. This
can improve cache performance by ensuring a thread consistently runs on the same core, keeping
its data hot in that core's L1/L2 cache.
• Environment Variables:
o OMP_NUM_THREADS: Setting this to a value greater than the number of available physical
cores often leads to performance degradation due to thread-switching overhead (thrashing).
o OMP_PROC_BIND / GOMP_CPU_AFFINITY: These can be used to control how threads are bound
to cores (e.g., close to bind them to adjacent cores, spread to spread them out across sockets).
By systematically addressing these areas—profiling to find hotspots, ensuring good load balance with the
right schedule, managing data to maximize cache hits and avoid false sharing, minimizing overhead, and
using the most efficient synchronization strategy—you can transform a simple parallelized program into a
truly high-performance OpenMP application.
1. Thread Count
Match the number of threads to the available CPU cores:
export OMP_NUM_THREADS=4
2. Loop Scheduling
Choose the schedule based on how balanced the loop iterations are.
Example:
#pragma omp parallel for schedule(dynamic, 2)
3. Synchronization Overhead
Synchronization can become a performance bottleneck.
4. Memory Access Patterns
Poor memory layout leads to cache misses and false sharing.
Fix: keep each thread's data on its own cache line (for example with padding, as shown earlier) and access arrays contiguously so that adjacent iterations touch adjacent memory.
5. Granularity of Work
• Too little work per thread = overhead dominates.
• Too much work per thread = poor load balance.
Rule of Thumb: give each thread a chunk of work large enough to amortize the scheduling overhead, but small enough that all threads stay busy until the end of the loop.
This improves:
• Cache reuse
• Memory locality
• Performance consistency
Benchmarking Performance
Measure execution time with omp_get_wtime():
double t1 = omp_get_wtime();
// parallel region
double t2 = omp_get_wtime();
printf("Time taken: %f seconds\n", t2 - t1);
Tools for Performance Tuning
Tool | Purpose
gprof, perf | General profiling
Intel VTune | Advanced profiling and thread analysis
ThreadSanitizer | Detect race conditions
LIKWID | Cache and memory bandwidth profiling
OMP_DISPLAY_ENV | Show thread layout and settings
Summary Table
Factor | Recommendation
Threads | Match to CPU cores
Scheduling | Choose based on loop balance
Synchronization | Use reduction, avoid critical
Memory behavior | Prevent false sharing
Task size (granularity) | Enough work per thread
Affinity | Use core binding for locality