Open MP

OpenMP is a model for parallel programming that simplifies the development of parallel applications, particularly for loop-level parallelism. Key challenges include handling data dependencies, managing shared and private variables, addressing load imbalance, and mitigating false sharing, all of which require careful consideration and specific solutions like data scoping clauses and scheduling options. Effective use of OpenMP involves understanding these challenges and applying appropriate strategies to ensure correct and efficient parallel execution.

Unit 5: OpenMP - Summary

1. Threading Loops with OpenMP: Navigating the Challenges
OpenMP offers a powerful yet straightforward approach to parallelizing code, particularly loops, making it
a popular choice for leveraging multi-core processors. However, transforming a sequential loop into a
correct and efficient parallel one is not always a simple matter of adding a pragma. Several challenges
can arise, from ensuring data integrity to optimizing performance. Understanding these hurdles is key to
effectively using OpenMP.

Data Dependencies: The Arch-Nemesis of Parallel Loops

The most significant challenge in parallelizing loops is handling data dependencies. A data dependency,
or loop-carried dependency, exists when an iteration of a loop relies on the result of a previous iteration.
Forcing such a loop to execute in parallel will lead to incorrect results, as the order of execution is no
longer guaranteed.

Consider a simple prefix sum calculation:

C
for (int i = 1; i < n; i++) {
data[i] = data[i-1] + input[i];
}
In this case, each iteration depends directly on the value computed in the i-1 iteration. If multiple threads
execute different iterations simultaneously, they will be reading data[i-1] before it has been correctly updated
by the thread handling the preceding iteration, leading to a race condition.

Solution: OpenMP itself doesn't automatically resolve loop-carried dependencies. The programmer must
refactor the algorithm to remove the dependency. In some cases, this is not possible, and the loop cannot
be parallelized with a simple omp for directive. For specific patterns like reductions (see below), OpenMP
provides dedicated clauses.
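
As a hedged illustration (not taken from the text above), a dependency that comes only from a running induction variable can often be removed by computing each iteration's value directly from i; x0, step, a, and n are placeholders here:

/* Sequential version: 'x' carries a value from one iteration to the next. */
double x = x0;
for (int i = 0; i < n; i++) {
    x += step;
    a[i] = x;
}

/* Refactored version: each iteration derives its value from 'i' alone,
   so the iterations are independent and can run in parallel. */
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    a[i] = x0 + (i + 1) * step;
}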

Race Conditions and Data Sharing: Who Owns What?

Even without explicit loop-carried dependencies, incorrect management of shared data can lead to race
conditions. This occurs when multiple threads attempt to read and write to the same memory location
without proper synchronization, resulting in unpredictable outcomes.

Challenges:

• Shared vs. Private Variables: Deciding which variables should be accessible by all threads
(shared) and which should have a unique copy for each thread (private) is crucial. A common
mistake is to implicitly share a temporary variable used within the loop, leading to data corruption.
• Reductions: A frequent pattern in loops is the accumulation of a value, such as summing up all
elements of an array. A naive parallelization would cause multiple threads to update the shared
sum variable concurrently, leading to lost updates.

Solutions:

• Data Scoping Clauses: OpenMP provides clauses to control the data environment:
o private(list): Each thread gets its own private copy of the variables in the list.
o firstprivate(list): Similar to private, but the private copy is initialized with the value of the master
thread's variable before the parallel region.

o lastprivate(list): The value of the private variable from the sequentially last iteration is copied back to the master thread's variable after the loop.
o shared(list): Variables in the list are shared among all threads. This is often the default.
• reduction Clause: For operations like summation, multiplication, or logical operations, the reduction
clause is the ideal solution. It creates a private copy of the reduction variable for each thread,
performs the operation locally, and then combines the results from all threads into the final shared
variable after the loop.

#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++) {
    sum += array[i];
}
Load Balancing: Keeping All Threads Busy

Not all loop iterations take the same amount of time to execute. If the workload is unevenly distributed,
some threads may finish their assigned iterations long before others, leading to idle cores and inefficient
parallelization. This is known as load imbalance.

Challenges:

• Static Work Distribution: By default, OpenMP often divides the loop iterations statically among
threads. If the workload is concentrated in the early or late iterations, this static division is
suboptimal.

Solutions:

• The schedule Clause: OpenMP provides the schedule clause to control how loop iterations are
distributed among threads:
o schedule(static, chunk_size): Divides iterations into fixed-size chunks and assigns them to
threads in a round-robin fashion. This has low overhead but can be inefficient for imbalanced
loads.
o schedule(dynamic, chunk_size): Threads request a chunk of iterations as they become available.
This provides better load balancing but has higher overhead due to the dynamic assignment.
o schedule(guided, chunk_size): Starts with large chunks and progressively reduces the chunk
size. This is a compromise between static and dynamic scheduling, often providing good
performance.
o schedule(auto): The compiler and runtime system choose the most appropriate schedule.
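
For example (a minimal sketch; work() stands in for a loop body whose cost varies from iteration to iteration, and n is a placeholder bound):

#pragma omp parallel for schedule(guided)
for (int i = 0; i < n; i++) {
    work(i);   /* uneven per-iteration cost: guided keeps all threads busy with modest overhead */
}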

The Hidden Performance Killer: False Sharing

Even when data is correctly declared as private, performance can be unexpectedly poor due to a
hardware-level issue called false sharing. This occurs when private variables of different threads happen
to reside on the same cache line. A cache line is the smallest unit of memory that can be transferred
between the main memory and the CPU cache.

When one thread writes to its private variable, the entire cache line is invalidated for all other threads that
have a copy of it, even if they are accessing different variables within that same cache line. This forces
the other threads to fetch the updated cache line from a higher level of memory, a much slower operation.

Solution: The primary way to mitigate false sharing is to ensure that private variables used by different
threads are located on different cache lines. This can be achieved through padding, by adding unused
variables to increase the memory separation between the critical variables. Some compilers and libraries
may also provide alignment directives.
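
A minimal sketch of the padding idea (assuming a 64-byte cache line, which is typical but not guaranteed; CACHE_LINE, MAX_THREADS, count_hits, data, n, and threshold are placeholders introduced for this example):

#include <omp.h>

#define CACHE_LINE 64        /* assumed cache-line size in bytes */
#define MAX_THREADS 64       /* hypothetical upper bound on the thread count */

/* Each thread's counter occupies its own padded slot, so no two threads
   ever write to the same cache line. */
typedef struct {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter_t;

padded_counter_t counts[MAX_THREADS];

void count_hits(const int *data, int n, int threshold)
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        counts[id].value = 0;
        #pragma omp for
        for (int i = 0; i < n; i++)
            if (data[i] > threshold)
                counts[id].value++;   /* each thread touches only its own cache line */
    }
}

Because every slot is at least one cache line wide, one thread's increments never invalidate the line holding another thread's counter.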

Overhead: The Cost of Parallelism

Creating and managing threads is not free. There is an inherent overhead associated with starting a
parallel region, synchronizing threads, and distributing work. For loops with very little work per iteration,
the overhead of parallelization can outweigh the benefits, leading to a slowdown compared to the
sequential version.

Solution: As a rule of thumb, only parallelize loops that have a significant amount of work. For nested
loops, it is generally more efficient to parallelize the outer loop, as this minimizes the number of times a
parallel region is entered and exited. The collapse clause can be used to parallelize multiple nested loops
together, increasing the total number of iterations to be distributed.
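
A brief sketch of both options (the matrices a, b, c and the dimensions rows and cols are placeholders):

/* Option 1: parallelize only the outer loop - one parallel region, coarse chunks of work. */
#pragma omp parallel for
for (int i = 0; i < rows; i++)
    for (int j = 0; j < cols; j++)
        c[i][j] = a[i][j] + b[i][j];

/* Option 2: collapse both loops into a single iteration space of rows*cols,
   useful when 'rows' alone is too small to keep every thread busy. */
#pragma omp parallel for collapse(2)
for (int i = 0; i < rows; i++)
    for (int j = 0; j < cols; j++)
        c[i][j] = a[i][j] + b[i][j];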

I. OpenMP is a portable, scalable model for shared-memory parallel programming in C, C++, and Fortran. It simplifies the development of parallel applications on multi-core systems by using compiler directives (like #pragma omp) and a runtime library. One of the most common applications is loop-level parallelism, but this is not always straightforward due to various challenges. Below is a detailed explanation.
Topic: Challenges in Threading a Loop Using OpenMP

1. Loop-Carried Dependence (Data Dependency)


What it is:

A loop-carried dependence occurs when an iteration of a loop depends on the result of a previous iteration.

Why it's a problem:

If iteration i must wait for iteration i-1, then you cannot parallelize these iterations without introducing incorrect
behavior or data races.

Example:
int A[100];
for (int i = 1; i < 100; i++) {
A[i] = A[i-1] + 1; // i-th value depends on (i-1)-th value
}

Result:

• Cannot parallelize directly.


• Must refactor the loop or use other algorithms (e.g., prefix sum).

2. Loop Iteration Independence


What’s needed for safe threading:

To parallelize, iterations must be independent.

// This is parallelizable
int A[100];
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
A[i] = i * 2;
}

Here, each A[i] is independent. So OpenMP can safely divide iterations across threads.

3. Reduction Variables (Summation, etc.)


What it is:

Sometimes, all threads contribute to a single shared variable, such as sum or max.

Why it's a problem:

This introduces a race condition if not handled properly.

OpenMP Solution:

Use reduction clause:

int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 100; i++) {
sum += i;
}

OpenMP handles:

• Local copy per thread


• Combines results after loop ends

4. Shared vs Private Variables


What it is:

Correct use of shared and private clauses is crucial to avoid unexpected side effects.
Common mistake:
int temp;
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
temp = i * 2; // All threads write to same 'temp'!
// ... do something with temp ...
}

Fix:
#pragma omp parallel for private(temp)
for (int i = 0; i < 100; i++) {
    temp = i * 2; // Each thread now writes to its own private copy of 'temp'
}

5. Load Imbalance
What it is:

Some loop iterations may take longer than others, causing some threads to finish earlier and sit idle.

Problem:

• Leads to under-utilization of CPU cores.

Solution: Use schedule clause


#pragma omp parallel for schedule(dynamic, 10)
for (int i = 0; i < N; i++) {
// Each chunk of 10 iterations is dynamically assigned
}
Schedule Type | Description
static        | Assigns iterations in chunks ahead of time (good for uniform workload)
dynamic       | Assigns chunks at runtime (good for uneven work)
guided        | Like dynamic but chunk size shrinks
auto          | Let compiler/runtime decide

6. Loop Index Control


Requirement:

Loops must follow simple patterns:

for (init; condition; increment)

Supported:
for (int i = 0; i < N; i++) // OK
for (int i = N; i > 0; i--) // OK

Not supported:
for (int i = 1; i < N; i *= 2) // BAD: non-canonical (multiplicative) update

Such complex loop control expressions cannot be parallelized using #pragma omp parallel for.
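
One possible workaround (a sketch, in line with the "refactor the loop" advice; process() is a hypothetical per-step function) is to count the steps once and then iterate over a canonical index:

/* Original, non-canonical form: for (int i = 1; i < N; i *= 2) { process(i); } */

int steps = 0;
for (int v = 1; v < N; v *= 2)
    steps++;                     /* cheap sequential pass to count the iterations */

#pragma omp parallel for
for (int k = 0; k < steps; k++) {
    int i = 1 << k;              /* the value the original loop variable would have had */
    process(i);
}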

7. Nested Loops
Problem:

• OpenMP doesn’t automatically parallelize nested loops unless specified.


• Outer loop gets parallelized by default.

Solution:
#pragma omp parallel for collapse(2)
for (int i = 0; i < 10; i++) {
for (int j = 0; j < 10; j++) {
A[i][j] = i + j;
}
}

• collapse(2) merges 2 loops into a single iteration space.


• Good for 2D data or matrix ops.

8. False Sharing
What it is:

Occurs when threads update variables close in memory, causing cache invalidation.

Example:
int A[4];
#pragma omp parallel for
for (int i = 0; i < 4; i++) {
A[i] = i; // May reside on the same cache line
}

Fix:

• Padding (e.g., A[4][CACHE_LINE_SIZE])


• Use of thread-local storage

Summary Table: Loop Threading Challenges

Challenge               | Cause                                    | Solution
Loop-Carried Dependency | Iteration i depends on i-1               | Refactor loop, prefix sum
Reduction               | Multiple threads updating same variable  | Use reduction(+:sum)
Shared Variables        | Unintended variable sharing              | Use private(var)
Load Imbalance          | Uneven iteration times                   | Use schedule(dynamic)
Complex Loops           | Non-linear index update                  | Refactor loop
Nested Loops            | Only outer loop parallelized             | Use collapse(n)
False Sharing           | Cache conflicts                          | Add padding or use local vars

II. OpenMP Loop Threading Challenges: C Example with Detailed Line-by-Line Explanation
// File: loop_challenges_openmp.c
#include <stdio.h>
#include <omp.h>

#define SIZE 10

int main() {
int A[SIZE], sum = 0;

// --- [1] Loop-Carried Dependence ---


// Cannot be parallelized due to dependency between iterations
A[0] = 1;
for (int i = 1; i < SIZE; i++) {
A[i] = A[i - 1] + 1;
}

printf("Loop-Carried Dependence Result: ");


for (int i = 0; i < SIZE; i++) printf("%d ", A[i]);
printf("\n");

// --- [2] Correct Parallel Reduction ---


sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < SIZE; i++) {
sum += i;
}

printf("Parallel Reduction Result: %d\n", sum);

// --- [3] Shared vs Private Variables ---


// Incorrect version: Shared temp causes race condition
printf("Incorrect Shared Temp (Race Condition):\n");
int temp; // Shared variable
#pragma omp parallel for
for (int i = 0; i < SIZE; i++) {
temp = i * 2;
printf("Thread %d: temp = %d\n", omp_get_thread_num(), temp);
}

// Correct version: temp is private to each thread


printf("Correct Private Temp:\n");
#pragma omp parallel for private(temp)
for (int i = 0; i < SIZE; i++) {
temp = i * 2;
printf("Thread %d: temp = %d\n", omp_get_thread_num(), temp);
}

// --- [4] Load Imbalance Demo ---


// Some iterations are artificially heavier
printf("Dynamic Scheduling to Handle Load Imbalance:\n");
#pragma omp parallel for schedule(dynamic, 2)
for (int i = 0; i < SIZE; i++) {
for (volatile int j = 0; j < (i+1)*1000000; j++); // Simulate heavy work
printf("Thread %d completed iteration %d\n", omp_get_thread_num(), i);
}

return 0;
}

Line-by-Line Explanation
Header and Setup
#include <stdio.h>
#include <omp.h>
#define SIZE 10

• stdio.h for printf


• omp.h is required for OpenMP functions
• SIZE is loop bound (10 iterations)

[1] Loop-Carried Dependence (Not Parallelizable)


A[0] = 1;
for (int i = 1; i < SIZE; i++) {
A[i] = A[i - 1] + 1;
}

• Each A[i] depends on the value from A[i-1]


• Cannot be parallelized safely, so we run it sequentially

[2] Reduction with reduction(+:sum)


sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < SIZE; i++) {
sum += i;
}

• Each thread gets a private copy of sum


• Final sum is automatically combined after the loop
• Prevents race condition

[3] Shared vs Private Variable

Incorrect: temp is shared

int temp;
#pragma omp parallel for
for (int i = 0; i < SIZE; i++) {
temp = i * 2;
printf("Thread %d: temp = %d\n", omp_get_thread_num(), temp);
}

• All threads write to the same memory address, causing a race condition
• Output will be interleaved and inconsistent

Correct: temp is private

#pragma omp parallel for private(temp)
for (int i = 0; i < SIZE; i++) {
temp = i * 2;
printf("Thread %d: temp = %d\n", omp_get_thread_num(), temp);
}

• Each thread has its own version of temp


• Avoids side effects and race conditions

[4] Load Imbalance with Dynamic Scheduling


#pragma omp parallel for schedule(dynamic, 2)
for (int i = 0; i < SIZE; i++) {
for (volatile int j = 0; j < (i+1)*1000000; j++); // simulate work
printf("Thread %d completed iteration %d\n", omp_get_thread_num(), i);
}

• Each iteration takes more time ((i+1)*1M loop) → non-uniform workload


• schedule(dynamic, 2) hands out work in chunks of 2 iterations as threads become free, adapting to the uneven workload
• Threads are reused as they finish → better load balancing

How to Compile and Run


gcc -fopenmp loop_challenges_openmp.c -o loop_challenges
./loop_challenges

2. OpenMP and the Challenge of Loop-Carried Dependence
In the realm of parallel computing with OpenMP, loop-carried dependence is the single most critical
concept to understand and respect. It is the primary obstacle that prevents a simple for loop from being
safely and correctly parallelized. Attempting to parallelize a loop with an unhandled dependence will almost
certainly lead to incorrect results, creating a subtle and frustrating class of bugs known as race conditions.

What is Loop-Carried Dependence?

A loop-carried dependence exists when an iteration of a loop relies on the outcome of a previous iteration.
The "dependence" is "carried" from one iteration to another, imposing a strict sequential order of execution.

OpenMP's parallel for construct works by breaking a loop's iterations into chunks and assigning them to
different threads to be executed simultaneously. If a dependence exists, this simultaneous execution
violates the required sequential order, causing threads to read variables before they are correctly updated
or to overwrite values needed by other threads.

Let's break down the different types of loop-carried dependencies with clear examples.

1. Flow Dependence (True Dependence)

This is the most common and intuitive type of dependence. It occurs when a statement in a later iteration
reads a value written by a statement in an earlier iteration.

Classic Example: The Prefix Sum

C
// data_in = {3, 5, 2, 8, ...}
int data_out[N];
data_out[0] = data_in[0];

for (int i = 1; i < N; i++) {


// READS data_out[i-1] which was WRITTEN in the previous iteration
data_out[i] = data_out[i-1] + data_in[i];
}
• Dependence: The calculation of data_out[i] is dependent on the final value of data_out[i-1].
• Why OpenMP Fails: Imagine Thread 0 is assigned i = 1 and Thread 1 is assigned i = 2.
o Thread 1 needs data_out[1] to compute data_out[2].
o However, Thread 0 might not have finished calculating and writing to data_out[1] before
Thread 1 attempts to read it.
o Thread 1 will likely read an old, incorrect value from data_out[1], breaking the entire calculation
chain.

Special Case: Reductions

A reduction is a specific type of flow dependence where a single variable accumulates a result.

C
double sum = 0.0;
for (int i = 0; i < N; i++) {
sum = sum + array[i]; // READS and then WRITES to 'sum'
}
Here, each iteration reads the value of sum from the previous iteration, adds to it, and writes it back.
Parallelizing this naively would cause multiple threads to read and write to sum simultaneously, overwriting
each other's work.

OpenMP Solution for Reductions: This pattern is so common that OpenMP provides a specific clause
to handle it safely: reduction.

C
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum = sum + array[i];
}
The reduction(+:sum) clause tells OpenMP to create a private, thread-local copy of sum for each thread
(initialized to 0). Each thread calculates its partial sum. At the end of the loop, OpenMP automatically and
safely combines all the partial sums into the final global sum variable.


2. Anti-Dependence

An anti-dependence occurs when a statement in a later iteration writes to a memory location that is read
by a statement in an earlier iteration. The order is critical: the read must happen before the write.

Example:

C
for (int i = 0; i < N - 1; i++) {
// Iteration 'i' READS a[i+1]
a[i] = b[i] + a[i+1];
// Iteration 'i+1' will WRITE to a[i+1]
}
• Dependence: Iteration i needs to read the original value of a[i+1] before iteration i+1 overwrites it.
• Why OpenMP Fails: Imagine Thread 0 gets i = 0 and Thread 1 gets i = 1.
o Thread 0 needs to read the original a[1].
o Thread 1 needs to calculate and write a new value to a[1].
o If Thread 1 executes first, it will overwrite a[1], and Thread 0 will then read the new, incorrect
value, violating the program's logic.

Solution: Often, anti-dependencies can be resolved by restructuring the code. In this case, creating a
temporary copy of the array a could break the dependence, though at the cost of memory.
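
A sketch of that restructuring (the buffer a_copy, the int element type, and the explicit malloc/memcpy are assumptions introduced here for illustration):

#include <stdlib.h>
#include <string.h>

/* Assuming 'a' and 'b' are int arrays with at least N elements. */
int *a_copy = malloc(N * sizeof *a_copy);
memcpy(a_copy, a, N * sizeof *a_copy);   /* preserve the original values of 'a' */

#pragma omp parallel for
for (int i = 0; i < N - 1; i++) {
    a[i] = b[i] + a_copy[i + 1];         /* every iteration reads the saved old value */
}

free(a_copy);

Because each iteration now reads from a_copy, which no iteration writes, the anti-dependence disappears at the cost of one extra array.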

3. Output Dependence

An output dependence occurs when two different iterations write to the same memory location. The order
of the writes is crucial for the final state of the variable.

Example:

C
for (int i = 0; i < N; i++) {
// Every iteration writes to the same variable 'x'
x = compute_something(i);
a[i] = x * 2;
}
• Dependence: The final value of x after the loop should be the one written by the very last iteration
(i = N-1).
• Why OpenMP Fails: With parallel execution, threads handling different iterations will be writing to
x in an unpredictable order. The final value of x after the parallel loop is indeterminate—it could be
the result from any of the threads, not necessarily the last one.

Solution: The use of the shared variable x is the problem. This can be resolved by making x local to the
loop's scope.

C
// Corrected Version
#pragma omp parallel for
for (int i = 0; i < N; i++) {
double x = compute_something(i); // 'x' is now private to each iteration
a[i] = x * 2;
}
By declaring x inside the loop, each thread gets its own private copy. OpenMP automatically handles this
for variables declared within the lexical scope of the loop. If x were declared outside the loop, you would
need to explicitly mark it with #pragma omp parallel for private(x).

How to Handle Loop-Carried Dependencies

Since OpenMP's parallel for directive does not automatically resolve these dependencies (except for the
reduction pattern), the responsibility falls to the programmer:

1. Identify the Dependence: Before adding any #pragma omp, you must analyze the loop to see if any
iteration depends on another. Ask yourself: "Does iteration i use a value calculated in iteration i-1?"
or "Do multiple iterations write to the same location where the final value matters?"
2. Choose a Strategy:
o Do Not Parallelize: If a true dependence is fundamental to the algorithm (like the prefix
sum), the loop cannot be parallelized as-is.
o Use reduction: If the dependence is a reduction pattern, use the reduction clause. This is the
simplest and most common solution for its specific pattern.
o Restructure the Algorithm: The most powerful technique is to rethink and rewrite the
algorithm to remove the dependence entirely. This might involve using temporary arrays,
changing the order of calculations, or adopting a completely different parallel-friendly
algorithm (e.g., a parallel prefix sum algorithm instead of a sequential one).
o Use Explicit Synchronization: In rare and complex cases, you might use #pragma omp critical
or #pragma omp atomic inside a loop. However, this introduces significant overhead and can
serialize the execution, defeating the purpose of parallelism. It is generally a last resort and
an indicator that a parallel for might not be the right tool for the job.

I. What is a Loop-Carried Dependence?


A loop-carried dependence occurs when one iteration of a loop depends on the result of a previous iteration.

Key Feature:

• Iteration i reads or writes data that iteration i-1 has written or read.
• This creates a sequential dependency, blocking parallel execution.

Why It’s a Problem in OpenMP


OpenMP relies on parallel execution of independent loop iterations.

If each iteration depends on the previous one’s result, you cannot safely distribute those iterations across multiple
threads.

Example 1: Loop with Dependence (not parallelizable)


int A[100];
A[0] = 1;
for (int i = 1; i < 100; i++) {
A[i] = A[i - 1] + 1;
}

Line-by-Line:

• A[0] = 1;: Initializes the first element.


• A[i] = A[i-1] + 1;: Each A[i] depends on the previous A[i-1].

Why It Can’t Be Parallelized:

• If you tried to do this with OpenMP:

#pragma omp parallel for
for (int i = 1; i < 100; i++) {
    A[i] = A[i - 1] + 1; // Data race here!
}

• Multiple threads could read and write to adjacent elements out of order, producing incorrect results.

Visualization: Dependency Chain


Iteration: i=1 i=2 i=3 i=4 ...
↓ ↓ ↓ ↓
Dependency: A[1]←A[0], A[2]←A[1], A[3]←A[2]...

• Each computation waits for the one before it to finish.

Example 2: Loop without Dependence (parallelizable)


int B[100];
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
B[i] = i * 2;
}

• Here, every B[i] is independent — no reliance on B[i-1].

Types of Loop-Carried Dependence


Type              | Example                 | Can Be Fixed?
True dependence   | A[i] = A[i-1] + 1;      | Usually not
Anti-dependence   | A[i-1] = A[i] + 1;      | Sometimes
Output dependence | A[i] = ..., A[i] = ...  | Rename variable

Workaround Techniques
1. Transform the algorithm

Use prefix sums, parallel scan, or block-based methods:

// Not simple, but scan algorithms remove loop-carried dependence

2. Pipeline the loop manually


Break the loop into sequential + parallel parts if only some elements depend.
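
As an illustration of the block-based idea mentioned above, here is a minimal sketch (the function name block_prefix_sum, the assumption that n is an exact multiple of the thread count, and the 256-thread bound are choices made for this example, not part of the original text):

#include <omp.h>

/* Block-based inclusive prefix sum of 'a' in place.
   Assumes n is an exact multiple of the number of threads and nthreads <= 256. */
void block_prefix_sum(int *a, int n)
{
    int nthreads = omp_get_max_threads();
    int block = n / nthreads;
    int offsets[256] = {0};

    /* Phase 1 (parallel): each thread scans its own block sequentially. */
    #pragma omp parallel for
    for (int t = 0; t < nthreads; t++)
        for (int i = t * block + 1; i < (t + 1) * block; i++)
            a[i] += a[i - 1];

    /* Phase 2 (sequential, cheap): running sum of the block totals. */
    for (int t = 1; t < nthreads; t++)
        offsets[t] = offsets[t - 1] + a[t * block - 1];

    /* Phase 3 (parallel): shift each block by the sum of all earlier blocks. */
    #pragma omp parallel for
    for (int t = 1; t < nthreads; t++)
        for (int i = t * block; i < (t + 1) * block; i++)
            a[i] += offsets[t];
}

Only Phase 2 is sequential, and it touches just one element per block, so the loop-carried part of the work shrinks from n steps to nthreads steps.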

Summary Table
Feature                 | Explanation
Loop-Carried Dependence | When an iteration depends on a previous one
Problem                 | Prevents safe parallelization in OpenMP
Example                 | A[i] = A[i - 1] + 1
Solution                | Redesign algorithm (prefix sum, recursion, or pipeline)
Parallel-Safe?          | No, if a true data dependency exists

II. Here's a refactored OpenMP implementation using the prefix sum (scan) technique that eliminates the loop-carried dependence, along with a detailed line-by-line explanation. (The version below computes an exclusive scan; the note at the end shows how to obtain the inclusive result.)

Problem: Loop-Carried Dependence


Original (sequential and not parallelizable):

A[0] = 1;
for (int i = 1; i < N; i++) {
A[i] = A[i-1] + input[i];
}

Here, A[i] depends on A[i-1], causing a true loop-carried dependence.

Goal: Parallel Prefix Sum (Scan)


We’ll convert the loop into a parallel version using the Blelloch scan (two-pass algorithm):

Refactored Code Using OpenMP – Line-by-Line Explanation


// File: parallel_prefix_sum.c
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>

#define N 8

int main() {
int input[N] = {1, 2, 3, 4, 5, 6, 7, 8};
int prefix[N];

int i, j;

int logN = 3; // log2(8) = 3

• We use input[] as the array we want to prefix-sum.


• prefix[] will hold the result.
• We will simulate a parallel scan with OpenMP (on small input for clarity).
• logN is the number of passes required.

Step 1: Copy Input to Prefix


for (i = 0; i < N; i++) {
prefix[i] = input[i]; // Copy input to prefix
}

Step 2: Upsweep (Reduce) Phase


for (int d = 0; d < logN; d++) {
int stride = 1 << d; // 2^d
#pragma omp parallel for
for (i = stride; i < N; i += (stride << 1)) {
prefix[i + stride - 1] += prefix[i - 1];
}
}

• stride = 2^d increases each round: 1, 2, 4.


• Each pass adds left neighbors to simulate a tree structure.
• Parallel for is safe here because each update targets distinct indices.
• Upsweep builds partial sums.

Step 3: Set Last Element to 0 (Exclusive Prefix Sum)


prefix[N - 1] = 0;

• In exclusive scan, last element is initialized to 0 before down-sweep.

Step 4: Downsweep Phase


for (int d = logN - 1; d >= 0; d--) {
int stride = 1 << d;
#pragma omp parallel for
for (i = stride; i < N; i += (stride << 1)) {
int t = prefix[i - 1];
prefix[i - 1] = prefix[i + stride - 1];
prefix[i + stride - 1] += t;
}
}

• This pass distributes prefix sums to correct positions.


• Swaps and adds maintain correctness while preserving parallelism.
• All operations are independent per thread chunk.

Step 5: Print Results


printf("Prefix sum result:\n");
for (i = 0; i < N; i++) {
printf("%d ", prefix[i]);
}
printf("\n");

return 0;
}

Output for input = {1,2,3,4,5,6,7,8} :


Prefix sum result:
0 1 3 6 10 15 21 28

• Note: this is exclusive scan. To get inclusive version:

prefix[i] += input[i];

...after the final phase.

Summary: Two-Pass Prefix Sum Logic


Phase     | Name         | Purpose
Upsweep   | Reduction    | Builds partial sums in tree fashion
Downsweep | Distribution | Computes final prefix values

🛠 Compile and Run


gcc -fopenmp parallel_prefix_sum.c -o prefix
./prefix

3. OpenMP and the Peril of Data-Race Conditions


In the world of parallel programming, a data-race condition is one of the most common and dangerous
types of bugs. It occurs when two or more threads in a parallel region attempt to access the same memory
location concurrently, and at least one of those accesses is a write operation. The result of this "race" is
unpredictable and non-deterministic, as the final value of the memory location depends on the precise,
and often uncontrollable, timing of thread execution. OpenMP, while simplifying many aspects of
threading, does not automatically prevent data races; it provides the programmer with the tools to manage
them, making a thorough understanding essential for writing correct parallel code.

The Anatomy of a Data Race

At its core, a data race has three key ingredients:

1. Shared Data: A variable or memory location is accessible by multiple threads.
2. Concurrent Access: Multiple threads try to read from or write to this shared data at the same time.
3. At Least One Write: At a minimum, one of the threads is attempting to modify the data. If all threads
were only reading, there would be no data race.

The outcome is that the program's result can change each time it is run, depending on which thread "wins"
the race. This non-deterministic behavior makes data races notoriously difficult to debug.

Classic Example: The Unprotected Counter

Consider a simple loop designed to count the number of elements in an array that meet a certain condition.

The Sequential (Correct) Code:

C
int count = 0;
for (int i = 0; i < 100000; i++) {
if (array[i] > some_value) {
count++; // This is the critical operation
}
}
// 'count' will hold the correct total
Now, let's try to parallelize this with OpenMP in a naive way:

The Parallel (Incorrect) Code with a Data Race:

C
int count = 0;
#pragma omp parallel for
for (int i = 0; i < 100000; i++) {
if (array[i] > some_value) {
count++; // DATA RACE!
}
}
// 'count' will almost certainly be wrong!
Why does this fail?

By default, the count variable declared outside the parallel region is shared among all threads. The operation
count++ is not a single, instantaneous machine instruction. It typically involves three separate steps:

1. Read: Read the current value of count from memory into a CPU register.
2. Increment: Add one to the value in the register.
3. Write: Write the new value from the register back to count in memory.

Now, imagine two threads, Thread A and Thread B, executing this concurrently:

Time | Thread A Action                | Thread B Action                | Value of count in Memory
1    | Reads count (gets 5)           | -                              | 5
2    | -                              | Reads count (gets 5)           | 5
3    | Increments its register (to 6) | -                              | 5
4    | -                              | Increments its register (to 6) | 5
5    | Writes 6 to count              | -                              | 6
6    | -                              | Writes 6 to count              | 6
Two increments occurred, but the final value is 6, not 7. This is a "lost update." Because millions of such
operations can happen in any order, the final result is garbage.

Solving Data Races in OpenMP

OpenMP provides a suite of solutions to prevent data races, centered around two main strategies:
avoiding sharing and controlling access.

1. Avoiding Sharing: Data-Sharing Attribute Clauses

The best way to prevent a data race is to eliminate shared data altogether. If each thread has its own
private copy of a variable, there is no possibility of conflict.

• private(list): This is the most fundamental clause. It declares that each thread should have its own uninitialized, private copy of the variables in the list.

Example: Fixing a Temporary Variable

// INCORRECT: 'temp' is shared, causing a race


double temp;
#pragma omp parallel for
for (int i = 0; i < N; i++) {
temp = input[i] * 5.0; // Race: all threads write to the same 'temp'
output[i] = process(temp);
}

// CORRECT: Each thread gets its own 'temp'


double temp;
#pragma omp parallel for private(temp)
for (int i = 0; i < N; i++) {
temp = input[i] * 5.0;
output[i] = process(temp);
}
// Note: A better way is to declare 'temp' inside the loop, which makes it implicitly private.
• firstprivate(list): Like private, but the private copy for each thread is initialized with the value of the master thread's variable before the parallel region begins.
• lastprivate(list): Like private, but the value from the thread that executes the sequentially last iteration
is copied back to the master thread's variable after the loop.

2. Controlling Access: Synchronization

Sometimes, data must be shared. The classic counter example needs to update a single, final total. In
these cases, OpenMP provides mechanisms to ensure that only one thread can access the shared data
at a time.

• reduction(operator:list): This is the preferred solution for the common data race pattern involving accumulation (sum, product, max, min, etc.). It elegantly solves the problem by creating private copies for the operation and then safely combining them at the end.

Example: The Correct Way to Count

int count = 0;
// OpenMP creates a private 'count' for each thread, initialized to 0.
// At the end, it safely adds all private counts to the global 'count'.
#pragma omp parallel for reduction(+:count)
for (int i = 0; i < 100000; i++) {

if (array[i] > some_value) {
count++; // Each thread increments its own private copy
}
}
// 'count' is now correct!
• critical: This directive defines a block of code (a "critical section") that can only be executed by one
thread at a time. It's a more general, but also more heavyweight, solution than reduction.

Example: Using a Critical Section

int count = 0;
#pragma omp parallel for
for (int i = 0; i < 100000; i++) {
if (array[i] > some_value) {
#pragma omp critical
{
count++;
}
}
}
// This is correct, but slower than reduction due to higher overhead.
• atomic: This directive applies only to the single C/C++ statement that immediately follows it. It tells
OpenMP to ensure this specific update to a memory location happens atomically (as an
uninterruptible operation). It is more lightweight than critical but is limited to simple update statements
(like x++, x--, x += val).

Example: Using Atomic

int count = 0;
#pragma omp parallel for
for (int i = 0; i < 100000; i++) {
if (array[i] > some_value) {
#pragma omp atomic
count++;
}
}
// Correct and generally more efficient than a critical section for this use case.
Summary: A Hierarchy of Solutions

When faced with potential data races in OpenMP, follow this general thought process:

1. Identify Shared Data: First, determine which variables are shared among threads.
2. Privatize When Possible: Can the variable be made private? If it's a temporary variable used only
within a loop iteration, declare it inside the loop or use the private clause. This is the best solution as
it involves no synchronization overhead.
3. Use reduction for Accumulators: If the operation is a sum, product, or other reduction, use the
reduction clause. It is optimized for this exact pattern and is far more efficient than manual locking.
4. Use atomic for Simple Updates: If you need to protect a simple, single update to shared memory
that isn't a reduction, atomic is the next best choice due to its lower overhead.
5. Use critical for Complex Updates: If you need to protect a larger, more complex block of code
involving multiple statements that must be executed together without interruption, use a critical
section. Use this sparingly, as it can significantly reduce parallelism.

I. What is a Data Race


A data race occurs when:

• Two or more threads access the same memory location


• At least one of the accesses is a write
• The accesses are not synchronized

Result: Unpredictable behavior, corrupted data, or crashes

Example of a Data Race


#include <stdio.h>
#include <omp.h>

int main() {
int sum = 0;

#pragma omp parallel for
for (int i = 0; i < 100; i++) {
sum += i; // Shared variable: Unsafe
}

printf("Sum = %d\n", sum);


return 0;
}

Explanation:

• sum is shared among all threads.


• Each thread performs sum += i at the same time.
• The += operation is not atomic; it’s a read-modify-write.

Data Race Timeline (Conceptual)


Time | Thread A       | Thread B       | Final Result
t1   | Reads sum = 10 |                |
t2   |                | Reads sum = 10 |
t3   | sum += 5 → 15  |                |
t4   |                | sum += 7 → 17  | LOST 5

Final sum should be 10 + 5 + 7 = 22, but due to overlap, we only got 17.

How to Fix: Use OpenMP reduction


int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 100; i++) {
sum += i; // Now safe
}
Explanation:

• OpenMP creates a private sum for each thread.


• After the loop, OpenMP adds them up safely.
• No overlapping memory access → No data race

Another Example: Race on Array Element


int A[100] = {0};

#pragma omp parallel for
for (int i = 0; i < 100; i++) {
    A[i % 10] += i; // 100 iterations, spread across threads, all write into just 10 shared slots!
}

Problem:

• Threads collide on same 10 slots (A[0] to A[9])


• Results will vary run-to-run

How to Fix: Avoid Overlap


#pragma omp parallel for
for (int i = 0; i < 100; i++) {
    #pragma omp atomic
    A[i % 10] += i; // atomic update: threads no longer race on the shared slots
}

Alternatively, accumulate into thread-private partial arrays (thread-local memory) and combine them after the loop, or protect the update with a lock or critical section.

Other Ways to Prevent Data Races in OpenMP


Method    | Description
reduction | Safe accumulation over shared variable
critical  | One thread at a time executes this block
atomic    | Lightweight atomic updates
private   | Each thread gets its own variable instance
nowait    | Avoid unnecessary barriers between threads
barrier   | Forces sync when needed

#pragma omp critical

#pragma omp critical
{
    sum += i;
}

• Slower than reduction, but good for complex code inside the block

⚛ #pragma omp atomic

#pragma omp atomic
sum += i;

• Only works on simple assignments or updates


• More efficient than critical

Summary Table
Aspect             | Explanation
What is Data Race? | Uncoordinated access to shared memory
Cause              | Multiple threads writing to same variable
Symptoms           | Inconsistent outputs, crashes
Detection          | Use compiler flags (-fopenmp -fsanitize=thread)
Solutions          | reduction, private, critical, atomic
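
As a concrete illustration of the Detection row above (a typical GCC/Clang invocation; the exact flags and the reliability of the reports depend on your compiler and OpenMP runtime, and data_race.c is a placeholder file name):

gcc -fopenmp -g -fsanitize=thread data_race.c -o data_race
./data_race    # ThreadSanitizer prints a report when it observes conflicting accesses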

4. OpenMP: A Detailed Guide to Managing Shared and Private Data
In OpenMP, understanding and correctly managing the distinction between shared and private data is
the most fundamental skill for writing correct and efficient parallel programs. This control over data scoping
is the primary mechanism for preventing data-race conditions and ensuring that each thread works on the
correct set of data. Mismanagement of data scope is the number one cause of bugs in OpenMP
programming.

The Core Concept: Shared vs. Private

When you initiate a parallel region in OpenMP (e.g., with #pragma omp parallel), you create a team of threads.
These threads exist in a shared memory space, but they each have their own execution stack. This leads
to two fundamental data scopes:

• Shared Data: A single instance of a variable exists in memory, and all threads can read from and
write to it. This is powerful for collaboration but is the main source of data races if not handled
carefully. Any variable declared outside the parallel region is, by default, shared.
• Private Data: Each thread gets its very own copy of the variable. One thread's private copy is
completely invisible and inaccessible to another thread. This is the safest way to work with
temporary or loop-specific variables. Variables declared inside the parallel region (or inside a loop
that is parallelized) are, by default, private.

The default Clause: Setting the Ground Rules

OpenMP allows you to define the default behavior for variables that you don't explicitly scope. The default
clause on a parallel directive can take one of three main arguments:

• default(shared): This is the most common default behavior in many OpenMP implementations. Any
variable not explicitly marked otherwise will be shared among all threads.
• default(private): Historically not a valid option for C/C++ (only recent versions of the standard allow it), so in practice you should not rely on making all variables private by default.
• default(none): This is the recommended practice for serious OpenMP development. It forces the
programmer to explicitly specify the data-sharing attribute for every single variable used within the
parallel region that is declared outside of it. While it requires more typing, it eliminates ambiguity
and forces you to think about data scope, preventing countless bugs.

Example using default(none):

C
int a, b, c;
#pragma omp parallel for default(none) shared(a) private(b, c)
for (int i = 0; i < N; i++) {
b = a + i; // OK: 'b' is private, 'a' is shared (read-only is safe)
c = b * 2; // OK: 'c' is private
// ...
}
If you forgot to scope a, b, or c, the compiler would throw an error, forcing you to fix it.

The Data-Sharing Attribute Clauses: Your Toolkit for Control

OpenMP provides a powerful set of clauses to precisely control the scope of each variable. These are
typically used with directives like omp parallel or omp for.

1. shared(list)

This clause explicitly declares that the variables in the list are to be shared among all threads. Since this
is often the default, it's most useful when you've set default(none).

• Use Case: When multiple threads need to access or update the same piece of data, such as a
large input array that is only being read, or a result array where each thread works on a different
section.
• Warning: Any shared variable that is written to concurrently by multiple threads must be protected
by synchronization (atomic, critical, reduction, or locks) to prevent data races.

2. private(list)

This is the most crucial clause for preventing data races with temporary variables. It declares that each
thread should get its own private, uninitialized copy of the variables in the list.

• Use Case: For temporary variables or counters used only within a single loop iteration.
• Key Point: The private copy is uninitialized. Its value is undefined at the start of the parallel region.
Likewise, any value assigned to the private copy inside the parallel region is discarded when the
thread finishes; it does not affect the original variable outside the region.

Example:

C
int temp = 5;
#pragma omp parallel for private(temp)

for (int i = 0; i < 4; i++) {
// 'temp' inside this loop is a new, uninitialized variable for each thread.
// It is NOT 5.
temp = i * 10;
printf("Thread %d, temp = %d\n", omp_get_thread_num(), temp);
}
// After the loop, the original 'temp' is still 5.
printf("Outside loop, temp = %d\n", temp);
3. firstprivate(list)

This clause behaves exactly like private but with one key difference: the private copy for each thread is
initialized with the value of the original variable from the master thread before the parallel region begins.

• Use Case: When threads need a starting value for a calculation that is based on a value computed
before the parallel section.

Example:

C
int start_value = 100;
#pragma omp parallel for firstprivate(start_value)
for (int i = 0; i < 4; i++) {
// Each thread's 'start_value' begins at 100.
printf("Thread %d, start_value = %d\n", omp_get_thread_num(), start_value);
start_value += i; // This only modifies the private copy
}
// After the loop, the original 'start_value' is still 100.
printf("Outside loop, start_value = %d\n", start_value);
4. lastprivate(list)

This clause is the counterpart to firstprivate. It creates an uninitialized private copy for each thread, but after
the parallel region, it copies the value from the thread that executed the sequentially last iteration back to
the original variable.

• Use Case: When you need to retrieve the result of a calculation that happened in the final iteration
of a loop.

Example:

C
int final_val;
#pragma omp parallel for lastprivate(final_val)
for (int i = 0; i < 1000; i++) {
final_val = i * i; // Each thread has its own 'final_val'
}
// After the loop, the 'final_val' from the thread that ran i=999
// is copied to the original 'final_val'.
// So, final_val will be 999 * 999 = 998001.
5. reduction(operator:list)

The reduction clause is a specialized, highly optimized solution for a common pattern of data sharing: when
multiple threads need to update a single variable using an associative mathematical operator (like +, *, -,
&, |).

• How it Works:
1. A private copy of the reduction variable is created for each thread.
2. This private copy is initialized to the identity value of the operator (0 for +, 1 for *, etc.).
3. Each thread performs its calculations on its private copy.

4. At the end of the region, OpenMP automatically and safely combines the values from all
private copies into the original global variable.
• Use Case: Summing an array, finding a maximum/minimum value, performing bitwise operations.
This should always be preferred over manual protection with atomic or critical for these operations
as it is much more efficient.

Example:

C
long long sum = 0;
#pragma omp parallel for reduction(+:sum)
for (long long i = 0; i < 1000000; i++) {
sum += data[i]; // No data race here! Each thread adds to its private sum.
}
// After the loop, all private sums are combined into the global 'sum'.
The threadprivate Directive

Distinct from the clauses above, threadprivate is a directive used to create a global variable that persists
across multiple parallel regions but remains private to each thread.

C
int counter;
#pragma omp threadprivate(counter)

int main() {
#pragma omp parallel
{
// Each thread gets its own 'counter' which is initialized once (usually to 0)
counter = omp_get_thread_num();
} // End of first parallel region

// ... other sequential code ...

#pragma omp parallel
{
// When the second parallel region starts, each thread's 'counter'
// still holds the value it was assigned in the first region.
// It is not re-initialized.
printf("Thread %d, my counter is %d\n", omp_get_thread_num(), counter);
}
return 0;
}
Performance Implication: False Sharing

Even with perfect data scoping, performance can be crippled by false sharing. This occurs when private
variables used by different threads happen to be located on the same cache line. When one thread writes
to its private variable, the hardware invalidates the entire cache line for all other threads, forcing them to
re-fetch it from main memory, even though they aren't using the same variable. This creates massive,
hidden synchronization overhead.

Solution: Manually pad your data structures to ensure that variables that will be accessed by different
threads are on different cache lines. This often involves adding unused buffer space.
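
A minimal sketch of that advice using C11 alignment rather than hand-counted pad bytes (assuming a 64-byte cache line; per_thread_slot_t and the 64-slot bound are placeholders for this example):

#include <stdalign.h>

/* Each slot is aligned to its own 64-byte cache line (its size also rounds up to 64),
   so writes from different threads never land on the same line. */
typedef struct {
    alignas(64) long local_sum;
} per_thread_slot_t;

per_thread_slot_t partial[64];   /* one padded slot per possible thread */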

By mastering these data-sharing concepts and clauses, you gain full control over your OpenMP programs,
enabling you to write parallel code that is not only correct but also scalable and efficient.

I. In OpenMP, controlling whether a variable is shared or
private across threads is critical for both correctness and
performance.

Why This Matters


When multiple threads access variables in parallel, you must clearly define:

• Shared variables (same memory for all threads)


• Private variables (each thread gets its own copy)

Improper management leads to:

• Race conditions
• Unexpected results
• Performance degradation

Default Behavior in OpenMP


Variable Type             | Default Scope
Loop control variables    | private
All others (outside loop) | shared

Shared vs Private – Key Differences


Feature  | shared                        | private
Memory   | One instance shared by all    | Separate copy per thread
Access   | Concurrent access = race risk | Safe, no conflicts
Lifetime | Shared across threads         | Exists only during thread's execution
Use-case | Shared config, arrays         | Loop counters, temp variables

Declaring Shared & Private Variables Explicitly


#pragma omp parallel shared(a) private(i, temp)

• a is visible to all threads


• i and temp are thread-local

Example: Without Private Clause (Wrong!)


#include <stdio.h>
#include <omp.h>

int main() {
int temp; // Shared by default
#pragma omp parallel for
for (int i = 0; i < 4; i++) {
temp = i * 2;
printf("Thread %d: temp = %d\n", omp_get_thread_num(), temp);
}
return 0;
}

Problem:

• All threads share temp


• Outputs are interleaved and unpredictable

Fixed Version: Use private(temp)


#include <stdio.h>
#include <omp.h>

int main() {
    int temp; // declared once, then given a private copy per thread via the clause
    #pragma omp parallel for private(temp)
    for (int i = 0; i < 4; i++) {
        temp = i * 2;
        printf("Thread %d: temp = %d\n", omp_get_thread_num(), temp);
    }
    return 0;
}

• Now each thread has its own temp, avoiding race conditions.

Scoping Clauses Summary


Clause             | Description
shared(var)        | Variable is shared across threads
private(var)       | Thread gets uninitialized local copy
firstprivate(var)  | Private, but initialized with original value
lastprivate(var)   | Final value from last iteration is saved
threadprivate(var) | Static thread-local variable across regions

Loop Control Variable


OpenMP automatically makes loop index variable private.

#pragma omp parallel for
for (int i = 0; i < N; i++) {
// 'i' is private by default
}

Firstprivate Example
int x = 5;

#pragma omp parallel firstprivate(x)
{
printf("Thread %d: x = %d\n", omp_get_thread_num(), x);
}

Each thread sees x = 5, but changes are not visible to others.

Lastprivate Example
int x = 0;

#pragma omp parallel for lastprivate(x)
for (int i = 0; i < 4; i++) {
x = i;
}
// x now holds the value from last iteration: 3

Common Mistakes
Mistake                                   | Explanation
Not declaring temp private                | Leads to race conditions
Using private but expecting initial value | Use firstprivate instead
Using shared where values are updated     | Use reduction or atomic instead

Best Practices
1. Declare loop counters as private explicitly, even though they are private by default.
2. Use reduction for shared counters.
3. Use firstprivate if you need an initialized value.
4. Avoid sharing temporary variables unless they’re read-only.

5. OpenMP: A Detailed Explanation of Loop Scheduling and Partitioning
In OpenMP, achieving high performance is not just about dividing work among threads; it's about dividing
it smartly. Loop scheduling and portioning is the mechanism that controls how the iterations of a parallel
loop are distributed among the threads in a team. The choice of schedule can have a dramatic impact on
performance, primarily by addressing the critical challenge of load balancing—ensuring that all threads
finish their work at roughly the same time, preventing cores from sitting idle while others are still busy.

OpenMP provides fine-grained control over this process through the schedule clause, which is applied to
for or parallel for directives.

Why is Loop Scheduling Necessary? The Load Balancing Problem

Imagine a parallelized loop where some iterations are computationally expensive while others are very
quick.

A Problematic Workload:

C
#pragma omp parallel for
for (int i = 0; i < 16; i++) {
// Imagine work_function(i) takes 'i' seconds to complete.
// Early iterations are fast, later iterations are very slow.
work_function(i);
}
If you have 4 threads, the default scheduling might divide the work like this:

• Thread 0 gets iterations: 0, 1, 2, 3 (Total work: ~6 seconds)


• Thread 1 gets iterations: 4, 5, 6, 7 (Total work: ~22 seconds)
• Thread 2 gets iterations: 8, 9, 10, 11 (Total work: ~38 seconds)
• Thread 3 gets iterations: 12, 13, 14, 15 (Total work: ~54 seconds)

The result? Thread 0 finishes in 6 seconds and then sits idle for the remaining 48 seconds while it waits
for Thread 3 to complete its massive workload. This is a classic example of load imbalance, and it
severely limits the speedup gained from parallelism. The schedule clause is the tool to solve this problem.

The schedule Clause Syntax

The clause is added to the for directive as follows:

#pragma omp for schedule(kind[, chunk_size])

• kind: The type of scheduling to use (e.g., static, dynamic).


• chunk_size: An optional positive integer that specifies the size of the "chunks" of iterations to be
distributed at a time.

The Kinds of Loop Scheduling

1. schedule(static[, chunk_size])

This is the simplest and often the default scheduling kind. The iterations are divided among threads before
the loop starts execution.

• How it works (without chunk_size): The iteration space is divided into roughly equal-sized blocks, one for each thread. This is known as block scheduling, e.g. schedule(static).
• How it works (with chunk_size): The iteration space is divided into contiguous chunks of size chunk_size. These chunks are then distributed to the threads in a round-robin fashion. This is known as interleaved or cyclic scheduling, e.g. schedule(static, 1).

Characteristics:

• Overhead: Very low. The work distribution is calculated once at the beginning of the loop. There
is no further communication or synchronization needed for scheduling.
• Load Balancing: Can be poor if the work per iteration is unpredictable or uneven.
• Data Locality: Generally good, as threads work on contiguous blocks of iterations, which can lead
to better CPU cache performance.

When to use static:

• When all loop iterations have a uniform, predictable workload. For example, simple array addition:
a[i] = b[i] + c[i].
• When minimizing scheduling overhead is the absolute top priority.
• Use static, 1 when there's a dependency risk with adjacent iterations, for example, to mitigate false
sharing.

Example: #pragma omp for schedule(static, 16)

2. schedule(dynamic[, chunk_size])

This schedule is designed to address load imbalance by distributing work as threads become available.

• How it works: The iteration space is divided into chunks. By default, chunk_size is 1. When a thread
finishes its current chunk, it requests the next available chunk from the OpenMP runtime. This
continues until all chunks are processed.

Characteristics:

• Overhead: High. Each time a thread requests a new chunk, it requires synchronization, which adds
significant overhead compared to static.
• Load Balancing: Excellent. Since threads that finish quickly can immediately grab new work, it
naturally balances uneven loads. No thread sits idle if there is still work to be done.
• Data Locality: Can be poor. A thread may process iteration 5, then 95, then 150, jumping all over
the data and leading to poor cache performance.

When to use dynamic:

• When the workload per iteration is highly variable, unpredictable, or unknown. For example,
processing items in a work queue where each item has a different complexity.
• When load balancing is more important than the overhead of scheduling.
• A larger chunk_size (e.g., dynamic, 16) is a good compromise, reducing the scheduling overhead
(fewer requests) while still providing good load balancing.

Example: #pragma omp for schedule(dynamic, 8)

3. schedule(guided[, chunk_size])

This is an adaptive schedule that attempts to combine the best of static and dynamic. It starts with large
chunks to reduce overhead and moves to smaller chunks to handle load balancing at the end of the loop.

• How it works: Threads dynamically grab chunks of work, but the chunk size decreases over time.
It starts with a large chunk (proportional to the number of remaining iterations divided by the number
of threads) and exponentially shrinks down to the optional chunk_size parameter (or 1 if not
specified).

Characteristics:

• Overhead: Medium. It's a compromise between the low overhead of static and the high overhead
of dynamic.
• Load Balancing: Very good. It provides the load balancing benefits of dynamic scheduling,
especially towards the end of the loop where imbalance is most likely to cause threads to go idle.
• Data Locality: Better than dynamic since threads initially work on larger, more contiguous blocks of
data.

When to use guided:

• A very strong general-purpose choice for loops with some, but not extreme, load imbalance.
• When you want good load balancing without the full overhead penalty of schedule(dynamic, 1).

Example: #pragma omp for schedule(guided)

4. schedule(auto)

This gives control to the OpenMP compiler and runtime system, allowing them to choose the "best"
schedule based on their own internal heuristics.

• How it works: The decision is made by the implementation. It might analyze the loop structure or
use runtime information to choose between static, dynamic, or guided.
• When to use it: When you trust the compiler to make a better decision than you can, or when you
want to write portable code that allows the system to optimize for the specific hardware it's running
on. The actual behavior can vary between compilers.

Example: #pragma omp for schedule(auto)

5. schedule(runtime)

This defers the scheduling decision from compile-time to runtime. The actual schedule to be used is
determined by the value of the OMP_SCHEDULE environment variable.

• How it works: You compile the code with schedule(runtime). Before running the executable, you set
the environment variable. For example: export OMP_SCHEDULE="dynamic,4".
• When to use it: This is extremely useful for tuning application performance without
recompiling. You can experiment with different scheduling strategies and chunk sizes on your
target machine to find the optimal settings for a specific problem.

Example: #pragma omp for schedule(runtime)
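For instance, a plausible tuning session compiles the loop once and then sweeps schedules from the shell (process_item is a hypothetical, variable-cost function):

C
// Compiled once; the actual schedule is read from OMP_SCHEDULE when the program starts.
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++) {
    process_item(i); // hypothetical work of unknown cost
}

Then, before each timing run, set a different value:

export OMP_SCHEDULE="static"      # run 1
export OMP_SCHEDULE="dynamic,8"   # run 2
export OMP_SCHEDULE="guided,4"    # run 3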

Summary Table

Schedule Kind   Overhead   Load Balancing            Best For...
static          Very Low   Poor (for uneven loads)   Perfectly uniform workloads.
dynamic         High       Excellent                 Highly irregular, unpredictable workloads.
guided          Medium     Very Good                 A great general-purpose choice for many non-uniform loops.
auto            Varies     Varies                    Letting the compiler/runtime decide.
runtime         Varies     Varies                    Performance tuning and experimentation without recompiling.


I. OpenMP: Loop Scheduling and Partitioning – Detailed Explanation

Goal
To efficiently divide loop iterations among threads in a way that:

• Maximizes parallelism
• Minimizes load imbalance
• Avoids idle threads

This is called scheduling, or iteration partitioning in OpenMP.

Why Loop Scheduling Matters


In OpenMP:

#pragma omp parallel for schedule(type[, chunk])


for (int i = 0; i < N; i++) {
// work(i);
}

The schedule clause determines how loop iterations are split across threads.

Three Main Scheduling Types


Type Description When to Use
static Divide iterations evenly and predictably Work per iteration is uniform
dynamic Assign iterations on demand Iteration time varies
guided Like dynamic, but shrinks chunk sizes Initially large chunks, then small

1. static
#pragma omp parallel for schedule(static)

• Divides N iterations into equal-sized chunks.


• Assigns them ahead of time to threads.
• Lowest overhead.

Example:

N = 12 Threads = 3 Chunks: 4 each
T0 0, 1, 2, 3
T1 4, 5, 6, 7
T2 8, 9, 10, 11

2. static, chunk_size

#pragma omp parallel for schedule(static, 2)

• Round-robin assignment of chunks.


• Great for uniform work, allows interleaving.

Threads = 3 Chunk = 2
T0: i = 0–1, 6–7
T1: i = 2–3, 8–9
T2: i = 4–5, 10–11

3. dynamic
#pragma omp parallel for schedule(dynamic, 2)

• Threads request new chunks dynamically.


• Good when execution time per iteration is uneven.
• Higher overhead due to runtime scheduling.

Thread 1: gets i = 0–1, then asks for more.
Thread 2: gets i = 2–3, then takes the next available chunk.

4. guided
#pragma omp parallel for schedule(guided)

• Like dynamic, but chunk size starts big and shrinks exponentially.
• Reduces scheduling overhead compared to dynamic.
• Often used when N is large.

5. auto and runtime

• auto: Let the compiler/runtime choose the best strategy.


• runtime: Uses scheduling set by OMP_SCHEDULE env variable.

export OMP_SCHEDULE="dynamic,4"

Real Example to Test Behavior


#include <omp.h>
#include <stdio.h>

int main() {
#pragma omp parallel for schedule(dynamic, 2)
for (int i = 0; i < 12; i++) {
printf("Thread %d is handling iteration %d\n", omp_get_thread_num(), i);
}
return 0;
}

Explanation:

• schedule(dynamic, 2) means each thread picks 2 iterations at a time as soon as it’s free.
• Output will vary per run due to runtime decisions.

Summary Table: Loop Scheduling Options


Schedule Chunking Use Case Overhead
static Fixed chunks Uniform work per iteration Low
dynamic On demand Irregular/unknown workloads Medium
guided Shrinking Large loops, variable costs Medium
auto Compiler Let system choose Unknown
runtime Env var Tune without code changes Varies

Best Practices
1. Use static for predictable iteration times.
2. Use dynamic or guided for unpredictable work per iteration.
3. Tune chunk_size — small = better balance, large = less overhead.
4. Use runtime + environment variable for experimentation without code recompilation.

6.OpenMP: A Detailed Explanation of the Effective Use of Reductions
In OpenMP, the reduction clause is a powerful and essential tool for correctly and efficiently performing one
of the most common parallel programming patterns: combining results from multiple threads into a single
shared variable. Naively attempting this pattern without reduction is a classic recipe for creating data-race
conditions. Understanding how and when to use reductions is fundamental to writing robust and high-
performance OpenMP code.

What is a Reduction? And Why is it Needed?

A reduction operation (or "folding") reduces a set of values down to a single result using a specific
mathematical or logical operator. Common examples include:

• Summing all the elements of an array.

• Finding the maximum or minimum value in a dataset.
• Calculating a logical AND/OR across a series of boolean flags.
• Multiplying elements to find a factorial or total product.

The problem in a parallel context is that these operations create a loop-carried flow dependence, leading
to a data race.

The "Wrong" Way (Creating a Data Race):

C
long long sum = 0;
#pragma omp parallel for
for (int i = 0; i < 1000000; i++) {
// DATA RACE! Multiple threads read and write 'sum' concurrently.
sum += array[i];
}
// The final 'sum' will be incorrect and unpredictable.
Here, multiple threads simultaneously execute sum += array[i]. This operation is not atomic; it involves
reading sum, adding to it, and writing it back. Threads will constantly overwrite each other's work, leading
to lost updates and a wrong answer.

One could solve this with a critical or atomic directive:

C
// Correct, but inefficient
#pragma omp parallel for
for (int i = 0; i < 1000000; i++) {
#pragma omp atomic
sum += array[i];
}
While this is correct, it's often inefficient. It forces threads to serialize their access to the sum variable,
creating a bottleneck and negating much of the benefit of parallelism, especially if the loop body is small.

The "Right" Way: Using the reduction Clause

The reduction clause provides an elegant and highly optimized solution.

Syntax: reduction(operator:list_of_variables)

How it Works Internally:

When OpenMP encounters the reduction clause, it performs the following steps:

1. Create Private Copies: For each thread in the team, OpenMP creates a new, private copy of the
reduction variable (e.g., sum).
2. Initialize Private Copies: Each private copy is initialized to the identity value for the specified
operator (e.g., 0 for +, 1 for *, the largest negative number for max).
3. Perform Local Computation: Each thread executes its portion of the loop iterations, but all
updates are made to its private copy only. Since no data is shared at this stage, there is no data
race and no need for synchronization.
4. Combine Results: After all threads have finished their loop iterations, OpenMP performs a final,
safe, and synchronized operation to combine the results from all private copies into the original,
global variable.

The Correct and Efficient Solution:

C
long long sum = 0;
// OpenMP handles everything: privatization, initialization, and final combination.
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 1000000; i++) {
// Each thread adds to its own private 'sum'
sum += array[i];
}
// After the loop, the global 'sum' is correct.
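To make those internal steps concrete, here is a hand-rolled sketch of roughly what the runtime does for reduction(+:sum); the real implementation may use an optimized combining tree, so treat this as illustrative only:

C
long long sum = 0;
#pragma omp parallel
{
    long long local_sum = 0; // steps 1-2: private copy, initialized to the identity (0 for +)
    #pragma omp for
    for (int i = 0; i < 1000000; i++) {
        local_sum += array[i]; // step 3: race-free updates to the private copy
    }
    #pragma omp critical
    sum += local_sum; // step 4: one synchronized combine per thread, not per iteration
}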
Key Operators and Use Cases

OpenMP supports a wide range of operators for the reduction clause:

Operator   Use Case                     Initial Value
+          Summation                    0
*          Product                      1
-          Subtraction                  0
&          Bitwise AND                  ~0 (all bits 1)
|          Bitwise OR                   0
^          Bitwise XOR                  0
&&         Logical AND                  true (1)
||         Logical OR                   false (0)
max        Maximum Value (comparison)   Smallest representable number
min        Minimum Value (comparison)   Largest representable number
Example: Finding the Maximum Value and its Location

A common task is to find not just the max value, but also its index. A reduction is perfect for the value, but
the index requires a bit more care.

C
#include <omp.h>
#include <stdio.h>
#include <float.h> // For DBL_MAX

int main() {
double data[] = {1.5, 9.2, 0.5, -1.0, 9.8, 4.3, 7.1};
int n = sizeof(data) / sizeof(data[0]);

double max_val = -DBL_MAX; // most negative finite double (DBL_MIN is the smallest positive value)
int max_loc = -1;

// We can't reduce 'max_loc' directly alongside the 'max' operator, so we find the
// maximum value with a reduction first and locate its index in a second pass below.
#pragma omp parallel for reduction(max:max_val)
for (int i = 0; i < n; i++) {
if (data[i] > max_val) {
max_val = data[i];
}
}

// Now that we know the true global maximum, we can find its location.
// This second loop can also be parallelized if 'n' is very large.
for (int i = 0; i < n; i++) {
if (data[i] == max_val) {
max_loc = i;
break; // Found the first occurrence
}
}

printf("Max value is %.2f at index %d\n", max_val, max_loc);
return 0;
}
Output: Max value is 9.80 at index 4

This two-step approach is often clearer and more efficient than trying to force a complex object into a
reduction.

When to Use Reductions: Best Practices

1. Prefer reduction over atomic or critical: If your operation fits one of the standard reduction operators,
always use the reduction clause. It expresses your intent more clearly and gives the OpenMP runtime
the freedom to use highly optimized, often hardware-specific, combining trees, which is much faster
than serialized locking.
2. Use for Loop-Wide Aggregations: Reductions are designed for when you are aggregating a
single value across the entire iteration space of a loop.
3. Ensure the Operation is Associative: The operator must be associative (e.g., (a + b) + c is the
same as a + (b + c)). The order in which OpenMP combines the private results is not guaranteed, so
the operation must be independent of order. Floating-point addition is technically not perfectly
associative, but in most real-world scenarios, the tiny precision differences are acceptable.
4. Keep the Loop Body Lean: The performance benefit of reduction is most pronounced when the
work inside the loop (the part being parallelized) is significant enough to overcome the overhead of
thread creation and the final combination step. For trivial loops, a sequential version might still be
faster.
5. Handling More Complex Reductions (Arrays/Structs): The reduction clause historically did not work on C-style arrays or structs. OpenMP 4.0 added user-defined reductions (declare reduction) and OpenMP 4.5 added reductions over array sections, but compiler support varies. A widely portable approach for reducing an array is still to use a temporary private array per thread and combine the results manually after the parallel loop, as sketched just after this list.
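A hedged sketch of that manual approach for a small histogram (data, n and NBINS are made-up names for illustration; an OpenMP 4.5+ compiler may instead accept reduction(+:hist[0:NBINS]) directly):

C
enum { NBINS = 16 };
int hist[NBINS] = {0};

#pragma omp parallel
{
    int local_hist[NBINS] = {0}; // temporary private array per thread
    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        local_hist[data[i] % NBINS]++; // race-free: only this thread touches local_hist (assumes data[i] >= 0)
    }
    #pragma omp critical
    for (int b = 0; b < NBINS; b++) {
        hist[b] += local_hist[b]; // combine once per thread after the loop
    }
}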

By effectively using the reduction clause, you can write parallel code that is not only safe from data races
but also highly scalable and efficient, cleanly expressing a fundamental parallel pattern to the Open-MP
runtime.

I. OpenMP: Effective Use of Reductions — Detailed Explanation

What Is a Reduction in OpenMP?


A reduction in OpenMP is a way to safely combine partial results from multiple threads into a single final result.

Problem Without Reduction:

If multiple threads update a shared variable, it causes a race condition.

Example: Unsafe Shared Accumulation


int sum = 0;
#pragma omp parallel for
for (int i = 0; i < N; i++) {
sum += i; // Race condition
}

• Multiple threads try to update sum at the same time.


• Leads to unpredictable results.

OpenMP Reduction Syntax


#pragma omp parallel for reduction(op : variable)

• op: operator like +, *, max, etc.


• variable: the shared variable to reduce

Corrected Version with reduction


int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += i; // Safe, each thread keeps its local sum
}

How It Works:

• Each thread gets a private copy of sum


• After the loop, OpenMP combines all private values using +

Supported Reduction Operators


Category     Operators
Arithmetic   +, -, *
Bitwise      &, |, ^
Logical      &&, ||
Min/Max      min, max (since OpenMP 3.1 in C/C++)

You can also define custom reductions (since OpenMP 4.0+)

Example: Finding Maximum Value
int max_val = INT_MIN; // INT_MIN comes from <limits.h>
#pragma omp parallel for reduction(max:max_val)
for (int i = 0; i < N; i++) {
if (A[i] > max_val)
max_val = A[i]; // Parallel-safe
}

Each thread:

• Tracks its own max_val


• OpenMP reduces them into the global maximum

Example: Multiplication
int product = 1;
#pragma omp parallel for reduction(*:product)
for (int i = 1; i <= N; i++) {
product *= i;
}

• Computes factorial of N in parallel


• Each thread multiplies its part → final result combined

How OpenMP Internally Handles Reduction


Step Description
Initialization Each thread gets a private initialized copy
Computation Threads update their local copy independently
Combination After loop, local values are reduced together
Final Update Master thread gets final result

Things to Watch For


Problem Solution
Using multiple variables List them all in reduction(...)
Want initialized value OpenMP takes care of default init (0 for +, 1 for *)
Need custom combine logic Use declare reduction (OpenMP 4.0+)
Complex loops (early break) Use parallel sections or restructure
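For the "custom combine logic" row, a hedged sketch of declare reduction (OpenMP 4.0+) for a made-up stats struct could look like this; the identifier stats_merge and the helper value_of() are illustrative, not standard names:

typedef struct { double sum; long count; } stats_t;

// Combiner merges two partial results (omp_in into omp_out);
// initializer sets each thread-private copy to the identity.
#pragma omp declare reduction(stats_merge : stats_t : \
        omp_out.sum += omp_in.sum, omp_out.count += omp_in.count) \
        initializer(omp_priv = (stats_t){0.0, 0})

stats_t s = {0.0, 0};
#pragma omp parallel for reduction(stats_merge : s)
for (int i = 0; i < N; i++) {
    s.sum   += value_of(i); // hypothetical per-item value
    s.count += 1;
}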

Multi-variable Reduction Example


int sum = 0, prod = 1;
#pragma omp parallel for reduction(+:sum) reduction(*:prod)

for (int i = 1; i <= N; i++) {
sum += i;
prod *= i;
}

• Reduces two variables (sum, prod) in one loop

Best Practices for Reduction in OpenMP


1. Always use reduction for shared counters or aggregators
2. Pick the correct operator
3. Avoid using critical or atomic for accumulation — reduction is faster
4. Use declare reduction for custom struct reductions (advanced)

Summary Table: OpenMP Reduction Essentials

Concept Description
Purpose Combine thread-local results safely
Syntax reduction(op:var)
Benefits No race conditions, fast, compiler-optimized
Use Cases sum, product, max/min, logical ops, counters
Advanced declare reduction for custom types

7.OpenMP: A Detailed Explanation of Work-sharing Constructs
In OpenMP, creating a team of threads with #pragma omp parallel is only the first step. To achieve meaningful
parallelism, you must divide the program's workload among those threads. Work-sharing constructs are
the primary mechanism for this. They are special directives that are placed inside a parallel region and
distribute the execution of the enclosed code block among the team members.

Crucially, a work-sharing construct does not create new threads. It assumes a team of threads already
exists and its sole purpose is to split a specific piece of work among them. These constructs are the heart
of most OpenMP applications, allowing for easy and powerful parallelization of common programming
patterns.

There are three main work-sharing constructs:

1. for: For parallelizing iterative loops (Data Parallelism).


2. sections: For assigning different, independent code blocks to different threads (Task Parallelism).
3. single: For designating a code block to be executed by only one arbitrary thread.

A key feature of these constructs is the implicit barrier at the end. By default, all threads will wait at the
end of a work-sharing construct until every thread in the team has finished its part of the work. This
synchronization is vital for correctness, ensuring that subsequent code is not executed until the shared
work is complete. This barrier can be disabled with the nowait clause if synchronization is not needed.


1. The for (or do) Work-sharing Construct

This is, by far, the most frequently used work-sharing construct in OpenMP. It is designed to split the
iterations of a for loop (or do loop in Fortran) among the threads in the team. This is a classic example of
data parallelism, where the same operation is performed on different pieces of data.

How it Works:

The for directive is placed immediately before a for loop. The OpenMP runtime automatically divides the
loop's iteration space (e.g., from 0 to N-1) and assigns a portion of the iterations to each thread.

Syntax:

C
#pragma omp parallel
{
// Other parallel code can go here
#pragma omp for [clauses]
for (int i = 0; i < N; i++) {
// Code inside the loop
}
// Implicit barrier: All threads wait here until the loop is fully complete.
}
Combined Directive:

For convenience, OpenMP allows a combined directive that both creates the parallel region and applies
the work-sharing for construct in one line. This is the most common usage pattern.

C
#pragma omp parallel for [clauses]
for (int i = 0; i < N; i++) {
// Code inside the loop
}
Key Clauses:

• schedule(kind, chunk_size): Controls how iterations are divided (e.g., static, dynamic, guided). This is
crucial for load balancing.
• private, firstprivate, lastprivate: Manage the data environment for variables used within the loop.
• reduction(operator:list): Safely performs reduction operations (e.g., summing into a shared variable).
• nowait: Removes the implicit barrier at the end of the loop.

Use Case: Any loop where the iterations are independent of each other. A perfect example is processing
elements of an array.

Example:

C
double a[1000], b[1000], c[1000];
// Initialize a and b...

#pragma omp parallel for


for (int i = 0; i < 1000; i++) {
c[i] = a[i] + b[i]; // Each iteration is independent
}
// Barrier ensures that all of 'c' is computed before proceeding.

2. The sections Work-sharing Construct


The sections construct handles a different kind of parallelism: task parallelism. It is used when you have
a few, distinct, and independent blocks of code that can be executed concurrently.

How it Works:

The sections construct encloses a set of #pragma omp section blocks. The OpenMP runtime assigns each
section block to a different thread in the team. If there are more threads than sections, the extra threads
will do nothing and skip to the barrier at the end. If there are more sections than threads, threads will
execute multiple sections until all are complete.

Syntax:

C
#pragma omp parallel
{
#pragma omp sections [clauses]
{
#pragma omp section
{
// Code block A (e.g., perform_task_A())
}

#pragma omp section


{
// Code block B (e.g., perform_task_B())
}

#pragma omp section


{
// Code block C (e.g., perform_task_C())
}
}
// Implicit barrier: All threads wait here until A, B, and C are done.
}
Key Clauses:

• private, firstprivate, lastprivate, reduction: Work just as they do for the for construct, managing data for the
enclosed sections.
• nowait: Removes the implicit barrier.

Use Case: Functional decomposition, such as in a pipeline or when performing unrelated calculations that
can happen at the same time. For example, calculating an average, a median, and a standard deviation
on the same dataset can be done in three parallel sections.

Example:

C
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
printf("Task A running on thread %d\n", omp_get_thread_num());

#pragma omp section


printf("Task B running on thread %d\n", omp_get_thread_num());
}
}
Possible Output (order is not guaranteed):

Task B running on thread 1
Task A running on thread 0

3. The single Work-sharing Construct

The single construct is a specialization that ensures a block of code is executed by only one of the threads
in the team—whichever one reaches it first. The other threads skip the block and wait at the implicit barrier
at its end.

How it Works:

The first thread to encounter the single directive will execute the code block. All other threads will bypass
it.

Syntax:

C
#pragma omp parallel
{
// Code executed by all threads
#pragma omp single [clauses]
{
// This block is executed by only one thread.
printf("I am the one and only thread %d executing this.\n", omp_get_thread_num());
}
// Implicit barrier: All threads wait here until the single block is complete.
}
Key Clauses:

• private, firstprivate: Can be used to manage data within the single block.
• nowait: Allows other threads to proceed without waiting for the single thread to finish. This is very
commonly used.

Use Case:

• Initialization: Performing a setup task that only needs to happen once within a parallel region.
• Input/Output: Printing status messages or reading input in the middle of parallel work.
• Finalization: Committing results or finalizing a data structure after a parallel computation.

Example: I/O Operation

C
#pragma omp parallel
{
// Each thread does some work...
do_work();

// Now, have one thread print a progress update without stopping the others.
#pragma omp single nowait
{
printf("Work is 50%% complete.\n");
}

// Threads continue doing other work immediately because of 'nowait'.


do_more_work();
}
A Note on master

The #pragma omp master directive is very similar to single, but with two key differences:
1. Who Executes: The code is only ever executed by the master thread (thread ID 0).
2. No Implicit Barrier: The master directive has no implicit barrier at the end. Other threads do not
wait for the master to finish.

master is often used for the same tasks as single (like I/O), but it's less flexible as it hard-codes the execution
to thread 0. single is generally preferred unless there is a specific reason to tie an action to the master
thread.
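A minimal sketch of that difference; note the explicit barrier that master does not provide on its own:

C
#pragma omp parallel
{
    #pragma omp master
    {
        // Runs only on thread 0; the other threads do NOT wait here.
        printf("Progress report from thread %d\n", omp_get_thread_num());
    }

    // Because master has no implied barrier, add one explicitly if the
    // following work must not start before the report is printed.
    #pragma omp barrier

    do_more_work(); // as in the earlier example
}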

By combining these work-sharing constructs, a programmer can effectively orchestrate complex parallel
operations, matching the structure of the code (loops, independent tasks) to the appropriate OpenMP
directive.

What Are Work-Sharing Sections?


OpenMP sections are used to divide different tasks (not just loop iterations) among threads. This is called task
parallelism (as opposed to loop/data parallelism).

Each section represents a different block of code. OpenMP assigns each section to a different thread for parallel
execution.

Syntax Overview
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
// Task 1
}

#pragma omp section


{
// Task 2
}

// Add more sections as needed...


}
}

Simple Example
#include <stdio.h>
#include <omp.h>

int main() {
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
printf("Task A by thread %d\n", omp_get_thread_num());
}

#pragma omp section
{
printf("Task B by thread %d\n", omp_get_thread_num());
}
}
}

return 0;
}

Explanation:

• Two independent tasks: A and B.


• OpenMP assigns one task to each available thread.
• If more threads than sections: remaining threads are idle.
• If more sections than threads: threads pick remaining sections after finishing their first.

Use Case Scenarios


Task Type Example
I/O & compute Read data and compute in parallel
Multiple filters Apply Gaussian, Sobel, and Sharpen
Multi-device Control CPU and GPU code separately

Sections vs Parallel For


Feature sections parallel for
Task style Task parallelism (different work) Data parallelism (same loop)
Thread control Explicit task blocks Implicit loop split
Usage Non-iterative tasks Iterative computations

Optional: nowait Clause


By default, OpenMP waits at the end of a sections block.

#pragma omp sections nowait

Use nowait to let threads move ahead without waiting for others to finish their sections.

Common Mistakes
Mistake Explanation
Missing #pragma omp section Each task must be labeled as a section
Code inside sections but outside section That part is executed by every thread

Too few threads Unused sections are not executed
Using sections inside parallel for Invalid: mixing parallel constructs wrongly

Advanced Example: 3 Tasks


#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{ printf("Thread %d is handling loading\n", omp_get_thread_num()); }

#pragma omp section


{ printf("Thread %d is handling processing\n", omp_get_thread_num()); }

#pragma omp section


{ printf("Thread %d is handling storing\n", omp_get_thread_num()); }
}
}

Each thread picks up one of the 3 different tasks: load, process, store.

Best Practices
1. Use sections for different tasks, not loops.
2. Make sure each task is properly marked with #pragma omp section.
3. Consider nowait if subsequent code doesn’t need to wait.
4. Use nested parallelism if each section itself can be parallelized.

Summary Table
Feature Description
Purpose Divide non-loop tasks across threads
Directive #pragma omp sections + #pragma omp section
Inside parallel Must be inside #pragma omp parallel
Synchronization Implicit barrier unless nowait is used
Use for Independent logic blocks, not loop iterations


8.OpenMP: A Detailed Guide to Compilation and Debugging
Writing OpenMP code is only half the battle; to bring a parallel program to life, you must know how to
compile it correctly and, more importantly, how to debug it when things go wrong. Debugging parallel
applications introduces challenges not found in sequential programming, such as race conditions and
deadlocks, which require specific tools and techniques.

Compilation: Enabling OpenMP Support

By default, a standard C, C++, or Fortran compiler will ignore #pragma omp directives, treating them as
simple comments. To enable OpenMP, you must explicitly pass a specific flag during compilation.

GCC (GNU Compiler Collection) / Clang

For the most common open-source compilers, GCC (g++) and Clang (clang++), the flag is -fopenmp.

Command Line Syntax (GCC/g++):

Bash
# To compile and create an executable named 'my_program'
g++ -fopenmp my_program.cpp -o my_program

# For C
gcc -fopenmp my_program.c -o my_program
Command Line Syntax (Clang/clang++):

Clang's OpenMP runtime is provided by the LLVM project. The compilation flag is the same.

Bash
# To compile with Clang
clang++ -fopenmp my_program.cpp -o my_program
When you use -fopenmp, the compiler does two main things:

1. Parses Pragmas: It recognizes and interprets the #pragma omp directives to generate threaded
code.
2. Links the Runtime Library: It automatically links the necessary OpenMP runtime library (like
libgomp for GCC), which manages thread creation, scheduling, and synchronization.

Intel C++/Fortran Compilers (icx/ifx)

For the modern LLVM-based Intel compilers, the flag is -fiopenmp. For the classic Intel compilers (icc/ifort),
the flag was -qopenmp.

Command Line Syntax (icx):

Bash
icx -fiopenmp my_program.cpp -o my_program

Execution: Controlling Your Parallel Program

Once compiled, the executable is run like any other program. However, the OpenMP runtime's behavior
can be controlled through environment variables. The most important of these is OMP_NUM_THREADS.

OMP_NUM_THREADS

This variable tells the OpenMP runtime how many threads to create for the parallel regions.

How to Use it (Linux/macOS):

Bash
# Set the number of threads to 4 for the next command
export OMP_NUM_THREADS=4

# Run your program


./my_program
The program will now execute its parallel regions with a team of 4 threads.

Best Practices:

• If OMP_NUM_THREADS is not set, the OpenMP runtime will typically default to using the number of
available hardware cores on the system.
• You can also set the number of threads within the code using the omp_set_num_threads() function, but
using the environment variable is often more flexible for testing and deployment.
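As a small sketch of the API route mentioned above (omp_set_num_threads() plus a query of the resulting team size):

C
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4); // overrides OMP_NUM_THREADS for subsequent parallel regions
    #pragma omp parallel
    {
        #pragma omp single
        printf("Running with a team of %d threads\n", omp_get_num_threads());
    }
    return 0;
}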

Debugging: Tackling Parallel Bugs

Debugging OpenMP is challenging because bugs are often non-deterministic. A program might work
correctly ten times with 2 threads, but fail on the eleventh run with 4 threads due to a subtle timing-
dependent race condition.

1. Common Types of OpenMP Bugs

• Data Race: The most common bug. Multiple threads access a shared variable without proper
synchronization, and at least one access is a write. (See the "Managing Shared and Private Data"
explanation for solutions).
• Deadlock: A situation where two or more threads are blocked forever, each waiting for the other to release a resource. This can happen with improper use of locks or ordered critical sections (a minimal sketch follows this list).
• False Sharing: A performance bug, not a correctness bug. Private data of different threads
happens to reside on the same cache line, causing performance degradation due to cache
invalidations.
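As a hedged illustration of the deadlock case, here is a minimal sketch of two OpenMP locks acquired in opposite order by two sections (the function name is made up; this is an example of what not to do):

C
#include <omp.h>

omp_lock_t lock_a, lock_b;

void deadlock_prone(void) {
    omp_init_lock(&lock_a);
    omp_init_lock(&lock_b);

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            omp_set_lock(&lock_a);
            omp_set_lock(&lock_b); // may block forever: the other section holds lock_b
            omp_unset_lock(&lock_b);
            omp_unset_lock(&lock_a);
        }
        #pragma omp section
        {
            omp_set_lock(&lock_b);
            omp_set_lock(&lock_a); // may block forever: the other section holds lock_a
            omp_unset_lock(&lock_a);
            omp_unset_lock(&lock_b);
        }
    }

    omp_destroy_lock(&lock_a);
    omp_destroy_lock(&lock_b);
}

The standard cure is to always acquire multiple locks in one agreed-upon global order.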

2. Basic Debugging Techniques

a) "Printf" Debugging (With a Caveat)

Using printf can be a quick first step, but it has pitfalls in a parallel context. Standard output is a shared
resource, and having multiple threads print simultaneously can result in jumbled, unreadable output.

Thread-Safe Printing: To make printf useful, wrap it in a critical section to ensure only one thread can print
at a time.

C++
#pragma omp parallel
{
int thread_id = omp_get_thread_num();
// ... do some work ...

#pragma omp critical


{
// This block ensures the output is not interleaved.
std::cout << "Thread " << thread_id << " processed item " << i << std::endl;
}
}
b) Compiler Flags for Debugging

When debugging, always compile with the -g flag to include debugging symbols, which are essential for
debuggers like GDB.

Bash
g++ -fopenmp -g my_program.cpp -o my_program
3. Using a Standard Debugger (GDB)

GDB (the GNU Debugger) has support for debugging threaded programs, including OpenMP.

Key GDB Commands for Threads:

• info threads: Lists all threads currently running in your program, their ID, and what line of code they
are executing.
• thread <ID>: Switches the GDB context to a specific thread. You can then inspect its call stack ( bt)
and local variables (p var).
• break <file>:<line> thread <ID>: Sets a breakpoint that only triggers for a specific thread.
• set scheduler-locking on: Freezes all other threads when the current thread hits a breakpoint. This is
extremely useful for examining the state of the entire application at a specific moment without
other threads changing things under the hood.

Example GDB Session:

Bash
# Start GDB
gdb ./my_program

(gdb) # Set a breakpoint in a parallel region


(gdb) break my_program.cpp:25

(gdb) # Run the program with 4 threads


(gdb) set environment OMP_NUM_THREADS 4
(gdb) run

# Program hits the breakpoint


[Switching to Thread 0x7ffff7dcb700 (LWP 12345)]
Breakpoint 1, main () at my_program.cpp:25

(gdb) # See what all threads are doing


(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7... main () at my_program.cpp:25
2 Thread 0x6... main () at my_program.cpp:25
3 Thread 0x5... main () at my_program.cpp:25
4 Thread 0x4... main () at my_program.cpp:25

(gdb) # Switch to thread 3 and inspect a private variable


(gdb) thread 3
(gdb) p my_private_var
4. Advanced Debugging and Race Detection Tools

For complex race conditions, GDB might not be enough. Specialized tools are designed to detect these
issues automatically.

a) Valgrind (Helgrind and DRD)


Valgrind is a framework for dynamic analysis. Its Helgrind and DRD tools are specifically designed to
detect threading errors, including data races and mismanagement of locks.

How to Use:

Bash
# Compile with -g, then run your program through Valgrind/Helgrind
valgrind --tool=helgrind ./my_program
Helgrind will monitor memory accesses and report any potential data races it finds, complete with stack
traces showing where the conflicting accesses occurred. This is one of the most powerful tools for finding
race conditions.

b) Intel® Inspector

This is a commercial graphical tool that is excellent at finding both correctness errors (races, deadlocks)
and performance issues (false sharing) in threaded applications. It provides a rich user interface to
visualize where and why a race condition is happening.

c) Compiler Sanitizers

Modern compilers like Clang and GCC have built-in "sanitizers" that can detect threading errors at runtime.

• ThreadSanitizer (-fsanitize=thread): Add this flag during compilation. When you run the program,
the sanitizer will halt execution and print a detailed report if it detects a data race. This is very
effective but does add significant runtime overhead.

Example with ThreadSanitizer:

Bash
# Compile with the sanitizer enabled
g++ -fopenmp -g -fsanitize=thread my_program.cpp -o my_program_sanitized

# Run the program


./my_program_sanitized
By combining these compilation techniques with a structured approach to debugging—from simple print
statements to powerful dynamic analysis tools—you can effectively build and validate correct and robust
OpenMP applications.

I. OpenMP: Compilation & Debugging – Detailed Explanation

What Is OpenMP?
OpenMP (Open Multi-Processing) is a compiler-based API for shared-memory parallel programming in C, C++,
and Fortran.

It uses pragmas/directives (#pragma omp) that are processed by the compiler to generate multithreaded code.

COMPILATION of OpenMP Code


OpenMP requires a compiler that supports OpenMP, such as:

Compiler OpenMP Flag


GCC / Clang -fopenmp
MSVC (Windows) /openmp
Intel ICC -qopenmp

GCC Compilation Example (C)


gcc -fopenmp my_program.c -o my_program

• -fopenmp: Enables OpenMP processing.


• Links against the libgomp (GNU OpenMP library).

GCC Compilation Example (C++)


g++ -fopenmp my_program.cpp -o my_program

Makefile Snippet Example


CC = gcc
CFLAGS = -fopenmp -O2

my_program: my_program.o
$(CC) $(CFLAGS) -o my_program my_program.o

my_program.o: my_program.c
$(CC) $(CFLAGS) -c my_program.c

DEBUGGING OpenMP Programs


Debugging multithreaded OpenMP code is challenging due to:

• Non-deterministic behavior
• Race conditions
• Deadlocks

1. Print Thread ID
#include <omp.h>
printf("Thread %d out of %d\n", omp_get_thread_num(), omp_get_num_threads());

Use this to trace thread behavior.

2. Use Environment Variables


You can control runtime behavior with these:

Variable Description
OMP_NUM_THREADS Sets number of threads
OMP_SCHEDULE Controls loop scheduling (if runtime)
OMP_DYNAMIC Allows dynamic thread adjustment
OMP_STACKSIZE Controls thread stack size
export OMP_NUM_THREADS=4
export OMP_SCHEDULE="dynamic,2"

3. Use Debugging Tools

GDB (GNU Debugger)

Compile with -g:

gcc -fopenmp -g my_program.c -o my_program


gdb ./my_program

Then use commands:

break main
run
bt # Backtrace
info threads
thread 2 # Switch to thread 2

4. Debug Race Conditions with Thread Sanitizer

Compile with ThreadSanitizer:

gcc -fopenmp -fsanitize=thread -g my_program.c -o my_program


./my_program

If there's a data race, it will print a warning and source line!

-fsanitize=thread works best with Clang and newer GCC versions.

5. Enable Verbose Output


export OMP_DISPLAY_ENV=TRUE
./my_program

This shows OpenMP's runtime environment, threads, stack size, affinity, etc.

Example: Debugging Data Race


int sum = 0;
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
sum += i;
}

Unsafe due to race condition.

Fix it:

#pragma omp parallel for reduction(+:sum)

Summary Table
Task How to Do It
Enable OpenMP -fopenmp compiler flag
Set thread count export OMP_NUM_THREADS=4
Debug with GDB gdb ./my_program + info threads
Detect data races -fsanitize=thread
Schedule tuning export OMP_SCHEDULE="dynamic,2"
View thread layout export OMP_DISPLAY_ENV=TRUE

9.OpenMP Performance: A Detailed Explanation


Achieving high performance with OpenMP is more nuanced than simply adding #pragma omp parallel for to a
loop. It's a balancing act of parallelizing the right code, managing data effectively, minimizing overhead,
and understanding the interplay between your software and the underlying hardware. A poorly
implemented OpenMP program can even run slower than its sequential counterpart.

This detailed explanation covers the critical factors that govern OpenMP performance and the techniques
to optimize them.

1. The Theoretical Limit: Amdahl's Law

Before diving into code, it's crucial to understand the theoretical ceiling on performance improvement.
Amdahl's Law states that the maximum speedup of a program is limited by its sequential fraction.

The formula for speedup is:

S = 1 / ((1 − P) + P / N)

Where:

• S is the theoretical speedup.


• P is the proportion of the program that can be parallelized (e.g., 0.9 for 90%).
• N is the number of processor cores.

Implication: If only 80% of your program can be parallelized (P=0.8), even with an infinite number of cores
(N→∞), the maximum speedup you can achieve is 1/(1−0.8)=5 times.

Practical lesson: Focus your efforts on parallelizing the most time-consuming parts of your application
(the "hotspots"). Profiling your code before parallelization is essential to identify where to spend your time.

2. Load Balancing: Keeping All Cores Busy

The single most important practical factor in OpenMP performance is load balancing. If one thread is
assigned significantly more work than others, the remaining threads will finish early and sit idle, wasting
computational resources. The total execution time is dictated by the last thread to finish.

The Challenge: Uneven workloads in loop iterations. For example, processing a triangular matrix or a
loop where the work depends on the input data.

The Solution: The schedule Clause

OpenMP's schedule clause on a for loop dictates how iterations are distributed among threads. Choosing
the right one is critical.

• schedule(static): Low overhead. The work is divided into contiguous chunks and assigned before the
loop starts.
o Best for: Loops where every iteration takes the same amount of time.
o Performance Trap: Can lead to terrible load imbalance if workloads are uneven.
• schedule(dynamic): High overhead. Threads request a chunk of iterations as they become free.
o Best for: Loops with unpredictable or highly variable iteration costs. It provides excellent
load balancing.
o Performance Trap: The overhead of threads constantly requesting work can be significant.
Using a larger chunk size (dynamic, 16) can mitigate this.
• schedule(guided): Adaptive. Starts with large chunks and progressively makes them smaller.
o Best for: A great general-purpose choice that balances low overhead at the start with fine-
grained load balancing at the end. Often a good first choice for non-uniform loops.

3. Data Management and Locality

How threads access data is as important as how they perform computations. Modern CPUs are orders of
magnitude faster than main memory, making efficient use of CPU caches paramount.

The Challenge: False Sharing

This is a pernicious and hidden performance killer. It occurs when:

1. Different threads write to private variables.


2. These variables happen to reside on the same cache line (the smallest unit of memory a CPU can
fetch, typically 64 bytes).

When one thread writes to its variable, the hardware's cache coherency protocol invalidates the entire
cache line for all other threads. Even though the other threads aren't using that specific variable, they are
forced to discard the line and perform a slow re-fetch from main memory when they need to access their
own data on that same line.

The Solution: Padding

Ensure that data that will be written to by different threads is separated by at least one cache line's worth
of memory.

Example: A common mistake

C
// Potential false sharing
int results[NUM_THREADS];
#pragma omp parallel
{
int id = omp_get_thread_num();
results[id] = do_calculation(); // Writing to adjacent memory locations
}
Example: With padding

C
// Padded to avoid false sharing
struct padded_int {
int value;
char padding[60]; // 64-byte cache line - 4 bytes for int
};
struct padded_int results[NUM_THREADS];

#pragma omp parallel


{
int id = omp_get_thread_num();
results[id].value = do_calculation();
}
4. Minimizing Overhead

Parallelism is not free. OpenMP incurs overhead from several sources:

• Thread Creation/Destruction: Starting and stopping a parallel region has a cost.


• Synchronization: Barriers, critical sections, and atomic operations cause threads to wait.
• Scheduling: Dynamic scheduling has higher overhead than static.

Optimization Techniques:

• Parallelize at the Outermost Level: It is much more efficient to parallelize one outer loop than to
repeatedly create and destroy threads inside an inner loop.

// Good: One parallel region


#pragma omp parallel for
for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) { /* ... */ }
}

// Bad: N parallel regions


for (int i = 0; i < N; i++) {
#pragma omp parallel for
for (int j = 0; j < M; j++) { /* ... */ }
}
• Use the nowait Clause: If the synchronization provided by an implicit barrier at the end of a for or
sections construct is not needed, eliminate it with nowait to allow threads to proceed independently.

5. Efficient Synchronization

Synchronization is necessary for correctness but is also a major source of performance degradation
because it forces threads to wait.

The Hierarchy of Synchronization (from most to least efficient):

1. No Synchronization (Privatization): The fastest approach is to avoid sharing entirely. If each thread can work on a private copy of the data, there is no need to synchronize.
2. reduction: This clause is highly optimized for common operations like sum, max, min, etc. It works
by creating private copies and combining them with a single, efficient synchronization step at the
end. Always prefer reduction over manual locking for these patterns.
3. atomic: Use this for protecting single, simple update statements (e.g., x++, y -= z). It is more lightweight than a critical section as it may be implemented with special hardware instructions (atomic and critical are contrasted in the sketch after this list).
4. critical: This protects a larger block of code, ensuring only one thread can enter it at a time. It has
higher overhead than atomic and can become a major bottleneck if the critical section is large or
frequently contested.
5. Locks: OpenMP locks (omp_lock_t) are the most general but also the most complex and potentially
slowest mechanism. Use them only when the logic is too complex for the other constructs.
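A minimal sketch contrasting the atomic and critical levels of this hierarchy (compute() and n are hypothetical):

C
int hits = 0;       // simple counter: a single update, atomic is enough
double worst = 0.0; // conditional update of a shared value: needs critical (or a max reduction)

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    double v = compute(i); // hypothetical, variable-cost work

    #pragma omp atomic
    hits++; // lightweight: one simple statement

    #pragma omp critical
    {
        if (v > worst) // the whole compare-and-update block runs one thread at a time
            worst = v;
    }
}

For this particular pattern, reduction(max:worst) would be both simpler and faster; critical is shown only to illustrate protecting a compound update.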

6. Hardware Affinity and Environment

How threads are mapped to physical CPU cores can significantly impact performance, especially on multi-
socket NUMA (Non-Uniform Memory Access) systems.

• Thread Affinity: This is the practice of "binding" or "pinning" a thread to a specific CPU core. This
can improve cache performance by ensuring a thread consistently runs on the same core, keeping
its data hot in that core's L1/L2 cache.
• Environment Variables:
o OMP_NUM_THREADS: Setting this to a value greater than the number of available physical
cores often leads to performance degradation due to thread-switching overhead (thrashing).
o OMP_PROC_BIND / GOMP_CPU_AFFINITY: These can be used to control how threads are bound
to cores (e.g., close to bind them to adjacent cores, spread to spread them out across sockets).
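One plausible combination on an 8-core machine (values are illustrative, not prescriptive):

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close   # keep threads on neighbouring cores
export OMP_PLACES=cores      # one OpenMP place per physical core
./my_program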

By systematically addressing these areas—profiling to find hotspots, ensuring good load balance with the
right schedule, managing data to maximize cache hits and avoid false sharing, minimizing overhead, and
using the most efficient synchronization strategy—you can transform a simple parallelized program into a
truly high-performance OpenMP application.

I. OpenMP: Performance — Detailed Explanation


OpenMP enables shared-memory parallelism in C, C++, and Fortran. But writing correct code isn't enough — you
must also write high-performance parallel code. This requires understanding how OpenMP interacts with threads,
memory, CPU cores, and synchronization.

Key Factors That Affect OpenMP Performance


Category Factor
Thread behavior Number of threads, thread affinity
Scheduling & workload Loop balance, scheduling type
Synchronization overhead Locks, barriers, atomics, critical sections
Memory locality False sharing, NUMA access patterns
Granularity Task/loop size too small or too big

1. Thread Count and OMP_NUM_THREADS


Using too few threads doesn’t use full CPU.
Using too many threads may add overhead.

export OMP_NUM_THREADS=4

Set this to match the physical or logical cores of your system.

Use omp_get_num_threads() at runtime to verify thread count.

2. Loop Scheduling Type and Load Balancing


Choose correct scheduling to avoid idle threads or imbalanced workloads.

Schedule Type Best for Overhead


static Equal work per iteration Low
dynamic Varying iteration times Medium
guided Decreasing chunk sizes (big loops) Medium

Example:
#pragma omp parallel for schedule(dynamic, 2)

→ Helps when some loop iterations take longer than others.

3. Synchronization Overhead
Synchronization can become a performance bottleneck.

Method Overhead Use when…


reduction Low For summing, counting
atomic Medium For simple shared updates
critical High For complex exclusive access

Avoid unnecessary use of #pragma omp critical — it serializes the code.

4. Memory Access Patterns
Poor memory layout leads to cache misses and false sharing.

False Sharing Example:


int A[4];
#pragma omp parallel for
for (int i = 0; i < 4; i++) {
A[i]++; // Each A[i] may lie on same cache line → BAD
}

Fix:

Add padding between elements or use thread-private arrays.

5. Granularity of Work
• Too little work per thread = overhead dominates.
• Too much work per thread = poor load balance.

Rule of Thumb:

Each thread should do at least ~1000–10000 instructions to amortize overhead.

6. Affinity and Thread Pinning


Use OMP_PROC_BIND=true or KMP_AFFINITY=granularity=fine,compact to bind threads to cores.

This improves:

• Cache reuse
• Memory locality
• Performance consistency

Benchmarking Performance
Measure execution time with omp_get_wtime():
double t1 = omp_get_wtime();

// parallel region

double t2 = omp_get_wtime();
printf("Time taken: %f seconds\n", t2 - t1);

Tools for Performance Tuning
Tool Purpose
gprof, perf General profiling
Intel VTune Advanced profiling and thread analysis
ThreadSanitizer Detect race conditions
LIKWID Cache and memory bandwidth profiling
OMP_DISPLAY_ENV Show thread layout and settings

Best Practices for Performance in OpenMP


Tip Why It Helps
Minimize critical sections Avoid serialization
Use reduction for sums Fast, no locking needed
Schedule dynamically if needed Better load balance
Keep data local to threads Improves cache usage
Avoid false sharing Prevents cache line invalidation
Match threads to physical cores Maximize parallelism
Benchmark and tune iteratively Profile → Analyze → Optimize

Summary Table
Factor Recommendation
Threads Match to CPU cores
Scheduling Choose based on loop balance
Synchronization Use reduction, avoid critical
Memory behavior Prevent false sharing
Task size (granularity) Enough work per thread
Affinity Use core binding for locality
