
Unit 2: OpenMP Programming for Shared Memory Systems
Introduction to Shared Address Space Platforms
● Definition: Platforms where memory is accessible to multiple processors
● Implicit communication through shared memory
● Focus on concurrency and synchronization constructs

Importance of Parallel Programming


● Specification of parallel tasks and interactions
● Synchronization and communication of intermediate results
● Goal: Efficient and scalable parallel execution
Key Concepts in Shared Address Space Programming

● Concurrency: Multiple tasks running simultaneously, potentially interacting with
each other.
● Synchronization: Coordinating task execution to ensure correct results (e.g.,
preventing two tasks from writing to the same memory at the same time).
● Communication: Implicit, through shared memory, without the need for explicit
data transfers.
● Overhead Minimization: Reducing the performance cost of managing
synchronization and context switches.
Process-Based Models
● Overview: Separate processes with isolated memory spaces, communicating via
explicit mechanisms.
● Data Sharing: Achieved using system calls like:
○ shmget: Allocate shared memory
○ shmat: Attach shared memory to process address space
● Advantages:
○ Strong memory protection
○ Suitable for distributed systems
● Disadvantages:
○ High overhead due to context switching
○ Complex explicit communication
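
The shmget/shmat calls listed above can be combined into a small program. Below is a minimal
sketch, assuming a POSIX/System V IPC environment; the fork()-based parent/child structure and
the 0600 permissions are illustrative choices, not part of the slide material.

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* shmget: allocate a shared segment large enough for one int */
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    /* shmat: attach the segment to this process's address space */
    int *shared = (int *) shmat(shmid, NULL, 0);
    *shared = 0;

    if (fork() == 0) {              /* child inherits the attached segment */
        *shared = 42;               /* communicate through shared memory   */
        shmdt(shared);
        _exit(0);
    }

    wait(NULL);                     /* wait for the child, then read its update */
    printf("Parent sees: %d\n", *shared);

    shmdt(shared);                  /* detach the segment ...  */
    shmctl(shmid, IPC_RMID, NULL);  /* ... and release it      */
    return 0;
}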
Lightweight Processes and Threads

● Threads: Smaller, faster execution units within a process, sharing the same
memory space.
● Advantages:
○ Low context-switching overhead
○ Easier data sharing (global memory access)
● Disadvantages:
○ Risk of race conditions and data corruption
○ Need for careful synchronization
● Use Cases: Scientific computing, real-time systems, high-performance computing
Directive-Based Programming Models

● Purpose: Simplify thread management and synchronization through high-level
constructs.
● Example Paradigms:
○ OpenMP: Uses compiler directives for parallel loops and sections.
○ Intel TBB: Template-based parallel programming.
○ Cilk: Task-based parallelism with work-stealing scheduler.
● Key Features:
○ Thread creation and management
○ Synchronization and reduction operations
Synchronization Techniques

● Locks & Mutexes: Ensure mutual exclusion, preventing multiple threads from
accessing critical sections simultaneously.
● Semaphores: Counters used to control access to shared resources.
● Barriers: Synchronization points where threads must wait for each other to
proceed.
● Atomic Operations: Low-level, fast synchronization primitives avoiding full lock
mechanisms.
● Condition Variables: Enable threads to wait for specific conditions before
continuing.
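
These primitives appear most directly in the Pthreads API discussed later. Below is a minimal
sketch, assuming the POSIX threads library; the names ready, worker, and the single waiting
thread are illustrative. A mutex protects shared state and a condition variable lets one thread
wait until another signals it.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;                    /* shared state guarded by `lock` */

static void *worker(void *arg) {
    pthread_mutex_lock(&lock);           /* mutual exclusion on `ready`        */
    while (!ready)                       /* condition variable: sleep until    */
        pthread_cond_wait(&cond, &lock); /* the condition is signalled         */
    printf("Worker: condition met\n");
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    pthread_mutex_lock(&lock);
    ready = 1;                           /* change the shared state ...  */
    pthread_cond_signal(&cond);          /* ... and wake the waiter      */
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}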
Performance Considerations

● Synchronization Overhead: Delays caused by locks, barriers, and other
coordination mechanisms.
● False Sharing: Cache inefficiency when threads inadvertently share cache lines.
● Load Balancing: Even distribution of work across threads to prevent bottlenecks.
● Granularity: Choosing appropriate task size to balance parallelism and overhead.
● Memory Bandwidth: Avoiding memory contention and optimizing cache usage.
Extensions and Future Directions

● Hybrid Models: Combining shared address space with message passing (e.g.,
MPI + OpenMP).
● Hardware-Assisted Synchronization: Leveraging modern CPUs' built-in atomic
instructions and cache coherence protocols.
● Adaptive Runtime Systems: Dynamically adjusting thread count and work
distribution based on runtime conditions.
● AI-Driven Scheduling: Using machine learning to predict and optimize parallel
task execution.
Threads
● Definition: A thread is a single stream of control within a program.
● Key Characteristics:
○ Independent sequence of instructions
○ Can run concurrently on multiple processors
● Example: Matrix multiplication, where each matrix element computation can be a
separate thread.
Threads in Matrix Multiplication
Sequential Code:

for (row = 0; row < n; row++)
    for (column = 0; column < n; column++)
        c[row][column] = dot_product(get_row(a, row), get_col(b, column));

Threaded Code:

for (row = 0; row < n; row++)
    for (column = 0; column < n; column++)
        c[row][column] = create_thread(dot_product(get_row(a, row), get_col(b, column)));
Explanation:

● Each iteration is a thread.


● Threads are created using create_thread.
● The system schedules these threads on available processors.
Logical Memory Model of a Thread
● Shared Address Space: All threads can access global memory.
● Thread-Local Memory: Stack variables are private to each thread.
● Why Stack is Local:
○ Threads are invoked as function calls.
○ Stack liveness is unpredictable due to runtime scheduling.
○ Treating stack data as global can lead to data corruption.
Memory Model Visualization
● Global Memory (Shared): Accessible by all threads.
● Local Memory (Private): Each thread has its own stack space
Significance of Thread
Software Portability
Key Advantage:
Threaded applications can run on both serial and parallel machines without modification.

Implications:

● Easier migration between platforms: You don’t need to change your code when switching
from a single-core to a multi-core machine. Threads handle parallelism naturally.
● Cost and resource savings: Less need for specialized, high-performance hardware (like
supercomputers) — multi-threading can optimize performance on standard machines.

Example:
Imagine you’re writing a program that processes images. Using threads, your program can work
on different parts of the image simultaneously. This works whether your machine has 2 cores or
16 cores, without changing the code.
Latency Hiding
Problem:
High overheads from memory access, I/O, and communication can slow down applications.

Solution:
Threads can mask latency. When one thread is waiting for data (like reading from a file or fetching
memory), other threads can continue executing, keeping the CPU busy.

Example:
In a web server, one thread can handle an incoming request while another waits for a database
response. This reduces idle time and increases throughput.
Scheduling and Load Balancing
Challenge:
Distributing work evenly across processors:

● Structured applications (like matrix multiplication): Easy to split tasks evenly.


● Dynamic or unstructured applications (like recursive algorithms): Harder to balance, as work
may grow unpredictably.

Threaded API Solution:

● Threads let you define many small concurrent tasks.


● The system dynamically maps these tasks to available processors, reducing idle time and
optimizing performance.

Example:
If you have 100 tasks and 4 processors, threads can dynamically assign new tasks to a processor as
soon as it finishes a previous one.
Ease of Programming & Widespread Use
Advantage:
Threaded programs are often easier to write and understand than message-passing programs (like MPI).

POSIX Thread API:

● Industry-standard for thread management in C/C++.


● Well-supported with libraries, documentation, and development tools.

Example:
Creating a thread in POSIX is as simple as calling pthread_create(), and you can join threads with
pthread_join(). This is much simpler than manually setting up message-passing protocols.
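
A minimal sketch of that create/join pattern (the thread count and the hello function are
illustrative); it compiles with a Pthreads-enabled compiler, e.g. gcc -pthread:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static void *hello(void *arg) {
    long id = (long) arg;                /* thread id passed as the argument */
    printf("Hello from thread %ld\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, hello, (void *) i);  /* spawn */

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);                        /* wait  */

    return 0;
}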
Trade-offs and Considerations
1. Performance Tuning:
Getting the best performance might still require effort — like deciding the optimal number of threads
or avoiding contention for shared resources.
2. Synchronization:
Threads share memory, so careful handling is required to prevent:
● Race conditions: Multiple threads access and modify the same variable simultaneously.
● Deadlocks: Threads wait indefinitely for resources locked by each other.
OpenMP
OpenMP (Open Multi-Processing) is a directive-based parallel programming API that
works with C, C++, and Fortran to simplify writing programs for shared address space
machines (i.e., systems where multiple processors share the same memory).

The key idea is that instead of manually handling threads (like in Pthreads), OpenMP
allows you to write compiler directives that tell the compiler how to split up tasks into
threads. This makes it much easier for application programmers to add parallelism to
their code!
Why Use OpenMP Instead of Pthreads?

While Pthreads (POSIX Threads) gives you fine-grained control, it’s low-level and
complex. You must:

● Create and manage threads


● Handle synchronization with mutexes and condition variables
● Manually define thread-local vs shared data

In contrast, OpenMP abstracts away this complexity. You use simple #pragma directives,
and OpenMP takes care of:

● Creating and managing threads


● Handling concurrency and synchronization
● Managing data scoping (private vs shared variables)
The OpenMP Programming Model

OpenMP programs generally follow this model:

1. Start with a single thread (master thread).


2. Encounter a parallel directive — OpenMP creates a team of threads.
3. Execute the parallel block with all threads.
4. Merge threads back into the master thread at the end of the parallel block.
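
A minimal sketch of this fork-join flow (the printed messages are illustrative), compiled with an
OpenMP-enabled compiler such as gcc -fopenmp; omp_get_thread_num() and omp_get_num_threads()
report each thread's id and the team size:

#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("Before: only the master thread is running\n");

    #pragma omp parallel                  /* fork: create a team of threads */
    {
        int id = omp_get_thread_num();    /* 0 .. team size - 1             */
        int n  = omp_get_num_threads();
        printf("Thread %d of %d in the parallel block\n", id, n);
    }                                     /* implicit barrier: team merges  */

    printf("After: back to the master thread only\n");
    return 0;
}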
Syntax of OpenMP Directives
Directives in OpenMP use the #pragma compiler directive

#pragma omp directive [clause list]

Here:

● #pragma omp: Tells the compiler it’s an OpenMP directive.


● directive: Specifies the parallel operation (e.g., parallel to create threads).
● clause list: Optional clauses to control parallelism (like number of threads, data handling).

Example:

#pragma omp parallel num_threads(4)
{
    printf("Hello from thread!\n");
}

This code prints the message 4 times because 4 threads are created!
What is a Pragma?

● Pragma: A compiler directive that provides additional information.


● In C/C++, a pragma starts with #.
● Syntax:
#pragma omp <directive>
● OpenMP pragmas tell the compiler how to parallelize sections of code.
The parallel Directive

The most basic OpenMP directive is parallel, which spawns multiple threads

#pragma omp parallel [clause list]

// Parallel region (executed by multiple threads)

How it works:

● The thread that reaches this directive becomes the master thread (ID = 0).
● New threads are created — number of threads can be specified or decided
automatically.
● Each thread runs the code inside the block independently.
The Parallel for Pragma
● #pragma omp parallel for: Tells the compiler to parallelize the following loop.

Example:

#pragma omp parallel for
for (i = first; i < size; i += prime)
    marked[i] = 1;

● Compiler generates code to split iterations across threads.


Loop Conditions for Parallelization
● For successful parallelization, loops must follow a canonical shape:
○ Single loop index variable.
○ Linear iteration pattern (e.g., index += inc).
○ No premature exits (break, return, exit).
Valid Loop Shapes:
for (index = start; index < end; index++)
for (index = start; index <= end; index += inc)

Invalid Loop Shape:

for (index = start; index < end; index++) {
    if (some_condition)
        break;    // Not allowed: premature exit
}
How OpenMP Handles Threads
● Master Thread: Creates and coordinates worker threads.
● Worker Threads: Execute chunks of iterations.

Each thread gets its own execution context:

● Private variables: Unique to each thread.


● Shared variables: Accessible by all threads.

Example:
int main() {
int b[10];
int i;
#pragma omp parallel for
for (i = 0; i < 10; i++)
b[i] = i;
}

i: Private

b: Shared
Scheduling Loops in OpenMP
● Scheduling: Determines how iterations are assigned to threads.
● Types of Scheduling:
○ Static: Equal-sized chunks assigned at compile time.
○ Dynamic: Iterations assigned at runtime.
○ Guided: Threads grab decreasing chunk sizes.

Example:

#pragma omp parallel for schedule(type)
for (int i = 0; i < 100; i++) {
    // Loop body
}
Static Scheduling
● Equal Partitioning: Iteration space divided into equal chunks.
● Syntax:
#pragma omp for schedule(static)
● Example: Matrix Multiplication (Outer loop parallelized)
#pragma omp parallel private(i, j, k) shared(a, b, c, dim) \
        num_threads(4)
#pragma omp for schedule(static)
for (i = 0; i < dim; i++) {
    for (j = 0; j < dim; j++) {
        c[i][j] = 0;
        for (k = 0; k < dim; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}
Dynamic Scheduling

● Adaptive Work Distribution: Chunks assigned to idle threads.


● Syntax:
schedule(dynamic[, chunk-size])
● Use Case: When loop iterations take variable time.

Guided Scheduling

● Exponential Chunk Reduction: Chunk size decreases as iterations progress.


● Syntax:
schedule(guided[, chunk-size])
● Advantage: Reduces idling by balancing workloads.
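
A minimal sketch contrasting these policies on iterations of unequal cost; the work() function,
array size, and chunk size 4 are illustrative. Replacing schedule(dynamic, 4) with
schedule(guided, 4) gives the decreasing-chunk behaviour.

#include <omp.h>
#include <stdio.h>

#define N 100

/* Simulated variable-cost iteration: larger i means more work. */
static double work(int i) {
    double s = 0.0;
    for (int k = 0; k < i * 10000; k++)
        s += k * 0.5;
    return s;
}

int main(void) {
    static double result[N];

    /* Idle threads pull the next chunk of 4 iterations at runtime instead
     * of receiving a fixed share up front, so expensive late iterations do
     * not all land on one thread. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++)
        result[i] = work(i);

    printf("result[N-1] = %f\n", result[N - 1]);
    return 0;
}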
omp_get_num_procs and omp_set_num_threads
omp_get_num_procs
● omp_get_num_procs returns the number of physical processors available for use by the parallel
program.
● Here is the function header: int omp_get_num_procs(void)
● The integer returned by this function may be less than the total number of physical processors in the
multiprocessor, depending on how the run-time system gives processes access to processors.
omp_set_num_threads
● Function omp_set_num_threads uses its parameter value to set the number of threads to be
active in parallel sections of code.
● It has this function header: void omp_set_num_threads(int t)
● Since this function may be called at multiple points in a program, you have the ability to tailor the level
of parallelism to the grain size or other characteristics of the code block.
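
A minimal sketch combining the two calls (the printed messages are illustrative): query the
available processors, then size the thread team accordingly.

#include <omp.h>
#include <stdio.h>

int main(void) {
    int procs = omp_get_num_procs();   /* processors available to the program */
    printf("Processors available: %d\n", procs);

    omp_set_num_threads(procs);        /* request one thread per processor    */

    #pragma omp parallel
    {
        #pragma omp single
        printf("Team size: %d\n", omp_get_num_threads());
    }
    return 0;
}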
Declaring Private Variable

● OpenMP allows parallelization of loops for performance gains.


● Variables accessed by threads need careful management.
● Default behavior: loop index is private, other variables are shared.

Problem: Shared variables can cause race conditions in parallel loops.

Solution: Use private clauses to control variable scope.


○ private(variable_list): Each thread gets its own copy of the variables.
○ firstprivate(variable_list): Same as private, but initializes variables with their original
values.
○ shared(variable_list): Variables are shared among threads (need synchronization!).
The Private Clause
● Syntax:
#pragma omp parallel for private(variable_list)
● Creates a private copy of variables for each thread.
● Example:
#pragma omp parallel for private(j)
for (i = 0; i < BLOCK_SIZE(id, p, n); i++)
for (j = 0; j < n; j++)
a[i][j] = min(a[i][j], a[i][k] + tmp[j]);

● The private copies of variable j will be accessible only inside the for loop. The values are undefined
on loop entry and exit.
● Even if j had a previously assigned value before entering the parallel for loop, none of the threads
can access that value. Similarly, whatever values the threads assign to j during execution of the
parallel for loop, the value of the shared j will not be affected.
The firstprivate Clause
● Sometimes we want a private variable to inherit the value of the shared variable.
● The firstprivate clause, with syntax
#pragma omp parallel for private(j) firstprivate(x)
● It directs the compiler to create private variables having initial values identical to the value of the
variable controlled by the master thread as the loop is entered.
● Example:
x[0] = complex_function();
#pragma omp parallel for private(j) firstprivate(x)
for (i = 0; i < n; i++) {
    for (j = 1; j < 4; j++)
        x[j] = g(i, x[j-1]);
    answer[i] = x[1] - x[3];
}
● With firstprivate(x), each thread's private copy of x starts with the master thread's values, so
every thread sees the x[0] computed by complex_function().
● Threads start with the same initial values but modify their own copies.
The Lastprivate Clause
● Saves the value of a private variable from the last sequential iteration.
● Useful for preserving final results of a computation.

Syntax:

#pragma omp parallel for private(j) lastprivate(x)

Example:

#pragma omp parallel for private(j) lastprivate(x)
for (i = 0; i < n; i++) {
    x[0] = 1.0;
    for (j = 1; j < 4; j++)
        x[j] = x[j-1] * (i + 1);
    sum_of_powers[i] = x[0] + x[1] + x[2] + x[3];
}
n_cubed = x[3];

In the sequentially last iteration of the loop, x[3] gets assigned the value n³. To make this value
accessible outside the parallel for loop, we must declare x to be a lastprivate variable.
Combining Clauses
● A parallel for pragma can use both firstprivate and lastprivate.
● Useful for initialization and final result capture.

Example:

#pragma omp parallel for private(j) firstprivate(x) lastprivate(y)

● x: Initialized from master thread, private for each thread.


● y: Private for each thread, but final value saved.
Loop Inversion and Performance Restoration
● Loop inversion is a technique used to improve the performance of parallel programs by restructuring
nested loops. It helps reduce synchronization overhead and improves cache efficiency.

Problem in the Original Loop

for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];

● The outer loop (i) cannot be parallelized because each iteration depends on the previous iteration (a[i-
1][j]).
● This introduces data dependencies that force serial execution.
● Parallelizing the inner loop (j) is possible, but it results in high fork-join overhead since the outer loop
runs sequentially.
Loop Inversion to Improve Performance

By swapping the order of loops:

#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
    for (i = 1; i < m; i++)
        a[i][j] = 2 * a[i-1][j];

● The outer loop (j) is now parallelized.


● The data dependency (across i) remains intact, but the number of fork-join steps is
reduced from m-1 to 1, significantly improving performance.
● Performance is restored by minimizing synchronization costs.
Why This Works?
● Row-wise Dependency: Since each row depends on the previous one, parallelizing i would
cause conflicts.
● Column Independence: Columns can be updated in parallel, ensuring better efficiency.
● Cache Efficiency: Since C uses row-major order, working with columns may reduce cache
locality but can still be beneficial depending on system architecture.
Performance Trade-Offs

● Synchronization Overhead: Drastically reduced, as only one parallel section is created
instead of multiple.
● Cache Hit Rate: Might reduce because elements are accessed column-wise, which
depends on m, n, and thread count.
● Memory Bandwidth: Improved since threads work independently on different parts of
the array.
single Pragma

● Problem: We want to parallelize the inner j loop while ensuring that some statements in each
iteration of the i loop (such as an error message) execute only once.
● Solution: #pragma omp single ensures only one thread executes a block.
● Syntax:

#pragma omp single


Code Example with single Pragma
Before using single:
#pragma omp parallel private(i, j)
for (i = 0; i < m; i++) {
    low = a[i];
    high = b[i];
    if (low > high) {
        printf("Exiting during iteration %d\n", i);
        break;
    }
    #pragma omp for
    for (j = low; j < high; j++)
        c[j] = (c[j] - a[i]) / b[i];
}
⚠ Issue: The error message might print multiple times.
Corrected Code with single Pragma
#pragma omp parallel private(i, j)
for (i = 0; i < m; i++) {
low = a[i];
high = b[i];
if (low > high) {
#pragma omp single
printf("Exiting during iteration %d\n", i);
break;
}
#pragma omp for
for (j = low; j < high; j++)
c[j] = (c[j] - a[i]) / b[i];
}
● Fix: single ensures that only one thread prints the message.
nowait Clause

Issue: OpenMP adds a synchronization barrier at the end of for loops.


Why is this a problem?

● All threads must finish before moving forward, even when unnecessary.

Solution: Use #pragma omp for nowait to skip unnecessary barriers.


Optimized Code with nowait
#pragma omp parallel private(i, j, low, high)
for (i = 0; i < m; i++) {
low = a[i];
high = b[i];
if (low > high) {
#pragma omp single
printf("Exiting during iteration %d\n", i);
break;
}
#pragma omp for nowait
for (j = low; j < high; j++)
c[j] = (c[j] - a[i]) / b[i];
}
● Optimization:
○ low and high are made private, so no dependencies exist.
○ nowait avoids unnecessary synchronization.
Race Conditions
● Race Condition: Occurs when multiple threads read and write shared data
unpredictably.
● Example: Computing π using numerical integration.
Example
#include <omp.h>
#include <stdio.h>
int main() {
int i, n = 10000;
double pi = 0.0, x;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
x = (i + 0.5) / n;
pi += 4.0 / (1.0 + x * x); // Race condition occurs here
}
pi /= n;
printf("Value of Pi: %f\n", pi);
return 0;
}


Problem: Multiple threads modify pi simultaneously, leading to incorrect results.

Why Do Race Conditions Occur?
● In parallel execution, threads operate independently and may interleave unpredictably.
● Example:
○ Thread A reads pi = 0.5.
○ Thread B reads pi = 0.5, computes, and updates pi = 0.7.
○ Thread A updates pi = 0.6, overwriting the update from Thread B.
● Final result is incorrect due to lost updates.
Using Critical Sections to Avoid Race Conditions
● Critical Section ensures only one thread accesses a shared variable at a time.
● OpenMP provides #pragma omp critical for this purpose.

Corrected Code using Critical Sections


#pragma omp critical

pi += 4.0 / (1.0 + x * x);

● Advantages:
○ Ensures correctness.
● Disadvantages:
○ Slows down performance due to serialization.
Performance Issues with Critical Sections
● Only one thread executes inside the critical section at a time, reducing parallel efficiency.
● Example performance comparison (table not reproduced): critical sections reduce parallel
efficiency as the number of threads increases.


Using OpenMP Reduction for Efficiency

● Reduction Clause provides a more efficient way to handle accumulation.


● Each thread maintains a local copy of the variable and performs partial accumulation.
● OpenMP combines these partial results at the end.

Corrected Code using OpenMP Reduction


#pragma omp parallel for private(x) reduction(+:pi)
for (i = 0; i < n; i++) {
    x = (i + 0.5) / n;
    pi += 4.0 / (1.0 + x * x);
}
● Advantages:
○ Eliminates race conditions.
○ More efficient than critical sections.
Comparison of Critical Sections vs. Reduction Clause

Reduction is preferred for summation and similar operations.


Performance Analysis of Reduction vs Critical Section

● Comparison of execution time using reduction vs. critical sections (chart not reproduced).

Observation: Reduction scales better with the number of threads.


Parallel Section

● A section of code that can be executed concurrently by multiple threads.


● Allows functional parallelism where different functions execute simultaneously.
● Syntax:

#pragma omp parallel sections
{
    #pragma omp section
    function1();

    #pragma omp section
    function2();
}
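
A complete, runnable version of the snippet above; function1 and function2 are illustrative
stand-ins that simply report which thread ran them.

#include <omp.h>
#include <stdio.h>

static void function1(void) {
    printf("function1 on thread %d\n", omp_get_thread_num());
}

static void function2(void) {
    printf("function2 on thread %d\n", omp_get_thread_num());
}

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section        /* one thread runs this section   */
        function1();

        #pragma omp section        /* another thread runs this one   */
        function2();
    }                              /* implicit barrier after sections */
    return 0;
}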
Atomic Operations in OpenMP
● Atomic operations ensure that a memory operation (such as incrementing a variable) is performed
without interruption from other threads.
● They provide a lightweight alternative to critical sections, using hardware-level atomicity for
efficiency.
● Reduces overhead compared to #pragma omp critical or omp_lock_t.
● Ensures thread safety for simple operations like incrementing counters.

Syntax:
#pragma omp atomic
var += value;
Example:

#include <omp.h>
#include <stdio.h>
int main() {
int sum = 0;
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
#pragma omp atomic
sum += i;
}
printf("Final Sum: %d\n", sum);
return 0;
}

● #pragma omp atomic ensures that sum += i; is executed atomically.


● The hardware ensures no two threads update sum simultaneously, avoiding data races.
Locks in OpenMP
● Locks provide explicit control over thread synchronization.
● Unlike #pragma omp critical, which applies to a block, locks are more flexible and allow fine-
grained control.
● More flexible than critical sections.
● Useful when multiple critical regions need separate control.

Lock Functions:

● Initialize a lock
● Acquire a lock
● Release a lock
● Destroy a lock
Initialize a lock:
omp_lock_t lock;
omp_init_lock(&lock);

Acquire a lock:
omp_set_lock(&lock);

Release a lock:
omp_unset_lock(&lock);

Destroy a lock:
omp_destroy_lock(&lock);
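
A minimal sketch tying the four calls together (the shared counter is illustrative): the lock
plays the same role as a critical section but can be acquired and released anywhere.

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_lock_t lock;
    int counter = 0;

    omp_init_lock(&lock);            /* initialize the lock */

    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        omp_set_lock(&lock);         /* acquire: only one thread proceeds */
        counter++;                   /* protected update of shared data   */
        omp_unset_lock(&lock);       /* release */
    }

    omp_destroy_lock(&lock);         /* destroy when no longer needed */
    printf("counter = %d\n", counter);
    return 0;
}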
Performance Consideration

(Performance comparison chart from this slide not reproduced.)
Tasks in OpenMP

What is a Task in OpenMP?


● A task in OpenMP is an independent unit of work that can be executed by a thread.
● Tasks allow for dynamic parallelism, where the division of work happens at runtime instead of compile
time.

Why Use Tasks?


● Efficient for irregular workloads (e.g., recursive algorithms, graph processing).
● Better load balancing as tasks are dynamically assigned to threads.
● Overcomes limitations of static and loop-based parallelism.
Basic Syntax of OpenMP Task Directive
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
function1(); // Task 1
#pragma omp task
function2(); // Task 2
}
}

● #pragma omp single: Ensures tasks are created only once.

● #pragma omp task: Defines a unit of work.


Task Queue in OpenMP
A task queue is a FIFO (First In, First Out) queue where OpenMP stores tasks before execution. These tasks
are then assigned to worker threads dynamically.

How Tasks Enter the Queue


● When #pragma omp task is encountered inside a parallel region, OpenMP creates a task and places it
in the task queue.
● The task is not executed immediately but waits in the queue.
● Threads pick up tasks whenever they are free.
struct node *head;    /* head of the list (assumed initialized elsewhere) */
struct node *ptr;

#pragma omp parallel
#pragma omp single
for (ptr = head; ptr != NULL; ptr = ptr->next) {
    #pragma omp task firstprivate(ptr)   /* each task keeps its own copy of ptr */
    compute_inverse(ptr->matrix);
}
Task Execution in OpenMP

Task Execution Process


1. Task Creation:
○ When a thread encounters #pragma omp task, it creates a new task and adds it to the task
queue.
2. Task Scheduling:
○ A worker thread fetches a task from the queue when it becomes idle.
○ If a thread runs out of tasks in its own queue, it may steal tasks from other queues (work-stealing).
3. Task Execution:
○ The thread executes the task asynchronously.
○ Other threads continue executing their assigned tasks.
4. Task Completion and Synchronization:
○ If #pragma omp taskwait is used, a thread waits for all its created tasks to complete before
proceeding.
Completion of Tasks
Using #pragma omp barrier
● Ensures all threads in a parallel region wait until all tasks finish.
#include <stdio.h>
#include <omp.h>
int main() {
int x = 5;
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{
x += 10;
printf("Task: x = %d\n", x);
}
}
#pragma omp barrier // Ensures all tasks complete
}
printf("Main: x = %d\n", x);
return 0;
}
Recursive Task Spawning in OpenMP
Recursive task spawning in OpenMP occurs when tasks create more tasks inside a recursive function. This is
useful for problems like Fibonacci sequence, merge sort, and tree traversal, where subproblems can be
solved in parallel.
#include <iostream>
#include <omp.h>

int fib(int n) {
    if (n <= 1) return n;

    int x, y;

    #pragma omp task shared(x)
    x = fib(n - 1);          // Spawn task for fib(n-1)

    #pragma omp task shared(y)
    y = fib(n - 2);          // Spawn task for fib(n-2)

    #pragma omp taskwait     // Ensure both tasks complete before returning
    return x + y;
}

int main() {
    int n = 10, result;

    #pragma omp parallel
    {
        #pragma omp single   // Ensure only one thread starts the recursion
        result = fib(n);
    }

    std::cout << "Fibonacci(" << n << ") = " << result << std::endl;
    return 0;
}
Pitfalls of Recursive Task Spawning
(1) Excessive Task Creation (Overhead)

● Recursive calls generate an exponential number of tasks.


● Tasks may be too fine-grained, leading to high scheduling overhead.

Solution: Use a threshold to limit task creation.

if (n > 20) {    // Use tasks only for large n
    #pragma omp task shared(x)
    x = fib(n - 1);

    #pragma omp task shared(y)
    y = fib(n - 2);

    #pragma omp taskwait
} else {         // Small subproblems: plain recursion, no task overhead
    x = fib(n - 1);
    y = fib(n - 2);
}
(2) Load Imbalance

● Some tasks may finish quickly (e.g., fib(2)) while others take much longer (fib(40)).
● Leads to underutilization of some threads.

Solution: Use dynamic scheduling or balance task workload using task dependencies.

(3) Stack Overflow (Deep Recursion)

● If recursion depth is too large, stack overflow may occur.

Solution: Convert to iterative parallelism where possible.

(4) Lack of Synchronization (taskwait Required)

● Without #pragma omp taskwait, results may be incomplete or incorrect.


● Since tasks are executed asynchronously, returning too early may cause wrong outputs.

Solution: Always use taskwait before returning from a recursive function


(5) Nested Parallelism May Not Work Well

● OpenMP may not create new threads in deeply nested parallel regions due to thread limits.
● Default OMP_NUM_THREADS may not be enough.

Solution: Enable nested parallelism and allow more active nesting levels:

omp_set_nested(1);
omp_set_max_active_levels(4); // Allows deeper nesting

