Parallel Programming Unit 2
● Threads: Smaller, faster execution units within a process, sharing the same
memory space.
● Advantages:
○ Low context-switching overhead
○ Easier data sharing (global memory access)
● Disadvantages:
○ Risk of race conditions and data corruption
○ Need for careful synchronization
● Use Cases: Scientific computing, real-time systems, high-performance computing
Synchronization Mechanisms
● Locks & Mutexes: Ensure mutual exclusion, preventing multiple threads from
accessing critical sections simultaneously.
● Semaphores: Counters used to control access to shared resources.
● Barriers: Synchronization points where threads must wait for each other to
proceed.
● Atomic Operations: Low-level, fast synchronization primitives avoiding full lock
mechanisms.
● Condition Variables: Enable threads to wait for specific conditions before
continuing.
Performance Considerations
● Hybrid Models: Combining shared address space with message passing (e.g.,
MPI + OpenMP).
● Hardware-Assisted Synchronization: Leveraging modern CPUs' built-in atomic
instructions and cache coherence protocols.
● Adaptive Runtime Systems: Dynamically adjusting thread count and work
distribution based on runtime conditions.
● AI-Driven Scheduling: Using machine learning to predict and optimize parallel
task execution.
Threads
● Definition: A thread is a single stream of control within a program.
● Key Characteristics:
○ Independent sequence of instructions
○ Can run concurrently on multiple processors
● Example: Matrix multiplication, where each matrix element computation can be a
separate thread.
Threads in Matrix Multiplication
Sequential Code:
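A minimal sketch of what the sequential version might look like (array names and the size N are illustrative):
#define N 128
double A[N][N], B[N][N], C[N][N];

void matmul_seq(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)        /* dot product of row i of A and column j of B */
                C[i][j] += A[i][k] * B[k][j];
        }
}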
Threaded Code:
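A possible Pthreads version (a sketch, not necessarily the original example): here each thread computes one full row of C rather than a single element, which keeps the idea while limiting the number of threads. It reuses the same illustrative arrays as the sequential sketch.
#include <pthread.h>

#define N 128
double A[N][N], B[N][N], C[N][N];

void *row_worker(void *arg) {
    int i = *(int *)arg;                       /* the row this thread is responsible for */
    for (int j = 0; j < N; j++) {
        C[i][j] = 0.0;
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
    }
    return NULL;
}

void matmul_threaded(void) {
    pthread_t tid[N];
    int rows[N];
    for (int i = 0; i < N; i++) {              /* one thread per row of the result */
        rows[i] = i;
        pthread_create(&tid[i], NULL, row_worker, &rows[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);            /* wait for all rows to be computed */
}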
Implications:
● Easier migration between platforms: You don’t need to change your code when switching
from a single-core to a multi-core machine. Threads handle parallelism naturally.
● Cost and resource savings: Less need for specialized, high-performance hardware (like
supercomputers) — multi-threading can optimize performance on standard machines.
Example:
Imagine you’re writing a program that processes images. Using threads, your program can work
on different parts of the image simultaneously. This works whether your machine has 2 cores or
16 cores, without changing the code.
Latency Hiding
Problem:
High overheads from memory access, I/O, and communication can slow down applications.
Solution:
Threads can mask latency. When one thread is waiting for data (like reading from a file or fetching
memory), other threads can continue executing, keeping the CPU busy.
Example:
In a web server, one thread can handle an incoming request while another waits for a database
response. This reduces idle time and increases throughput.
Scheduling and Load Balancing
Challenge:
Distributing work evenly across processors so that no processor sits idle while others are still busy.
Example:
If you have 100 tasks and 4 processors, threads can dynamically assign new tasks to a processor as
soon as it finishes a previous one.
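A minimal sketch of this dynamic assignment with Pthreads: worker threads repeatedly claim the next task index from a shared counter. NUM_TASKS, NUM_THREADS, and process_task() are illustrative placeholders.
#include <pthread.h>
#include <stdio.h>

#define NUM_TASKS 100
#define NUM_THREADS 4

static int next_task = 0;                          /* shared work counter */
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;

static void process_task(int id) { (void)id; /* placeholder for real work */ }

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&task_lock);
        int t = next_task++;                       /* claim the next unprocessed task */
        pthread_mutex_unlock(&task_lock);
        if (t >= NUM_TASKS) break;                 /* nothing left to do */
        process_task(t);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("All %d tasks processed\n", NUM_TASKS);
    return 0;
}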
Ease of Programming & Widespread Use
Advantage:
Threaded programs are often easier to write and understand than message-passing programs (like MPI).
Example:
Creating a thread in POSIX is as simple as calling pthread_create(), and you can join threads with
pthread_join(). This is much simpler than manually setting up message-passing protocols.
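A minimal sketch of that pattern (the thread function say_hello and its message are illustrative):
#include <pthread.h>
#include <stdio.h>

void *say_hello(void *arg) {                       /* function executed by the new thread */
    printf("Hello from thread %ld\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, say_hello, (void *)1L);  /* spawn the thread */
    pthread_join(tid, NULL);                             /* wait for it to finish */
    return 0;
}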
Trade-offs and Considerations
1. Performance Tuning:
Getting the best performance might still require effort — like deciding the optimal number of threads
or avoiding contention for shared resources.
2. Synchronization:
Threads share memory, so careful handling is required to prevent:
● Race conditions: Multiple threads access and modify the same variable simultaneously.
● Deadlocks: Threads wait indefinitely for resources locked by each other.
OpenMP
OpenMP (Open Multi-Processing) is a directive-based parallel programming API that
works with C, C++, and Fortran to simplify writing programs for shared address space
machines (i.e., systems where multiple processors share the same memory).
The key idea is that instead of manually handling threads (like in Pthreads), OpenMP
allows you to write compiler directives that tell the compiler how to split up tasks into
threads. This makes it much easier for application programmers to add parallelism to
their code!
Why Use OpenMP Instead of Pthreads?
While Pthreads (POSIX Threads) gives you fine-grained control, it is low-level and
complex. You must:
● Create and join each thread explicitly (pthread_create, pthread_join)
● Divide the work among the threads yourself
● Handle synchronization (mutexes, condition variables) by hand
In contrast, OpenMP abstracts away this complexity. You use simple #pragma directives,
and OpenMP takes care of:
● Creating and managing the threads
● Distributing loop iterations or code blocks among them
● Joining the threads at the end of the parallel region
Example:
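A minimal sketch of the kind of code being described (the message text is illustrative; num_threads(4) fixes the team size at four):
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel num_threads(4)        /* creates a team of 4 threads */
    printf("Hello from an OpenMP thread\n");
    return 0;
}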
This code prints the message 4 times because 4 threads are created!
What is a Pragma?
A pragma is a compiler directive: extra information embedded in the source code that tells the compiler how to process it, and that compilers without OpenMP support simply ignore. The most basic OpenMP directive is parallel, which spawns a team of threads to execute the block that follows.
How it works:
● The thread that reaches this directive becomes the master thread (ID = 0).
● New threads are created — number of threads can be specified or decided
automatically.
● Each thread runs the code inside the block independently.
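A small sketch illustrating these points: omp_get_thread_num() returns the calling thread's ID (0 for the master) and omp_get_num_threads() the size of the team.
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel                       /* thread count chosen automatically here */
    {
        int id = omp_get_thread_num();         /* 0 = master thread */
        int nthreads = omp_get_num_threads();
        printf("Thread %d of %d is running the block\n", id, nthreads);
    }
    return 0;
}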
The Parallel for Pragma
● #pragma omp parallel for: Tells the compiler to parallelize the following loop.
Example:
int main() {
    int b[10];
    int i;
    #pragma omp parallel for       /* iterations of the loop are divided among the threads */
    for (i = 0; i < 10; i++)
        b[i] = i;
}
● i: private. Each thread gets its own copy of the loop index.
● b: shared. All threads write to (different elements of) the same array.
Scheduling Loops in OpenMP
● Scheduling: Determines how iterations are assigned to threads.
● Types of Scheduling:
○ Static: Iterations are split into equal-sized chunks that are assigned to threads before the loop executes (lowest overhead).
○ Dynamic: Iterations are handed out at runtime as threads become free (better load balancing, more overhead).
○ Guided: Threads grab chunks whose size decreases as the loop progresses (a compromise between the two).
Example:
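A sketch of how a schedule clause is attached to a parallel loop (the chunk size 4 and the loop body are illustrative):
#include <omp.h>
#include <stdio.h>

int main() {
    int i, n = 16;
    /* dynamic scheduling with chunks of 4 iterations; replace with
       schedule(static) or schedule(guided, 4) to compare behaviours */
    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++)
        printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
    return 0;
}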
Guided Scheduling
● Like dynamic scheduling, but the chunk size starts large and shrinks as the loop progresses, reducing scheduling overhead while still balancing the load.
The private Clause
● The private copies of variable j will be accessible only inside the for loop. The values are undefined
on loop entry and exit.
● Even if j had a value assigned before the parallel for loop, none of the threads can access that value. Similarly, whatever values the threads assign to their private copies of j during the loop, the shared j is not affected.
The firstprivate Clause
● Sometimes we want a private variable to inherit the value of the shared variable.
● The firstprivate clause, with syntax
#pragma omp parallel for private(j) firstprivate(x)
directs the compiler to create private variables whose initial values are identical to the value of the variable controlled by the master thread as the loop is entered.
● Example:
x[0] = complex_function();
#pragma omp parallel for private(j) firstprivate(x)
for (i = 0; i < n; i++) {
    for (j = 1; j < 4; j++)
        x[j] = g(i, x[j-1]);
    answer[i] = x[1] - x[3];
}
● Because x is firstprivate, each thread's private copy of x starts with the values the master thread set before the loop (including x[0] from complex_function()).
● All threads therefore begin with the same initial value but modify their own copies independently.
The lastprivate Clause
● Saves the value of a private variable from the last sequential iteration.
● Useful for preserving final results of a computation.
Syntax:
#pragma omp parallel for lastprivate(x)
Example:
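A sketch consistent with the explanation below, in the style of the firstprivate example: x[j] ends up holding (i+1)^j, so the sequentially last iteration leaves x[3] equal to n³. The array sum_of_powers and variable n_cubed are illustrative names.
double x[4], sum_of_powers[1000], n_cubed;
int i, j, n = 1000;
#pragma omp parallel for private(j) lastprivate(x)
for (i = 0; i < n; i++) {
    x[0] = 1.0;
    for (j = 1; j < 4; j++)
        x[j] = x[j-1] * (i + 1);           /* x[j] = (i+1)^j */
    sum_of_powers[i] = x[0] + x[1] + x[2] + x[3];
}
n_cubed = x[3];                            /* defined outside the loop because x is lastprivate */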
In the sequentially last iteration of the loop, x[3] gets assigned the value n³. To make this value accessible outside the parallel for loop, we must declare x to be a lastprivate variable.
Combining Clauses
● A parallel for pragma can use both firstprivate and lastprivate.
● Useful for initialization and final result capture.
Example:
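A sketch showing both clauses on one pragma, reusing the firstprivate example above (a variable may appear in both a firstprivate and a lastprivate clause):
x[0] = complex_function();
#pragma omp parallel for private(j) firstprivate(x) lastprivate(x)
for (i = 0; i < n; i++) {
    for (j = 1; j < 4; j++)
        x[j] = g(i, x[j-1]);               /* each thread starts from the master's x */
    answer[i] = x[1] - x[3];
}
/* after the loop, the shared x holds the values computed in the last iteration */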
Data Dependencies Between Loop Iterations
Consider a nested loop of the form
for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];
● The outer loop (i) cannot be parallelized because each iteration depends on the previous iteration (a[i-1][j]).
● This introduces data dependencies that force the outer loop to execute serially.
● Parallelizing the inner loop (j) is possible, but it results in high fork-join overhead because the parallel region is entered and exited once for every iteration of the sequential outer loop.
Loop Inversion to Improve Performance
● After inverting the two loops, the j loop becomes the outer loop and can be parallelized, so the fork-join cost is paid only once:
#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
    for (i = 1; i < m; i++)
        a[i][j] = 2 * a[i-1][j];
The single Pragma
● Problem: We need to parallelize the loop over j but ensure that some parts of the work execute only once.
● Solution: #pragma omp single ensures that only one thread executes the enclosed block.
● Syntax: place #pragma omp single immediately before the block (or statement) that only one thread should execute.
● There is an implicit barrier at the end of the single block: all threads must wait there before moving forward, even when unnecessary (the nowait clause removes this barrier).
Race Conditions
● Problem: Multiple threads modify a shared variable such as pi simultaneously, leading to incorrect results.
Why Do Race Conditions Occur?
● In parallel execution, threads operate independently and may interleave unpredictably.
● Example:
○ Thread A reads pi = 0.5.
○ Thread B reads pi = 0.5, computes, and updates pi = 0.7.
○ Thread A updates pi = 0.6, overwriting the update from Thread B.
● Final result is incorrect due to lost updates.
Using Critical Sections to Avoid Race Conditions
● Critical Section ensures only one thread accesses a shared variable at a time.
● OpenMP provides #pragma omp critical for this purpose.
● Advantages:
○ Ensures correctness.
● Disadvantages:
○ Slows down performance due to serialization.
Performance Issues with Critical Sections
● Only one thread executes inside the critical section at a time, reducing parallel efficiency.
● Example performance comparison (computing π by numerical integration): with a critical section, every iteration must enter the critical section to update pi, which largely serializes the loop:
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
    x = (i + 0.5) / n;
    #pragma omp critical
    pi += 4.0 / (n * (1.0 + x * x));
}
The reduction Clause
● The reduction(operator:variable) clause gives each thread a private copy of the variable; the private copies are combined with the operator when the loop finishes.
● Advantages:
○ Eliminates race conditions.
○ More efficient than critical sections.
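A runnable sketch of the π computation using the reduction clause instead of a critical section (the number of intervals n is illustrative):
#include <omp.h>
#include <stdio.h>

int main() {
    int i, n = 1000000;
    double x, pi = 0.0;
    /* reduction(+:pi): every thread accumulates into its own private copy of pi,
       and the copies are summed once when the loop finishes */
    #pragma omp parallel for private(x) reduction(+:pi)
    for (i = 0; i < n; i++) {
        x = (i + 0.5) / n;                 /* midpoint of subinterval i */
        pi += 4.0 / (n * (1.0 + x * x));
    }
    printf("pi is approximately %f\n", pi);
    return 0;
}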
Comparison of Critical Sections vs. Reduction Clause
● A critical section serializes the protected code on every iteration: it guarantees correctness but limits speedup.
● A reduction lets every thread run at full speed on its private copy and pays the combination cost only once, at the end of the loop.
The atomic Directive
● #pragma omp atomic protects a single simple update of a shared variable (e.g., +=, -=, ++) using hardware atomic instructions, which is cheaper than a full critical section.
Syntax:
#pragma omp atomic
var += value;
Example:
#include <omp.h>
#include <stdio.h>

int main() {
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        #pragma omp atomic             /* the update of sum is performed atomically */
        sum += i;
    }
    printf("Final Sum: %d\n", sum);
    return 0;
}
Lock Functions:
● Initialize a lock
● Acquire a lock
● Release a lock
● Destroy a lock
Initialize a lock:
omp_lock_t lock;
omp_init_lock(&lock);
Acquire a lock:
omp_set_lock(&lock);
Release a lock:
omp_unset_lock(&lock);
Destroy a lock:
omp_destroy_lock(&lock);
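A minimal sketch putting the four lock calls together (the shared counter is illustrative):
#include <omp.h>
#include <stdio.h>

int main() {
    int count = 0;
    omp_lock_t lock;
    omp_init_lock(&lock);                  /* 1. initialize */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        omp_set_lock(&lock);               /* 2. acquire: only one thread proceeds */
        count++;                           /*    protected update of shared data */
        omp_unset_lock(&lock);             /* 3. release */
    }
    omp_destroy_lock(&lock);               /* 4. destroy when no longer needed */
    printf("count = %d\n", count);
    return 0;
}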
Performance Consideration
Tasks in OpenMP
{ compute_inverse(ptr-> matrix)
}
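A sketch of a typical use, assuming a linked list of matrices (the struct node type, process_list(), and the compute_inverse() stub are illustrative):
#include <omp.h>

struct node { double *matrix; struct node *next; };

void compute_inverse(double *matrix) { (void)matrix; /* placeholder for real work */ }

void process_list(struct node *head) {
    #pragma omp parallel
    #pragma omp single                          /* one thread walks the list and creates the tasks */
    for (struct node *ptr = head; ptr != NULL; ptr = ptr->next) {
        #pragma omp task firstprivate(ptr)      /* each node becomes an independent task */
        compute_inverse(ptr->matrix);
    }
}                                               /* all tasks finish at the barrier ending the parallel region */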
Task Execution in OpenMP
#include <omp.h>
#include <iostream>

int fib(int n) {
    if (n <= 1) return n;
    int x, y;
    #pragma omp task shared(x)      // compute fib(n-1) as a child task
    x = fib(n - 1);
    #pragma omp task shared(y)      // compute fib(n-2) as a child task
    y = fib(n - 2);
    #pragma omp taskwait            // wait for both child tasks to finish
    return x + y;
}

int main() {
    int n = 10, result;
    #pragma omp parallel
    #pragma omp single              // a single thread spawns the root of the task tree
    result = fib(n);
    std::cout << "Fibonacci(" << n << ") = " << result << std::endl;
    return 0;
}
Pitfalls of Recursive Task Spawning
(1) Excessive Task Creation (Overhead)
● Every recursive call spawns two new tasks, so the number of tasks grows exponentially and the scheduling overhead can outweigh the useful work.
● Solution: use a cutoff so that small subproblems are solved serially and tasks are spawned only for large ones (the cutoff threshold below is illustrative):
if (n < cutoff) {                 /* small problem: compute serially, no task overhead */
    x = fib(n - 1);
    y = fib(n - 2);
} else {                          /* large problem: spawn child tasks */
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
}
(2) Load Imbalance
● Some tasks may finish quickly (e.g., fib(2)) while others take much longer (fib(40)).
● Leads to underutilization of some threads.
Solution: Use dynamic scheduling or balance task workload using task dependencies.
(3) Nested Parallelism Limits
● OpenMP may not create new threads in deeply nested parallel regions due to thread limits.
● The default OMP_NUM_THREADS setting may not provide enough threads.
● Solution: enable nested parallelism explicitly with omp_set_nested(1); (newer OpenMP versions use omp_set_max_active_levels() instead) and raise the thread limit if necessary.