Parallel Programming Unit 2
● Threads: Smaller, faster execution units within a process, sharing the same
memory space.
● Advantages:
○ Low context-switching overhead
○ Easier data sharing (global memory access)
● Disadvantages:
○ Risk of race conditions and data corruption
○ Need for careful synchronization
● Use Cases: Scientific computing, real-time systems, high-performance computing
Synchronization Mechanisms
● Locks & Mutexes: Ensure mutual exclusion, preventing multiple threads from
accessing critical sections simultaneously.
● Semaphores: Counters used to control access to shared resources.
● Barriers: Synchronization points where threads must wait for each other to
proceed.
● Atomic Operations: Low-level, fast synchronization primitives avoiding full lock
mechanisms.
● Condition Variables: Enable threads to wait for specific conditions before
continuing.
Performance Considerations
● Hybrid Models: Combining shared address space with message passing (e.g.,
MPI + OpenMP).
● Hardware-Assisted Synchronization: Leveraging modern CPUs' built-in atomic
instructions and cache coherence protocols.
● Adaptive Runtime Systems: Dynamically adjusting thread count and work
distribution based on runtime conditions.
● AI-Driven Scheduling: Using machine learning to predict and optimize parallel
task execution.
Threads
● Definition: A thread is a single stream of control within a program.
● Key Characteristics:
○ Independent sequence of instructions
○ Can run concurrently on multiple processors
● Example: Matrix multiplication, where each matrix element computation can be a
separate thread.
Threads in Matrix Multiplication
Sequential Code:
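A minimal sketch of what the sequential version might look like (array names and the size N are illustrative):
#define N 128
double A[N][N], B[N][N], C[N][N];

void matmul_seq(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)        /* dot product of row i of A and column j of B */
                C[i][j] += A[i][k] * B[k][j];
        }
}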
Threaded Code:
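A possible Pthreads version (a sketch, not necessarily the original example): here each thread computes one full row of C rather than a single element, which keeps the idea while limiting the number of threads. It reuses the same illustrative arrays as the sequential sketch.
#include <pthread.h>

#define N 128
double A[N][N], B[N][N], C[N][N];

void *row_worker(void *arg) {
    int i = *(int *)arg;                       /* the row this thread is responsible for */
    for (int j = 0; j < N; j++) {
        C[i][j] = 0.0;
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
    }
    return NULL;
}

void matmul_threaded(void) {
    pthread_t tid[N];
    int rows[N];
    for (int i = 0; i < N; i++) {              /* one thread per row of the result */
        rows[i] = i;
        pthread_create(&tid[i], NULL, row_worker, &rows[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);            /* wait for all rows to be computed */
}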
Implications:
● Easier migration between platforms: You don’t need to change your code when switching
from a single-core to a multi-core machine. Threads handle parallelism naturally.
● Cost and resource savings: Less need for specialized, high-performance hardware (like
supercomputers) — multi-threading can optimize performance on standard machines.
Example:
Imagine you’re writing a program that processes images. Using threads, your program can work
on different parts of the image simultaneously. This works whether your machine has 2 cores or
16 cores, without changing the code.
Latency Hiding
Problem:
High overheads from memory access, I/O, and communication can slow down applications.
Solution:
Threads can mask latency. When one thread is waiting for data (like reading from a file or fetching
memory), other threads can continue executing, keeping the CPU busy.
Example:
In a web server, one thread can handle an incoming request while another waits for a database
response. This reduces idle time and increases throughput.
Scheduling and Load Balancing
Challenge:
Distributing work evenly across processors so that no processor sits idle while others are still busy.
Example:
If you have 100 tasks and 4 processors, threads can dynamically assign new tasks to a processor as
soon as it finishes a previous one.
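A minimal sketch of this dynamic assignment with Pthreads: worker threads repeatedly claim the next task index from a shared counter. NUM_TASKS, NUM_THREADS, and process_task() are illustrative placeholders.
#include <pthread.h>
#include <stdio.h>

#define NUM_TASKS 100
#define NUM_THREADS 4

static int next_task = 0;                          /* shared work counter */
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;

static void process_task(int id) { (void)id; /* placeholder for real work */ }

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&task_lock);
        int t = next_task++;                       /* claim the next unprocessed task */
        pthread_mutex_unlock(&task_lock);
        if (t >= NUM_TASKS) break;                 /* nothing left to do */
        process_task(t);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("All %d tasks processed\n", NUM_TASKS);
    return 0;
}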
Ease of Programming & Widespread Use
Advantage:
Threaded programs are often easier to write and understand than message-passing programs (like MPI).
Example:
Creating a thread in POSIX is as simple as calling pthread_create(), and you can join threads with
pthread_join(). This is much simpler than manually setting up message-passing protocols.
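A minimal sketch of that pattern (the thread function say_hello and its message are illustrative):
#include <pthread.h>
#include <stdio.h>

void *say_hello(void *arg) {                       /* function executed by the new thread */
    printf("Hello from thread %ld\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, say_hello, (void *)1L);  /* spawn the thread */
    pthread_join(tid, NULL);                             /* wait for it to finish */
    return 0;
}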
Trade-offs and Considerations
1. Performance Tuning:
Getting the best performance might still require effort — like deciding the optimal number of threads
or avoiding contention for shared resources.
2. Synchronization:
Threads share memory, so careful handling is required to prevent:
● Race conditions: Multiple threads access and modify the same variable simultaneously.
● Deadlocks: Threads wait indefinitely for resources locked by each other.
OpenMP
OpenMP (Open Multi-Processing) is a directive-based parallel programming API that
works with C, C++, and Fortran to simplify writing programs for shared address space
machines (i.e., systems where multiple processors share the same memory).
The key idea is that instead of manually handling threads (like in Pthreads), OpenMP
allows you to write compiler directives that tell the compiler how to split up tasks into
threads. This makes it much easier for application programmers to add parallelism to
their code!
Why Use OpenMP Instead of Pthreads?
While Pthreads (POSIX Threads) gives you fine-grained control, it is low-level and
complex. You must:
● Create and join each thread explicitly (pthread_create, pthread_join)
● Divide the work among the threads yourself
● Handle synchronization (mutexes, condition variables) by hand
In contrast, OpenMP abstracts away this complexity. You use simple #pragma directives,
and OpenMP takes care of:
● Creating and managing the threads
● Distributing loop iterations or code blocks among them
● Joining the threads at the end of the parallel region
Example:
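A minimal sketch of the kind of code being described (the message text is illustrative; num_threads(4) fixes the team size at four):
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel num_threads(4)        /* creates a team of 4 threads */
    printf("Hello from an OpenMP thread\n");
    return 0;
}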
This code prints the message 4 times because 4 threads are created!
What is a Pragma?
A pragma is a compiler directive: extra information embedded in the source code that tells the compiler how to process it, and that compilers without OpenMP support simply ignore. The most basic OpenMP directive is parallel, which spawns a team of threads to execute the block that follows.
How it works:
● The thread that reaches this directive becomes the master thread (ID = 0).
● New threads are created — number of threads can be specified or decided
automatically.
● Each thread runs the code inside the block independently.
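A small sketch illustrating these points: omp_get_thread_num() returns the calling thread's ID (0 for the master) and omp_get_num_threads() the size of the team.
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel                       /* thread count chosen automatically here */
    {
        int id = omp_get_thread_num();         /* 0 = master thread */
        int nthreads = omp_get_num_threads();
        printf("Thread %d of %d is running the block\n", id, nthreads);
    }
    return 0;
}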
The Parallel for Pragma
● #pragma omp parallel for: Tells the compiler to parallelize the following loop.
Example:
int main() {
    int b[10];
    int i;
    #pragma omp parallel for       /* iterations of the loop are divided among the threads */
    for (i = 0; i < 10; i++)
        b[i] = i;
}
● i: private. Each thread gets its own copy of the loop index.
● b: shared. All threads write to (different elements of) the same array.
Scheduling Loops in OpenMP
● Scheduling: Determines how iterations are assigned to threads.
● Types of Scheduling:
○ Static: Iterations are split into equal-sized chunks that are assigned to threads before the loop executes (lowest overhead).
○ Dynamic: Iterations are handed out at runtime as threads become free (better load balancing, more overhead).
○ Guided: Threads grab chunks whose size decreases as the loop progresses (a compromise between the two).
Example:
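A sketch of how a schedule clause is attached to a parallel loop (the chunk size 4 and the loop body are illustrative):
#include <omp.h>
#include <stdio.h>

int main() {
    int i, n = 16;
    /* dynamic scheduling with chunks of 4 iterations; replace with
       schedule(static) or schedule(guided, 4) to compare behaviours */
    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++)
        printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
    return 0;
}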
Guided Scheduling
● Like dynamic scheduling, but the chunk size starts large and shrinks as the loop progresses, reducing scheduling overhead while still balancing the load.
The private Clause
● The private copies of variable j will be accessible only inside the for loop. The values are undefined
on loop entry and exit.
● Even if j had a value assigned before the parallel for loop, none of the threads can access that value. Similarly, whatever values the threads assign to their private copies of j during the loop, the shared j is not affected.
The firstprivate Clause
● Sometimes we want a private variable to inherit the value of the shared variable.
● The firstprivate clause, with syntax
#pragma omp parallel for private(j) firstprivate(x)
directs the compiler to create private variables whose initial values are identical to the value of the variable controlled by the master thread as the loop is entered.
● Example:
x[0] = complex_function();
#pragma omp parallel for private(j) firstprivate(x)
for (i = 0; i < n; i++) {
    for (j = 1; j < 4; j++)
        x[j] = g(i, x[j-1]);
    answer[i] = x[1] - x[3];
}
● Because x is firstprivate, each thread's private copy of x starts with the values the master thread set before the loop (including x[0] from complex_function()).
● All threads therefore begin with the same initial value but modify their own copies independently.
The lastprivate Clause
● Saves the value of a private variable from the last sequential iteration.
● Useful for preserving final results of a computation.
Syntax:
#pragma omp parallel for lastprivate(x)
Example:
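A sketch consistent with the explanation below, in the style of the firstprivate example: x[j] ends up holding (i+1)^j, so the sequentially last iteration leaves x[3] equal to n³. The array sum_of_powers and variable n_cubed are illustrative names.
double x[4], sum_of_powers[1000], n_cubed;
int i, j, n = 1000;
#pragma omp parallel for private(j) lastprivate(x)
for (i = 0; i < n; i++) {
    x[0] = 1.0;
    for (j = 1; j < 4; j++)
        x[j] = x[j-1] * (i + 1);           /* x[j] = (i+1)^j */
    sum_of_powers[i] = x[0] + x[1] + x[2] + x[3];
}
n_cubed = x[3];                            /* defined outside the loop because x is lastprivate */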
In the sequentially last iteration of the loop, x[3] gets assigned the value n³. To make this value accessible outside the parallel for loop, we must declare x to be a lastprivate variable.
Combining Clauses
● A parallel for pragma can use both firstprivate and lastprivate.
● Useful for initialization and final result capture.
Example:
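A sketch showing both clauses on one pragma, reusing the firstprivate example above (a variable may appear in both a firstprivate and a lastprivate clause):
x[0] = complex_function();
#pragma omp parallel for private(j) firstprivate(x) lastprivate(x)
for (i = 0; i < n; i++) {
    for (j = 1; j < 4; j++)
        x[j] = g(i, x[j-1]);               /* each thread starts from the master's x */
    answer[i] = x[1] - x[3];
}
/* after the loop, the shared x holds the values computed in the last iteration */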
Data Dependencies Between Loop Iterations
Consider a nested loop of the form
for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];
● The outer loop (i) cannot be parallelized because each iteration depends on the previous iteration (a[i-1][j]).
● This introduces data dependencies that force the outer loop to execute serially.
● Parallelizing the inner loop (j) is possible, but it results in high fork-join overhead because the parallel region is entered and exited once for every iteration of the sequential outer loop.
Loop Inversion to Improve Performance
● After inverting the two loops, the j loop becomes the outer loop and can be parallelized, so the fork-join cost is paid only once:
#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
    for (i = 1; i < m; i++)
        a[i][j] = 2 * a[i-1][j];
The single Pragma
● Problem: We need to parallelize the loop over j but ensure that some parts of the work execute only once.
● Solution: #pragma omp single ensures that only one thread executes the enclosed block.
● Syntax: place #pragma omp single immediately before the block (or statement) that only one thread should execute.
● There is an implicit barrier at the end of the single block: all threads must wait there before moving forward, even when unnecessary (the nowait clause removes this barrier).
Race Conditions
● Problem: Multiple threads modify a shared variable such as pi simultaneously, leading to incorrect results.
Why Do Race Conditions Occur?
● In parallel execution, threads operate independently and may interleave unpredictably.
● Example:
○ Thread A reads pi = 0.5.
○ Thread B reads pi = 0.5, computes, and updates pi = 0.7.
○ Thread A updates pi = 0.6, overwriting the update from Thread B.
● Final result is incorrect due to lost updates.
Using Critical Sections to Avoid Race Conditions
● Critical Section ensures only one thread accesses a shared variable at a time.
● OpenMP provides #pragma omp critical for this purpose.
● Advantages:
○ Ensures correctness.
● Disadvantages:
○ Slows down performance due to serialization.
Performance Issues with Critical Sections
● Only one thread executes inside the critical section at a time, reducing parallel efficiency.
● Example performance comparison (computing π by numerical integration): with a critical section, every iteration must enter the critical section to update pi, which largely serializes the loop:
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
    x = (i + 0.5) / n;
    #pragma omp critical
    pi += 4.0 / (n * (1.0 + x * x));
}
The reduction Clause
● The reduction(operator:variable) clause gives each thread a private copy of the variable; the private copies are combined with the operator when the loop finishes.
● Advantages:
○ Eliminates race conditions.
○ More efficient than critical sections.
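A runnable sketch of the π computation using the reduction clause instead of a critical section (the number of intervals n is illustrative):
#include <omp.h>
#include <stdio.h>

int main() {
    int i, n = 1000000;
    double x, pi = 0.0;
    /* reduction(+:pi): every thread accumulates into its own private copy of pi,
       and the copies are summed once when the loop finishes */
    #pragma omp parallel for private(x) reduction(+:pi)
    for (i = 0; i < n; i++) {
        x = (i + 0.5) / n;                 /* midpoint of subinterval i */
        pi += 4.0 / (n * (1.0 + x * x));
    }
    printf("pi is approximately %f\n", pi);
    return 0;
}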
Comparison of Critical Sections vs. Reduction Clause
● A critical section serializes the protected code on every iteration: it guarantees correctness but limits speedup.
● A reduction lets every thread run at full speed on its private copy and pays the combination cost only once, at the end of the loop.
The atomic Directive
● #pragma omp atomic protects a single simple update of a shared variable (e.g., +=, -=, ++) using hardware atomic instructions, which is cheaper than a full critical section.
Syntax:
#pragma omp atomic
var += value;
Example:
#include <omp.h>
#include <stdio.h>

int main() {
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        #pragma omp atomic             /* the update of sum is performed atomically */
        sum += i;
    }
    printf("Final Sum: %d\n", sum);
    return 0;
}
Lock Functions:
● Initialize a lock
● Acquire a lock
● Release a lock
● Destroy a lock
Initialize a lock:
omp_lock_t lock;
omp_init_lock(&lock);
Acquire a lock:
omp_set_lock(&lock);
Release a lock:
omp_unset_lock(&lock);
Destroy a lock:
omp_destroy_lock(&lock);
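A minimal sketch putting the four lock calls together (the shared counter is illustrative):
#include <omp.h>
#include <stdio.h>

int main() {
    int count = 0;
    omp_lock_t lock;
    omp_init_lock(&lock);                  /* 1. initialize */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        omp_set_lock(&lock);               /* 2. acquire: only one thread proceeds */
        count++;                           /*    protected update of shared data */
        omp_unset_lock(&lock);             /* 3. release */
    }
    omp_destroy_lock(&lock);               /* 4. destroy when no longer needed */
    printf("count = %d\n", count);
    return 0;
}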
Performance Consideration
Tasks in OpenMP
{ compute_inverse(ptr-> matrix)
}
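A sketch of a typical use, assuming a linked list of matrices (the struct node type, process_list(), and the compute_inverse() stub are illustrative):
#include <omp.h>

struct node { double *matrix; struct node *next; };

void compute_inverse(double *matrix) { (void)matrix; /* placeholder for real work */ }

void process_list(struct node *head) {
    #pragma omp parallel
    #pragma omp single                          /* one thread walks the list and creates the tasks */
    for (struct node *ptr = head; ptr != NULL; ptr = ptr->next) {
        #pragma omp task firstprivate(ptr)      /* each node becomes an independent task */
        compute_inverse(ptr->matrix);
    }
}                                               /* all tasks finish at the barrier ending the parallel region */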
Task Execution in OpenMP
#include <omp.h>
#include <iostream>

int fib(int n) {
    if (n <= 1) return n;
    int x, y;
    #pragma omp task shared(x)      // compute fib(n-1) as a child task
    x = fib(n - 1);
    #pragma omp task shared(y)      // compute fib(n-2) as a child task
    y = fib(n - 2);
    #pragma omp taskwait            // wait for both child tasks to finish
    return x + y;
}

int main() {
    int n = 10, result;
    #pragma omp parallel
    #pragma omp single              // a single thread spawns the root of the task tree
    result = fib(n);
    std::cout << "Fibonacci(" << n << ") = " << result << std::endl;
    return 0;
}
Pitfalls of Recursive Task Spawning
(1) Excessive Task Creation (Overhead)
● Every recursive call spawns two new tasks, so the number of tasks grows exponentially and the scheduling overhead can outweigh the useful work.
● Solution: use a cutoff so that small subproblems are solved serially and tasks are spawned only for large ones (the cutoff threshold below is illustrative):
if (n < cutoff) {                 /* small problem: compute serially, no task overhead */
    x = fib(n - 1);
    y = fib(n - 2);
} else {                          /* large problem: spawn child tasks */
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
}
(2) Load Imbalance
● Some tasks may finish quickly (e.g., fib(2)) while others take much longer (fib(40)).
● Leads to underutilization of some threads.
Solution: Use dynamic scheduling or balance task workload using task dependencies.
(3) Nested Parallelism Limits
● OpenMP may not create new threads in deeply nested parallel regions due to thread limits.
● The default OMP_NUM_THREADS setting may not provide enough threads.
● Solution: enable nested parallelism explicitly with omp_set_nested(1); (newer OpenMP versions use omp_set_max_active_levels() instead) and raise the thread limit if necessary.