L04 Parallel Programming Models I
Part I
Outline
◼ Parallelism and Types of Parallelism
◼ Parallel Programming Models
❑ Models of Coordination
❑ Program Parallelization
❑ Parallel Programming Patterns
◼ Summary
Task Parallelism
◼ Example:
for (i = 0; i < N; i++)
    a[i] = b[i-1] + c[i];
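Each iteration writes a different element of a and only reads b and c, so the iterations are independent (assuming the arrays do not overlap and b[i-1] stays in bounds). A minimal sketch of running them in parallel, using OpenMP as an illustrative choice:

// Independent iterations: each writes its own a[i] and only reads
// b and c, so OpenMP may hand them to different threads.
// (Assumes a, b, c do not alias; starts at 1 so b[i-1] is valid.)
void add_shifted(int N, double *a, const double *b, const double *c) {
    #pragma omp parallel for
    for (int i = 1; i < N; i++)
        a[i] = b[i-1] + c[i];
}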
Loop Parallelism – aka Data Parallelism
◼ Many algorithms perform computations by iteratively
traversing a large data structure
❑ Commonly expressed as a loop
Same program executed by p processing units; "me" is the processing unit's index (0 to p-1).
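A sketch of this SPMD pattern in C; the function name and the way "me" is obtained are illustrative, the point is that all p units run the same code on different blocks of the data:

// Same program on every processing unit: each computes its own
// block of iterations; me is this unit's index (0 to p-1).
void spmd_loop(int me, int p, int N, double *a,
               const double *b, const double *c) {
    int chunk = (N + p - 1) / p;              // ceiling of N / p
    int lo = me * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;
    for (int i = lo; i < hi; i++)
        a[i] = b[i] + c[i];                   // this unit's share
}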
◼ Further decomposition:
❑ A single task can be executed sequentially by one processing unit, or in parallel by multiple processing units
◼ Properties:
❑ Critical path length: completion time of the longest (slowest) chain of dependent tasks
❑ Degree of concurrency = Total Work / Critical Path Length
◼ An indication of the amount of work that can be done concurrently; e.g., 12 units of total work over a critical path of length 4 gives a degree of concurrency of 3
Task Dependence Graph - Example
◼ Decompositions A and B can be visualized as:
(Figure: task dependence graphs for decompositions A and B)
◼ From implicit to explicit parallelism:
❑ Functional programming languages (e.g., Haskell)
❑ Automatic parallelization
❑ Implicit scheduling vs. explicit scheduling
❑ Implicit mapping (e.g., OpenMP) vs. explicit mapping (e.g., BSPLib)
◼ Granularity of tasks:
❑ Fine-grain: a sequence of instructions
❑ Coarser grain: a sequence of statements, where each statement consists of several instructions
◼ Methodical design (Foster's PCAM methodology):
1. Partitioning
2. Communication
3. Agglomeration
4. Mapping
• Map tasks to processors (cores), with the goal of minimizing total execution time
(Figure: Problem → Partitioning → Communication → Agglomeration → Mapping)
(Example: atmospheric model – a grid over the atmosphere carrying quantities such as wind velocity and sea surface temperature)
◼ Local communication
❑ Task needs data from a small number of other tasks (“neighbors”)
❑ Create channels illustrating data flow
◼ Global communication
❑ A significant number of tasks must contribute data to perform a computation
❑ Don't create channels for them early in the design
(Figure: global communication example – computing the sum of the ith row, with one element held by each of tasks 0 to 7)
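Such a computation is naturally expressed as a collective reduction rather than as hand-built channels; a sketch using MPI (MPI is used here only for illustration, the lecture names BSPLib as one message-passing library):

#include <mpi.h>

// Global communication: each task holds one element of the ith row;
// a single reduction combines the sum into task 0, with no
// task-to-task channels created by hand.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int local = rank;   // stand-in for this task's row element
    int sum = 0;
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}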
◼ Goals:
❑ Improve performance – reduce the cost of task creation and communication (e.g., by reducing the number of sends and receives)
❑ Maintain scalability of the program
❑ Simplify programming
◼ Conflicting goals:
❑ Maximize processor utilization – place tasks on different processing units to increase parallelism
❑ Minimize inter-processor communication – place tasks that communicate frequently on the same processing unit to increase locality
(Example: a 12 × 6 grid problem being decomposed into tasks and mapped onto processing units)
Automatic Parallelization
◼ Drawbacks:
❑ Dependence analysis is difficult for pointer-based computations or
indirect addressing
❑ Execution time of function calls or loops with unknown bounds is
difficult to predict at compile time
◼ Advantages:
❑ New language constructs are not necessary to enable a parallel
execution
◼ Challenge:
❑ Extract the parallelism at the right level of recursion
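To make the recursion-level issue concrete, a sketch with OpenMP tasks (the cutoff value and the use of OpenMP are illustrative assumptions): spawning a task for every call drowns the computation in overhead, so parallelism is extracted only above a cutoff depth.

#include <stdio.h>

// Parallel recursive Fibonacci. Below the cutoff the recursion runs
// sequentially: creating a task for tiny subproblems would cost more
// than the work itself.
long fib(int n) {
    if (n < 2) return n;
    if (n < 20)                     // cutoff: sequential below this
        return fib(n - 1) + fib(n - 2);
    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait            // wait for both subproblems
    return x + y;
}

int main(void) {
    long r;
    #pragma omp parallel
    #pragma omp single              // one thread starts the recursion
    r = fib(32);
    printf("%ld\n", r);
    return 0;
}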
◼ Implementation:
❑ Processes, threads, and any paradigm that makes use of these
concepts
for i ← 0 to n-1
    for j ← 0 to n-1
        c[i, j] ← 0
        for k ← 0 to n-1
            c[i, j] ← c[i, j] + a[i, k] × b[k, j]
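The same loop nest in C with the outermost loop parallelized; since each (i, j) pair writes a distinct c[i][j], the i-iterations are independent (OpenMP here is an illustrative choice):

// n x n matrix multiplication with the outer loop parallelized.
// Each i-iteration writes a disjoint set of c[i][j] entries.
void matmul(int n, double a[n][n], double b[n][n], double c[n][n]) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}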
◼ Master task:
❑ Generally responsible for coordination; performs initializations, timings, and output operations
◼ Worker task:
❑ Waits for instructions from the master task
// Receives data
worker_receive_data(&b, row_a_buffer);
// Performs computations
worker_compute(b, row_a_buffer, result);
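A matching sketch of the master side; like the worker's helpers above, master_send_row and master_receive_result are hypothetical names, not a real API:

// Master: coordinate only - distribute one row of a per worker,
// then collect the computed results.
// (master_send_row / master_receive_result are hypothetical helpers.)
for (int w = 0; w < num_workers; w++)
    master_send_row(w, b, row_of_a[w]);        // hand out work
for (int w = 0; w < num_workers; w++)
    master_receive_result(w, &result[w]);      // gather results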
◼ Advantages:
❑ Useful for adaptive and irregular applications
◼ Tasks can be generated dynamically
❑ Overhead for thread creation is independent of the problem size and the number
of tasks
◼ Disadvantages:
❑ For fine-grained tasks, the overhead of task retrieval and insertion becomes significant relative to the work per task
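A minimal task-pool sketch in C with Pthreads, to make that retrieval/insertion overhead visible; the fixed-size circular buffer and the missing termination protocol are simplifications:

#include <pthread.h>

#define POOL_CAP 128

typedef struct { void (*fn)(void *); void *arg; } task_t;

static task_t tasks[POOL_CAP];
static int head = 0, tail = 0;   // circular buffer (capacity checks omitted)
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

// Insert a task into the pool; this locking is exactly the overhead
// that dominates when tasks are very fine-grained.
void pool_put(void (*fn)(void *), void *arg) {
    pthread_mutex_lock(&lock);
    tasks[tail++ % POOL_CAP] = (task_t){ fn, arg };
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

// Worker loop: each pool thread repeatedly retrieves and runs a task.
void *worker(void *unused) {
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail)                 // pool empty: wait
            pthread_cond_wait(&nonempty, &lock);
        task_t t = tasks[head++ % POOL_CAP];
        pthread_mutex_unlock(&lock);
        t.fn(t.arg);                         // run outside the lock
    }
}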
Example: Java Thread Pool Executor
import java.util.concurrent.*;

class ThreadPoolExample {
    // A fixed pool of 5 worker threads; submitted tasks are
    // executed by the next idle worker.
    ExecutorService pool = Executors.newFixedThreadPool(5);
}

Retrieving an item from a shared buffer (the producer-consumer step a pool worker performs; buffer stands for a shared queue guarded by its monitor):

void consume() throws InterruptedException {
    synchronized (buffer) {
        while (buffer.isEmpty())        // wait until an item is available
            buffer.wait();
        Object item = buffer.remove();  // retrieve an item from the buffer
        buffer.notify();                // a producer blocked on a full
    }                                   // buffer can now continue
}
Pipelining
◼ Data in the application is partitioned into a stream of data elements that flows through the pipeline stages one after another; each stage performs a different processing step
❑ A form of functional parallelism: Stream parallelism
(Figure: a pipeline of p stages T1, T2, …, Tp, each executing a stage program; data elements flow from one stage to the next)
initialize
while (more data) {
    receive data element from previous stage
    perform operation on data element
    send data element to next stage
}
finalize
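One concrete way to realize a stage is as a C program whose previous and next stages are its standard input and output, so that stages compose as OS pipes (./source | ./stage | ./sink); the doubling operation is a placeholder:

#include <stdio.h>

// A single pipeline stage: receive data elements from the previous
// stage (stdin), apply this stage's operation, send them on (stdout).
int main(void) {
    double x;
    while (scanf("%lf", &x) == 1) {  // more data?
        x *= 2.0;                    // this stage's processing step
        printf("%f\n", x);           // pass to the next stage
    }
    return 0;                        // EOF from previous stage: finalize
}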
Summary
◼ Models of Coordination