
Parallel Programming Models – Part I
Lecture 04
Outline
◼ Parallelism and Types of Parallelism
◼ Parallel Programming Models
❑ Models of Coordination
❑ Program Parallelization
❑ Parallel Programming Patterns
◼ Summary

PARALLELISM

What is Parallelism?
◼ Parallelism:
❑ Average number of units of work that can be performed in parallel
per unit time
❑ Example: MIPS, MFLOPS, average number of threads (processes)
per second
◼ Limits in exploiting parallelism
❑ Program dependencies – data dependencies, control dependencies
❑ Runtime – memory contention, communication overheads,
thread/process overhead, synchronization (coordination)
◼ Work = tasks + dependencies

Types of Parallelism
Data parallelism

• Partition the data used in solving the problem


among the processing units; each processing unit
carries out similar operations on its part of the data

Task parallelism

• Partition the tasks in solving the problem among


the processing units

Data Parallelism
◼ Same operation is applied to different elements of a data
set
❑ If operations are independent, elements can be distributed among
cores for parallel execution ➔ data parallelism

◼ SIMD computers / instructions are designed to exploit data parallelism

◼ Example:
for (i = 1; i < N; i++)
    a[i] = b[i-1] + c[i];
Loop Parallelism – aka Data Parallelism
◼ Many algorithms perform computations by iteratively
traversing a large data structure
❑ Commonly expressed as a loop

◼ If the iterations are independent:


❑ Iterations can be executed in arbitrary order and in parallel on
different cores

Example: Parallel For in OpenMP
◼ Iterations of the for loop are executed in parallel by a group of threads
◼ Using OpenMP (Open Multi-Processing): an application programming interface (API) for multi-platform shared-memory multiprocessing programming
// Parallelize the matrix multiplication (result = a x b)
// Each thread will work on one iteration of the outer-most loop
// Variables a, b, result are shared among threads;
// each thread has its own private copy of i, j, k

#pragma omp parallel for num_threads(8) shared(a, b, result) private(i, j, k)
for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
        for (k = 0; k < size; k++)
            result.element[i][j] += a.element[i][k] * b.element[k][j];
Data Parallelism on MIMD
◼ Common model: SPMD (Single Program Multiple Data)
❑ One parallel program is executed by all cores in parallel (both shared
and distributed address space)
◼ Example: Scalar product of x∙y on p processing units

(Figure annotation: the same program is executed by all p processing units; "me" is the processing unit's index, from 0 to p-1.)
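
As the figure's program is not reproduced here, below is a minimal SPMD sketch of the scalar product in C. The use of MPI, the block distribution of x and y, the vector length N, and the local initialization are my assumptions for illustration, not details taken from the slide.

// SPMD scalar product sketch: every process executes this same program;
// only the value of "me" (its rank) differs.
#include <mpi.h>
#include <stdio.h>

#define N 1024                               // global vector length, assumed divisible by p

int main(int argc, char **argv) {
    int me, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);      // me = 0 .. p-1
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local_n = N / p;                     // block of x and y owned by this process
    double x[local_n], y[local_n];
    for (int i = 0; i < local_n; i++) {      // initialize the local blocks
        x[i] = 1.0;
        y[i] = 2.0;
    }

    double local_sum = 0.0;                  // partial scalar product
    for (int i = 0; i < local_n; i++)
        local_sum += x[i] * y[i];

    double global_sum = 0.0;                 // combine the p partial sums
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (me == 0)
        printf("x . y = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}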

Task (Functional) Parallelism
◼ Independent program parts (tasks) can be executed in
parallel
❑ task (functional) parallelism

◼ Tasks: single statement, series of statements, loops, or function calls

◼ Further decomposition:
❑ A single task can be executed sequentially by one processing unit,
or in parallel by multiple processing units

Example: Task Parallelism
◼ Consider the database query:
Model =“civic” AND Year = “2001” AND
(Color = “green” OR Color = “white”)

(Taken from "Introduction to Parallel Computing" [1])


Example: Decomposition A

Example: Decomposition B

Task Dependence Graph
◼ Can be used to visualize and evaluate the task decomposition
strategy
◼ A directed acyclic graph:
❑ Node: represents a task; the node value is the expected execution time
❑ Edge: represents a control dependency between tasks

◼ Properties:
❑ Critical path length: Maximum (slowest) completion time
❑ Degree of concurrency = Total Work / Critical Path Length
◼ An indication of the amount of work that can be done concurrently
Task Dependence Graph - Example
◼ Decompositions A and B can be visualized as task dependence graphs:
❑ Decomposition A: critical path = Task 4 → 6 → 7, critical path length = 27, degree of concurrency = 63 / 27 = 2.33
❑ Decomposition B: critical path = Task 1 → 5 → 6 → 7, critical path length = 34, degree of concurrency = 64 / 34 = 1.88

Poll: Data vs Task Parallelism: http://pollev.com/ccris
◼ Suppose we have 60 assignment scripts, each with 15 questions, to be distributed to 3 TAs for marking. Is each of the following schemes task or data parallel?
❑ Scheme 1: each TA marks 20 scripts (all 15 questions)
❑ Scheme 2: each TA marks one block of questions (1-5, 6-10, or 11-15) across all 60 scripts


Representation of Parallelism
◼ Programming environments expose different amounts of parallelism to the coder; from most implicit to most explicit:
❑ Implicit parallelism – automatic parallelization, functional programming languages (e.g., Haskell)
❑ Explicit parallelism with implicit scheduling – e.g., OpenMP
❑ Explicit scheduling with implicit mapping – e.g., BSPLib
❑ Explicit mapping with implicit communication and synchronization – e.g., Linda
❑ Explicit communication and synchronization – e.g., MPI, Pthreads

MODELS OF COORDINATION

Overheads of Parallelism
◼ Given enough parallel work, overheads are the biggest barrier
to getting desired speedup (improvement in performance)
❑ cost of starting a parallel task
❑ manage and coordinate large number of inter-processor/task
interactions
◼ Overheads can be in the range of milliseconds (= millions of
flops) on some systems

Models of Coordination (Communication)
◼ Shared address space
◼ Data parallel
◼ Message passing

Shared Address Space
◼ Communication abstraction
❑ Tasks communicate by reading/writing from/to shared variables
❑ Ensure mutual exclusion via use of locks
❑ Logical extension of uniprocessor programming
◼ Requires hardware support to implement efficiently
❑ Any processor can load and store from any address – contention
❑ Even with NUMA, costly to scale
◼ Matches shared memory systems – UMA, NUMA, etc.
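
As a minimal sketch of this model (my own example, assuming POSIX threads), two threads communicate through a shared counter and use a lock for mutual exclusion:

// Shared address space sketch: threads communicate through the shared
// variable "counter"; a mutex provides the mutual exclusion mentioned above.
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                        // shared variable
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);              // enter critical section
        counter++;                              // read-modify-write on shared data
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);         // 200000 with correct locking
    return 0;
}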

Data Parallel
◼ Historically: same operation on each element of an array
❑ SIMD, vector processors
◼ Basic structure: map a function onto a large collection of data
❑ Functional: side-effect-free execution
❑ No communication among distinct function invocations
◼ Allows invocations to be scheduled in parallel
❑ Stream programming model
◼ Modern performance-oriented data-parallel languages do not
enforce this structure any more
❑ CUDA, OpenCL, ISPC
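
A minimal sketch of the map structure (my own example, using an OpenMP loop in place of a dedicated data-parallel language): a side-effect-free function is applied independently to every element of a collection, so the invocations may be scheduled in parallel in any order.

// Data-parallel "map" sketch (assumes OpenMP; compile with -fopenmp or similar).
#include <stdio.h>

static float square(float x) { return x * x; }   // side-effect-free function

int main(void) {
    enum { N = 8 };
    float in[N]  = {1, 2, 3, 4, 5, 6, 7, 8};
    float out[N];

    #pragma omp parallel for                     // map square() over the collection
    for (int i = 0; i < N; i++)
        out[i] = square(in[i]);

    for (int i = 0; i < N; i++)
        printf("%.0f ", out[i]);
    printf("\n");
    return 0;
}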

Message passing
◼ Tasks operate within their own private address spaces
❑ Tasks communicate by explicitly sending/receiving messages
◼ Popular software library: MPI (message passing interface)
◼ Hardware does not implement system-wide loads and stores
❑ Can connect commodity systems together to form large parallel
machine
◼ Matches distributed memory systems
❑ Programming model for clusters, supercomputers, etc.
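
A minimal message-passing sketch (my own example, assuming MPI and at least two processes): each process keeps its own private address space, and the only way data moves between them is an explicit send matched by an explicit receive.

// Message-passing sketch: run with at least 2 MPI processes.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;                          // exists only in rank 0's memory
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);   // data arrived via a message
    }

    MPI_Finalize();
    return 0;
}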

Coordination and Hardware
◼ Shared address space matches shared memory systems –
UMA, NUMA, etc.
◼ Message passing matches distributed memory systems
❑ Programming model for clusters, supercomputers, etc.
◼ But any type of coordination can be implemented in any
hardware

Correspondence with Hardware Implementations
◼ It is common to implement message-passing abstractions on
shared memory machines (hardware)
❑ “Sending a message” means copying data into message library buffers
❑ “Receiving a message” means copying data from message library buffers
◼ It is possible to implement shared address space abstraction
on machines that do not support it in hardware
❑ Less efficient software solutions
❑ Modify a shared variable: send messages to invalidate all (mem)
pages containing the shared variable
❑ Reading a shared variable: page-fault handler issues appropriate
network requests (messages)

Summary of Coordination Models
◼ Shared address space: very little structure
❑ All threads can read and write to all shared variables
❑ Drawback: not all reads and writes have the same cost (and that cost
is not apparent in program text)
◼ Data-parallel: very rigid computation structure
❑ Programs perform the same function on different data elements in a
collection
◼ Message passing: highly structured communication
❑ All communication occurs in the form of messages

PROGRAM PARALLELIZATION

Program Parallelization
◼ Parallelization: transform sequential into parallel computation
❑ Define parallel tasks of the appropriate granularity

◼ Granularity of computation can be:

Fine-grain to coarse-grain:
❑ A sequence of instructions (fine-grain)
❑ A sequence of statements, where each statement consists of several instructions
❑ A function / method, which consists of several statements (coarse-grain)
Foster’s Design Methodology
1. Partitioning

• First partition a problem into many smaller pieces, or tasks

2. Communication

• Provide the data required by the partitioned tasks (the cost of parallelism)

3. Agglomeration

• Decrease communication and development costs, while maintaining flexibility

4. Mapping

• Map tasks to processors (cores), with the goal of minimizing total execution time

Foster’s Methodology
(Figure: the problem is first partitioned; partitioning feeds communication, communication feeds agglomeration, and agglomeration feeds mapping.)

1. Partitioning
◼ Divide computation and data into independent pieces to
discover maximum parallelism
❑ Different way of thinking about problems – reveals structure in a problem,
and hence opportunities for optimization
Data-centric: Domain decomposition (data parallelism)
• Divide data into pieces of approximately equal size
• Determine how to associate computations with the data

Computation-centric: Functional decomposition (task parallelism)
• Divide computation into pieces (tasks)
• Determine how to associate data with the computations

Example: Domain Decompositions

◼ Three ways to decompose a 3-D matrix (a common data structure) of 72 grid points:
❑ 24 tasks, each task with 3 grid points
❑ 6 tasks, each task with 12 grid points
❑ 1 task with 72 grid points


Example: Functional Decomposition

(Figure: a computer model of climate decomposed by function into an atmospheric model, a hydrology model, an ocean model, and a land surface model, which exchange data such as wind velocity and sea surface temperature.)

Partitioning Rules of Thumb
◼ At least 10x more primitive tasks than cores in target
computer

◼ Minimize redundant computations and redundant data storage

◼ Primitive tasks roughly of the same size

◼ Number of tasks an increasing function of problem size

2. Communication (Coordination)
◼ Tasks are intended to execute in parallel
❑ but generally not executing independently
➔ Need to determine data passed among tasks

◼ Local communication
❑ Task needs data from a small number of other tasks (“neighbors”)
❑ Create channels illustrating data flow

◼ Global communication
❑ Significant number of tasks contribute data to perform a computation
❑ Don’t create channels for them early in design

◼ Ideally, distribute and overlap computation and communication

Local Communication
◼ 2-D Finite Difference Computation
◼ 2-D grid: updating each element at time t+1 requires five values from time t:

    X_{i,j}^{(t+1)} = ( 4·X_{i,j}^{(t)} + X_{i-1,j}^{(t)} + X_{i+1,j}^{(t)} + X_{i,j-1}^{(t)} + X_{i,j+1}^{(t)} ) / 8

    (i indexes the row, j the column)
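
A direct sequential transcription of this update rule (my sketch; the grid size GRID and the use of two separate buffers are assumptions):

// Five-point update from the formula above: cur holds time t, next receives t+1.
#define GRID 16

void step(double cur[GRID][GRID], double next[GRID][GRID]) {
    for (int i = 1; i < GRID - 1; i++)           // interior points only
        for (int j = 1; j < GRID - 1; j++)
            next[i][j] = (4.0 * cur[i][j]
                          + cur[i - 1][j] + cur[i + 1][j]
                          + cur[i][j - 1] + cur[i][j + 1]) / 8.0;
}

In a parallel decomposition, each task owns a block of the grid and only needs the boundary values of its neighbouring tasks, which is what makes this communication local.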

Global Communication
◼ Unoptimized sum of N numbers distributed among N (= 8) tasks needs O(N) time
(Figure: centralised summation algorithm – a single task S receives and accumulates the values from tasks 0 to 7 one after another.)


◼ Algorithm is:
❑ Centralised – does not distribute computation and communication
❑ Sequential – does not allow overlap of computation and
communication operations

Communication Rules of Thumb
◼ Communication operations balanced among tasks

◼ Each task communicates with only a small group of neighbors

◼ Tasks can perform communication in parallel

◼ Overlap computation with communication

3. Agglomeration
◼ Combine tasks into larger tasks
❑ Number of tasks >= number of cores

◼ Goals:
❑ Improve performance (cost of task creation + communication)
❑ Maintain scalability of program
❑ Simplify programming

Motivation of Agglomeration
◼ Eliminate communication between primitive tasks that are agglomerated into a consolidated task

◼ E.g., combining groups of sending and receiving tasks reduces the number of sends and receives

Examples of Agglomeration
◼ Reduce dimension of decomposition from 3 to 2

◼ 3-D decomposition (adjacent tasks are combined)

◼ Divide-and-conquer – sub-tree are coalesced

◼ Tree algorithm – nodes are combined

Task Granularity: Impact on Communication
◼ 2-D 8 x 8 grid problem

a. Fine-grain task partition – one grid point per task:
❑ 8 x 8 = 64 tasks
❑ 64 x 4 x 2 = 512 data transfers (messages)

b. Coarse-grain task partition – each task is a 4 x 4 block with a total of 16 grid points:
❑ 2 x 2 = 4 tasks
❑ 4 x 4 x 2 = 32 data transfers (messages)

Agglomeration Rules of Thumb
◼ Locality of parallel algorithm has increased

◼ Number of tasks increases with problem size

◼ Number of tasks suitable for likely target systems

◼ Tradeoff between agglomeration and code-modification costs is reasonable

4. Mapping
◼ Assignment of tasks to execution units

◼ Conflicting goals:
❑ Maximize processor utilization – place tasks on different
processing units to increase parallelism
❑ Minimize inter-processor communication – place tasks that communicate frequently on the same processing unit to increase locality

◼ Mapping may be performed by:


❑ OS for centralized multiprocessor
❑ User for distributed memory systems
Mapping Example
(Figure: a 12 x 6 grid problem mapped onto six processors, processor 0 through processor 5.)

◼ Goal: the same amount of work on each processing unit, while minimizing off-processor communication
Mapping Example
(Figures: a. the task/channel graph; b. its mapping onto three processors.)

Mapping Rules of Thumb
◼ Finding an optimal mapping is NP-hard in general
❑ Must rely on heuristics

◼ Consider designs based on one task per core and multiple tasks per core

◼ Evaluate static and dynamic task allocation


❑ If dynamic task allocation is chosen, the task allocator should not be
a bottleneck to performance
❑ If static task allocation is chosen, the ratio of tasks to cores is at least
10:1
Foster’s Design Methodology
(Side diagram: Sequential Algorithm –decompose→ Tasks –schedule→ Processes or Threads –map→ Physical Cores & Processors)

1. Partitioning
• First partition a problem into many smaller pieces, or tasks

2. Communication
• Provide the data required by the partitioned tasks (the cost of parallelism)

3. Agglomeration
• Decrease communication and development costs, while maintaining flexibility

4. Mapping
• Map tasks to processors (cores), with the goal of minimizing total execution time
Automatic Parallelization
◼ Parallelizing compilers perform decomposition and scheduling

◼ Drawbacks:
❑ Dependence analysis is difficult for pointer-based computations or
indirect addressing
❑ Execution time of function calls or loops with unknown bounds is
difficult to predict at compile time

Functional Programming Languages
◼ Describe the computations of a program as the evaluation of
mathematical functions without side effects

◼ Advantages:
❑ New language constructs are not necessary to enable a parallel
execution

◼ Challenge:
❑ Extract the parallelism at the right level of recursion

PARALLEL PROGRAMMING PATTERNS

Overview
◼ A parallel programming pattern provides a coordination
structure for tasks:
❑ Similar to design pattern from Software Engineering
❑ Not mutually exclusive, use the best match to describe your solution
design
◼ Examples
❑ Fork-Join
❑ Parbegin-Parend
❑ SPMD and SIMD
❑ Master-Worker (Master-Slave)
❑ Task pool
❑ Producer-Consumer
❑ Pipelining

Fork-Join
◼ Task T creates child tasks
❑ Children run in parallel, but they are independent of each other
❑ The children can execute the same or a different program part, or
function
❑ Children might join the parent at different times

◼ Implementation:
❑ Processes, threads, and any paradigm that makes use of these
concepts
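
A minimal fork-join sketch (my own example, assuming POSIX threads): the parent forks two children that run different functions and joins them afterwards.

// Fork-join sketch: the parent task forks two independent children and joins them.
#include <pthread.h>
#include <stdio.h>

static void *child_a(void *arg) { printf("child A running\n"); return NULL; }
static void *child_b(void *arg) { printf("child B running\n"); return NULL; }

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, child_a, NULL);     // fork: children run in parallel
    pthread_create(&b, NULL, child_b, NULL);     // children may run different code
    pthread_join(a, NULL);                       // join: possibly at different times
    pthread_join(b, NULL);
    printf("parent continues after both joins\n");
    return 0;
}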

Example: Database Query (A)
P1 = Fork {
P3 = Fork { return Model = "civic" }
P4 = Fork { return Year = "2001" }
Join P3, P4
Return P3 AND P4
}
P2 = Fork {
P5 = Fork { return Color = "green" }
P6 = Fork { return Color = "white" }
Join P5, P6
Return P5 OR P6
}
Join P1, P2
Return P1 AND P2

Parbegin–Parend
◼ Programmer specifies a sequence of statements (function calls) to
be executed by a set of cores in parallel
❑ When an executing thread reaches a parbegin–parend construct, a set
of threads is created and the statements of the construct are assigned to
these threads for execution
❑ Usually, the threads execute the same code (function)
◼ The statements following the parbegin–parend construct are
only executed after all these threads have finished their work
◼ Like a fork-join pattern, where all forks are done at the same time,
and all joins are done at the same time
◼ Implementation:
❑ A language construct such as OpenMP or compiler directives
Matrix Multiplication

for i ← 0 to n-1
for j ← 0 to n-1
c[i, j] ← 0
for k ← 0 to n-1
c[i, j] ← c[i, j] + a[i, k] x b[k, j]

Example: Parallel For in OpenMP
◼ Iterations of the for loop are executed in parallel by a group of threads

// Parallelize the matrix multiplication (result = a x b)
// Each thread will work on one iteration of the outer-most loop
// Variables a, b, result are shared among threads;
// each thread has its own private copy of i, j, k

#pragma omp parallel for shared(a, b, result) private(i, j, k)
for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
        for (k = 0; k < size; k++)
            result.element[i][j] += a.element[i][k] * b.element[k][j];

SIMD
◼ Single instructions are executed synchronously by the
different threads on different data
❑ Similar to parbegin-parend, but the threads execute synchronously
(all threads execute the same instruction at the same time)
◼ Implementation:
❑ AVX/SSE Instruction on Intel processor

xmm registers are 128 bits long; an SSE instruction treats an xmm register as 4 individual 32-bit floating-point values
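
As a small illustration (my own example, assuming an x86 compiler with SSE intrinsics), one SSE instruction adds four packed 32-bit floats held in xmm registers:

// SIMD sketch: _mm_add_ps performs 4 float additions with one instruction.
#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);     // one 128-bit xmm value
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128 c = _mm_add_ps(a, b);                       // 4 additions, 1 instruction

    float out[4];
    _mm_storeu_ps(out, c);
    printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]);   // 11 22 33 44
    return 0;
}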

SPMD
◼ Same program executed on different cores but operate on
different data
❑ Different threads may execute different parts of the parallel program
because of
◼ Different speeds of the executing cores
◼ Control statement in the program, e.g., If statement
❑ Similar to parbegin-parend, but SPMD is the preferred name when the execution does not strictly follow that pattern
◼ No implicit synchronization
❑ Synchronization can be achieved by explicit synchronization operations
◼ Implementation:
❑ Programs running on GPGPU
Master–Worker (previously, Master–Slave)
◼ A single program (master) controls the execution of the
program
❑ Master executes the main function
❑ Assigns work to worker threads

◼ Master task:
❑ Generally responsible for coordination; performs initializations, timings, and output operations
◼ Worker task:
❑ Wait for instruction from master task

Matrix Multiplication – Master-Worker
// nprocs is local; myid, size (and workers, used below) are assumed to be
// file-scope variables in the full program
int main(int argc, char **argv)
{
    int nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    size = 2048;
    // One master (rank = 0) and nprocs-1 workers
    if (myid == 0) {
        master();
    } else {
        worker();
    }
    MPI_Finalize();
    return 0;
}

Matrix Multiplication – Master-Worker
void master()
{
    matrix a, b, result;

    // Allocate memory for matrices
    allocate_matrix(&a);
    allocate_matrix(&b);
    allocate_matrix(&result);

    // Initialize matrix elements
    init_matrix(a);
    init_matrix(b);

    // Distribute data to workers
    master_distribute(a, b);

    // Gather results from workers
    master_receive_result(result);

    // Print the result matrix
    print_matrix(result);
}
Matrix Multiplication – Master-Worker
void worker()
{
    int rows_per_worker = size / workers;
    float row_a_buffer[rows_per_worker][size];
    matrix b;
    float result[rows_per_worker][size];

    // Receive data from the master
    worker_receive_data(&b, row_a_buffer);

    // Perform the computation on this worker's rows
    worker_compute(b, row_a_buffer, result);

    // Send the results to the master
    worker_send_result(result);
}

Task (Work) Pools
◼ A common data structure from which threads retrieve tasks for execution

◼ Number of threads is fixed


❑ Threads are created statically by the main thread
❑ Once a task is finished, the worker thread retrieves another task from the pool
❑ Work is not pre-allocated to the worker threads; instead, a new task is retrieved
from the pool by the worker thread
◼ During the processing of a task, a thread can generate new tasks and
insert them into the task pool

Task (Work) Pools
◼ Access to the task pool must be synchronized to avoid race conditions

◼ Execution of a parallel program is completed when


❑ Task pool is empty
❑ Each thread has terminated the processing of its last task

◼ Advantages:
❑ Useful for adaptive and irregular applications
◼ Tasks can be generated dynamically
❑ Overhead for thread creation is independent of the problem size and the number
of tasks

◼ Disadvantages:
❑ For fine-grained tasks, the overhead of retrieval and insertion of tasks becomes
important
Example: Java Thread Pool Executor
import java.util.concurrent.*;

class ThreadPoolExample {

    public static void main(String[] args) {
        // Create a pool with 5 threads
        ExecutorService executor = Executors.newFixedThreadPool(5);

        // Add 10 tasks to the pool
        for (int i = 0; i < 10; i++) {
            Runnable task = new Task( ..... );
            executor.execute(task);
        }
        ......
    }
}

◼ The executor will assign tasks to the 5 threads:
❑ After a thread finishes its task, another task from the pool will be assigned to it

Producer–Consumer
◼ Producer threads produce data which are used as input by
consumer threads

◼ Synchronization has to be used to ensure correct coordination between producer and consumer threads
Producer–Consumer: Shared Buffers
void produce() {
    synchronized (buffer) {
        while (buffer is full)
            buffer.wait();
        Store an item to buffer;
        if (buffer was empty)
            buffer.notify();
    }
}

void consume() {
    synchronized (buffer) {
        while (buffer is empty)
            buffer.wait();
        Retrieve an item from buffer;
        if (buffer was full)
            buffer.notify();
    }
}
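
The same protocol written out as a compilable sketch (my assumptions: a fixed-size circular buffer, with Pthreads mutex and condition variables standing in for synchronized/wait/notify):

// Producer-consumer buffer: wait while full/empty, signal after changing the state.
#include <pthread.h>

#define CAP 8

static int buf[CAP];
static int count = 0, head = 0, tail = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

void produce(int item) {
    pthread_mutex_lock(&m);
    while (count == CAP)                     // buffer is full: wait
        pthread_cond_wait(&not_full, &m);
    buf[tail] = item;                        // store an item to buffer
    tail = (tail + 1) % CAP;
    count++;
    pthread_cond_signal(&not_empty);         // wake a waiting consumer
    pthread_mutex_unlock(&m);
}

int consume(void) {
    pthread_mutex_lock(&m);
    while (count == 0)                       // buffer is empty: wait
        pthread_cond_wait(&not_empty, &m);
    int item = buf[head];                    // retrieve an item from buffer
    head = (head + 1) % CAP;
    count--;
    pthread_cond_signal(&not_full);          // wake a waiting producer
    pthread_mutex_unlock(&m);
    return item;
}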
Pipelining
◼ Data in the application is partitioned into a stream of data
elements that flows through the pipeline stages one after the
other to perform different processing steps
❑ A form of functional parallelism: Stream parallelism

(Figure: a pipeline of stages T1, T2, …, Tp between program A and program B.)

initialize
while (more data) {
    receive data element from previous stage
    perform operation on data element
    send data element to next stage
}
finalize
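
The stage skeleton above could be realized, for example, with MPI, one process per stage; the data values, the placeholder operation, and the fixed element count NELEM below are my assumptions:

// Pipeline sketch: stage "rank" receives from rank-1 and sends to rank+1.
#include <mpi.h>

#define NELEM 100                                   // number of elements in the stream

int main(int argc, char **argv) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    for (int n = 0; n < NELEM; n++) {
        double elem;
        if (rank == 0)
            elem = (double)n;                       // first stage generates the data
        else
            MPI_Recv(&elem, 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        elem = elem * 2.0 + rank;                   // this stage's operation (placeholder)

        if (rank < p - 1)                           // last stage keeps the result
            MPI_Send(&elem, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}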
Summary
◼ Models of Communication

◼ Types and representation of parallelism

◼ Foster’s methodology for program parallelization

◼ Main parallel programming patterns

References
◼ [1] Introduction to Parallel Computing, by Grama, Gupta, Karypis, Kumar
❑ http://www-users.cs.umn.edu/~karypis/parbook/

