
Parallel and Distributed Computing

UCS645
Module 3
Saif Nalband
Contents
Parallel Decomposition and Parallel Performance:
Principles of Parallel Algorithm Design: Decomposition Techniques,
Characteristics of Tasks and Interactions, Mapping Techniques for Load
Balancing. Critical Paths, Sources of Overhead in Parallel Programs,
Performance metrics for parallel algorithm implementations,
Performance measurement, The Effect of Granularity on Performance.
Basic Terminology
● Fragment: sequence of executed instructions.
● Two executing fragments may proceed in parallel in:
○ Complete synchrony
○ Complete independence
○ Occasional synchrony
● Task: a sequence of execution, or more accurately, instructions retired
sequentially
● Granularity: the number of steps in a task relative to those in the
complete parallel program.
Basics
● Fine-grained: a decomposition into a large number of small tasks
● Coarse-grained: a decomposition into a small number of large tasks
● Coarse-grained tasks are relatively longer; fine-grained tasks are shorter
Computational model
Efficiency: two metrics
● Asymptotic analysis
● Concrete: how well the algorithm implementation behaves on the
available hardware and data size

● Asymptotic analysis oversimplifies several complex dynamics (such as
caches, out-of-order execution on multiple engines, and instruction
dependencies).

● Abstract part of the analysis: we employ big-O notation to describe the
number of steps an algorithm takes as a function of the input size n and
the number of processors p.
SIMPLE Parallel Model
RAM: sequential model; each step takes unit time.
● A parallel system consists of p sequential processors; p is variable and
may be chosen as a function of n.
● Each processor has access to an unbounded number of constant-sized
local memory locations, which are not accessible to other processors.
● Each processor can read from or write to any local memory location
in unit time.
● Communicating a constant-sized message from processor i to
processor j takes unit time.
● Each processor takes unit time to perform simple arithmetic and
logical operations on constant-sized operands.
Shortcomings of the SIMPLE Parallel Model
● The time taken by the network for message transmission is not modeled,
and the cost of synchronization is ignored.
● Instead it assumes that if a message addressed to processor i is sent
by some other processor, it arrives instantaneously and
processor i spends one time unit reading it.
● Only the time spent reading is counted.
● It corresponds to a distributed-memory model.
Bulk Synchronous Parallel (BSP) Model
It provides a structured approach to parallel computation by dividing
execution into supersteps, each followed by a synchronization barrier.
Key Components of BSP
● Processors (or Workers)

○ Each processor performs computations independently on local data.


● Communication Network
○ Processors exchange messages during computation.
● Barrier Synchronization
○ After each superstep, all processors synchronize before proceeding.
BSP Execution Model
A BSP computation consists of a sequence of supersteps, each with three
phases:
1. Computation Phase
1. Each processor performs local computations on its data.
2. Can send/receive messages to/from other processors.
2. Communication Phase
1. Messages are exchanged between processors.
2. Messages sent in superstep S are available only in superstep S+1.
3. Barrier Synchronization
1. All processors wait until every other processor finishes its computation and
communication.
2. Ensures no processor moves ahead before others complete the superstep.
BSP Cost Model Formula
The total time for a BSP computation with S supersteps is:
T = Σ_{i=1..S} (w_i + g·h_i + l)
For a single superstep, the time is calculated as:
T_superstep = w + g·h + l
where w is the maximum local computation performed by any processor, h is the
maximum number of messages sent or received by any processor, g is the cost of
communicating one message, and l is the barrier synchronization cost (the
standard BSP cost parameters).
Example: Dot Product of Two Vectors
● Problem: Compute the dot product of two vectors a and b of
size N using p processors.
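The accompanying example file is not reproduced here. As an illustration, a minimal MPI sketch of this computation (vector length, data values, and variable names are assumptions) follows the BSP pattern: a local-computation superstep producing partial sums, then a communication-and-synchronization step that combines them at rank 0.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000  /* total vector length (assumed divisible by p) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local_n = N / p;                 /* each processor owns N/p elements */
    double *a = malloc(local_n * sizeof(double));
    double *b = malloc(local_n * sizeof(double));
    for (int i = 0; i < local_n; i++) {  /* fill local blocks with sample data */
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* Superstep 1 (computation): each processor computes its local partial sum */
    double local_sum = 0.0;
    for (int i = 0; i < local_n; i++)
        local_sum += a[i] * b[i];

    /* Superstep 2 (communication + barrier): partial sums are combined at rank 0 */
    double dot = 0.0;
    MPI_Reduce(&local_sum, &dot, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product = %f\n", dot);

    free(a); free(b);
    MPI_Finalize();
    return 0;
}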
Parallel RAM (PRAM)
● The Parallel Random Access Machine (PRAM) model is a theoretical
framework for designing and analyzing parallel algorithms.
● It generalizes the Random Access Machine (RAM) model by introducing
multiple processors that can synchronously access a shared memory.
Below is an explanation of the PRAM model.
Key Features of PRAM
1. Processors: The PRAM consists of an unbounded collection of
processors (P0, P1, P2, …). Each processor has its own
local memory and knows its unique index.
2. Shared Memory: All processors can access a global shared memory in
unit time. Inputs and outputs are stored in shared memory cells.
3. Execution Model: Each processor executes instructions synchronously
in three phases:
■ Read: Read data from shared memory.
■ Compute: Perform local computations.
■ Write: Write results to shared memory.
4. Idealization: The PRAM assumes unlimited processors and ignores
practical concerns like communication delays, synchronization
overhead, or memory access latency.
Memory Access Conflict Models
To address simultaneous access to the same memory location, PRAM
defines four submodels:
● Exclusive Read Exclusive Write (EREW): No two processors can read or
write to the same memory cell simultaneously.
● Concurrent Read Exclusive Write (CREW): Multiple processors can read
the same memory cell simultaneously, but only one processor can write
at a time.
● Exclusive Read Concurrent Write (ERCW): Multiple processors can write
to the same memory cell simultaneously, but no two processors can read
it at the same time.
Concurrent Read Concurrent Write (CRCW):
• Both concurrent reads and writes are allowed
• Variants include:
• Common: All processors writing must write the same value.
• Arbitrary: One processor’s value is arbitrarily chosen.
• Priority: The processor with the lowest index writes its value
Example: Dot product of two arrays
● The dot product of two arrays in the PRAM model leverages parallel
computation and shared memory to achieve efficient results. Below is a
step-by-step example using the EREW PRAM model (Exclusive
Read/Exclusive Write), where no two processors access the same
memory cell simultaneously.
● Problem Setup
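The problem-setup figure is not reproduced here. As an illustration of the EREW-style algorithm, the sketch below (array contents and names are assumptions) computes the pairwise products in parallel and then combines them in log2(N) rounds; within each round every active processor reads and writes distinct cells, so no concurrent access to the same location occurs.

#include <stdio.h>
#include <omp.h>

#define N 8  /* assume N is a power of two */

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    double partial[N];

    /* Step 1: processor i computes a[i]*b[i] into its own cell (exclusive writes) */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        partial[i] = a[i] * b[i];

    /* Step 2: log2(N) combining rounds; in each round processor i reads cells
       i and i+stride and writes cell i, so reads and writes never collide */
    for (int stride = 1; stride < N; stride *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < N; i += 2 * stride)
            partial[i] += partial[i + stride];
    }

    printf("dot product = %f\n", partial[0]);  /* result ends up in cell 0 */
    return 0;
}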
BSP VS PRAM
Creating a parallel program
▪ Your thought process:
1. Identify work that can be performed in parallel
2. Partition work (and also data associated with the work)
3. Manage data access, communication, and synchronization

▪ A common goal is maximizing speedup*

For a fixed computation:

Speedup(P processors) = Time(1 processor) / Time(P processors)

* Other goals include achieving high efficiency (cost, area, power, etc.) or
working on bigger problems than can fit on one machine
Problem decomposition
▪ Break up problem into tasks that can be carried out in parallel
▪ In general: create at least enough tasks to keep all execution
units on a machine busy

Key challenge of decomposition: identifying dependencies (or... a lack of dependencies)
Amdahl’s Law: dependencies limit maximum
speedup due to parallelism

▪ You run your favorite sequential program...

▪ Let S = the fraction of sequential execution that is inherently


sequential (dependencies prevent parallel execution)

▪ Then maximum speedup due to parallel execution ≤ 1/S


A simple example
▪ Consider a two-step computation on an N x N image
- Step 1: multiply brightness of all pixels by two
(independent computation on each pixel)
- Step 2: compute average of all pixel values

▪ Sequential implementation of program
- Both steps take ~N² time, so total time is ~2N²

(Diagram: execution time vs. parallelism for the sequential program; each step
runs at parallelism 1 and takes N² time.)
First attempt at parallelism (P processors)
▪ Strategy:
- Step 1: execute in parallel
- time for phase 1: N²/P
- Step 2: execute serially
- time for phase 2: N²

▪ Overall performance:
- Speedup ≤ 2

(Diagram: the parallel program runs step 1 at parallelism P for N²/P time,
followed by step 2 at parallelism 1 for N² time.)
Parallelizing step 2
▪ Strategy:
- Step 1: execute in parallel
- time for phase 1: N²/P
- Step 2: compute partial sums in parallel, combine results serially
- time for phase 2: N²/P + P

▪ Overall performance:
- Speedup = 2N² / (2N²/P + P)
- Overhead of parallel algorithm: combining the partial sums
- Note: speedup → P when N >> P

(Diagram: the parallel program runs both phases at parallelism P, with a short
serial combine at the end.)
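A minimal OpenMP sketch of this two-phase strategy (image size, data, and variable names are assumptions, not taken from the slides): each thread doubles its share of the pixels, accumulates a private partial sum, and the P partial sums are combined serially at the end.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1024  /* image is N x N */

int main(void) {
    double *img = malloc((size_t)N * N * sizeof(double));
    for (long i = 0; i < (long)N * N; i++)
        img[i] = 1.0;

    int P = omp_get_max_threads();
    double *partial = calloc(P, sizeof(double));

    #pragma omp parallel
    {
        int t = omp_get_thread_num();

        /* Phase 1: brighten pixels in parallel (time ~ N^2 / P) */
        #pragma omp for
        for (long i = 0; i < (long)N * N; i++)
            img[i] *= 2.0;

        /* Phase 2a: each thread computes a partial sum (time ~ N^2 / P) */
        double s = 0.0;
        #pragma omp for
        for (long i = 0; i < (long)N * N; i++)
            s += img[i];
        partial[t] = s;
    }

    /* Phase 2b: combine the P partial sums serially (time ~ P) */
    double sum = 0.0;
    for (int t = 0; t < P; t++)
        sum += partial[t];

    printf("average = %f\n", sum / ((double)N * N));
    free(img); free(partial);
    return 0;
}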
Amdahl’s law
▪ Let S = the fraction of total work that is inherently
sequential
▪ Max speedup on P processors given by:
speedup ≤ 1 / (S + (1 − S)/P)

(Plot: max speedup vs. number of processors for S = 0.01, 0.05, and 0.1; each
curve saturates at 1/S.)
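Worked example (illustrative numbers): with S = 0.1 and P = 8, max speedup = 1 / (0.1 + 0.9/8) ≈ 4.7, and even with unlimited processors the speedup cannot exceed 1/S = 10.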
Decomposition
▪ Who is responsible for decomposing a program into independent
tasks?
- In most cases: the programmer

▪ Automatic decomposition of sequential programs continues to be a


challenging research problem (very difficult in the general case)
- What if dependencies are data dependent (not known at compile time)?
- Compiler must analyze program, identify dependencies
- Researchers have had modest success with simple loop nests
- The “magic parallelizing compiler” for complex, general-purpose code has not
yet been achieved
Assignment
▪ Assigning tasks to workers
- Think of “tasks” as things to do
- What are “workers”? (Might be threads, program instances, vector
lanes, etc.)

▪ Goals: achieve good workload balance, reduce communication costs

▪ Can be performed statically (before application is run), or


dynamically as program executes

▪ Although programmer is often responsible for decomposition, many


languages/runtimes take responsibility for assignment.
Orchestration
▪ Involves:
- Structuring communication
- Adding synchronization to preserve dependencies if necessary
- Organizing data structures in memory
- Scheduling tasks

▪ Goals: reduce costs of communication/sync, preserve locality of


data reference, reduce overhead, etc.

▪ Machine details impact many of these decisions


- If synchronization is expensive, programmer might use it more sparsely
Mapping to hardware
▪ Mapping “threads” (“workers”) to hardware execution units
▪ Example 1: mapping by the operating system
- e.g., map a thread to HW execution context on a CPU core

▪ Example 2: mapping by the compiler


- Map ISPC program instances to vector instruction lanes

▪ Example 3: mapping by the hardware


- Map CUDA thread blocks to GPU cores (discussed in a future lecture)

▪ Many interesting mapping decisions:


- Place related threads (cooperating threads) on the
same core (maximize locality, data sharing, minimize
costs of comm/sync)
- Place unrelated threads on the same core (one might be bandwidth limited and another might
be compute limited) to use machine more efficiently
Creating a parallel program
● The first step in developing a parallel algorithm is to decompose the
problem into tasks that can be executed concurrently.
● A given problem may be decomposed into tasks in many different
ways.
● Tasks may be of the same, different, or even indeterminate sizes.
● A decomposition can be illustrated in the form of a directed graph
with nodes corresponding to tasks and edges indicating that the
result of one task is required for processing the next. Such a graph is
called a task dependency graph
Task Dependency Graph
Example: Multiplying a Dense Matrix with a Vector
Example: Database Query Processing
Different task decompositions may lead to significant differences with respect to their eventual parallel performance.
Granularity of Task Decompositions
● The number of tasks into which a problem is decomposed
determines its granularity.
● Decomposition into a large number of tasks results in fine grained
decomposition and that into a small number of tasks results in a
coarse grained decomposition
Abstraction of the task graph of query
Degree of Concurrency
● The number of tasks that can be executed in parallel is the degree of
concurrency of a decomposition.
● Since the number of tasks that can be executed in parallel may
change over program execution, the maximum degree of concurrency
is the maximum number of such tasks at any point during execution.
● What is the maximum degree of concurrency of the database
query examples?
● The average degree of concurrency is the average number of tasks
that can be processed in parallel over the execution of the program.
Assuming that each task in the database example takes identical
processing time, what is the average degree of concurrency in each
decomposition?
● The degree of concurrency increases as the decomposition becomes
finer in granularity and vice versa
What are the critical path lengths for the two task dependency graphs? If each
task takes 10 time units, what is the shortest parallel execution time for each
decomposition? How many processors are needed in each case to achieve
this minimum parallel execution time? What is the maximum degree of
concurrency?
Critical Path Length
● A directed path in the task dependency graph represents a sequence
of tasks that must be processed one after the other.
● The longest such path determines the shortest time in which the
program can be executed in parallel.
● The length of the longest path in a task dependency graph is called
the critical path length
Limits on Parallel Performance
● It would appear that the parallel time can be made arbitrarily small by
making the decomposition finer in granularity.
● There is an inherent bound on how fine the granularity of a
computation can be. For example, in the case of multiplying a dense
matrix with a vector, there can be no more than (n^2) concurrent
tasks.
● Concurrent tasks may also have to exchange data with other tasks.
This results in communication overhead. The tradeoff between the
granularity of a decomposition and associated overheads often
determines performance bounds.
Task Interaction Graph
● Subtasks generally exchange data with others in a decomposition.
For example, even in the trivial decomposition of the dense matrix-
vector product, if the vector is not replicated across all tasks, they
will have to communicate elements of the vector.
● The graph of tasks (nodes) and their interactions/data exchange
(edges) is referred to as a task interaction graph.
● Note that task interaction graphs represent data dependencies,
whereas task dependency graphs represent control dependencies.
Task Interaction Graphs: An Example
Task Interaction Graphs, Granularity, and Communication
● In general, if the granularity of a decomposition is finer, the
associated overhead (as a ratio of useful work associated with a task)
increases.

y[0] = A[0, 0].b[0] + A[0, 1].b[1] + A[0, 4].b[4] + A[0, 8].b[8].


Creating a parallel program
Processes and Mapping
• In general, the number of tasks in a decomposition exceeds
the number of processing elements available.

• For this reason, a parallel algorithm must also provide a


mapping of tasks to processes.

Note: We refer to the mapping as being from tasks to processes, as opposed


to processors. This is because typical programming APIs, as we shall see, do
not allow easy binding of tasks to physical processors. Rather, we aggregate
tasks into processes and rely on the system to map these processes to physical
processors. We use processes, not in the UNIX sense of a process, rather, simply
as a collection of tasks and associated data.
Processes and Mapping
• Appropriate mapping of tasks to processes is critical to the
parallel performance of an algorithm.

• Mappings are determined by both the task dependency and


task interaction graphs.

• Task dependency graphs can be used to ensure that work


is equally spread across all processes at any point (minimum
idling and optimal load balance).

• Task interaction graphs can be used to make sure that


processes need minimum interaction with other processes
(minimum communication).
Processes and Mapping
An appropriate mapping must minimize parallel execution time by:

• Mapping independent tasks to different processes.

• Assigning tasks on critical path to processes as soon as they become


available.

• Minimizing interaction between processes by mapping tasks


with dense interactions to the same process.

Note: These criteria often conflict with each other. For example, a
decomposition into one task (or no decomposition at all) minimizes
interaction but does not result in a speedup at all! Can you think of other
such conflicting cases?
(Figure: two mappings, (a) and (b), of the database-query task dependency
graph onto four processes P0-P3. In both, Tasks 1-4, of 10 time units each,
are assigned one per process; the remaining Tasks 5-7 are then assigned as
the levels of the graph complete.)

Mapping tasks in the database query decomposition to


processes. These mappings were arrived at by viewing the
dependency graph in terms of levels (no two nodes in a level
have dependencies). Tasks within a single level are then assigned
to different processes.
Eg: Task Dependency Graph, Task Interaction
Graph, Processing, and Mapping
1. Task Dependency Graph
• A task dependency graph shows the order in which tasks must be
executed based on their dependencies. It's a directed acyclic
graph (DAG) where:
• Nodes represent tasks
• Edges represent dependencies (task A must complete before task
B can start)
General Rules for Task Dependency Graphs
1. Node Representation:
• Each node should represent a meaningful unit of computation
• Nodes should be roughly equal in computational weight (for load balancing)
2. Edge Representation:
• Directed edges must show true dependencies only
• Edge A→B means "A must complete before B can start"
3. Critical Path:
• Identify the longest path through the graph (determines minimum execution time)
• Focus optimization efforts on critical path tasks
4. Granularity:
• Too fine-grained: High overhead from dependency management
• Too coarse-grained: Poor load balancing and processor utilization
5. Parallelism Potential:
• Independent tasks (no connecting edges) can execute in parallel
• The more independent subgraphs, the more parallelism available
Example: Matrix Multiplication (C = A × B)
• For multiplying two 2x2 matrices, the computation involves:
c11 = a11b11 + a12b21
c12 = a11b12 + a12b22
c21 = a21b11 + a22b21
c22 = a21b12 + a22b22
• The dependency graph would look like the following (figure).
2. Task Interaction Graph
• A task interaction graph shows which tasks need to communicate or
share data during execution. Unlike the dependency graph, edges here
represent communication requirements rather than execution order.
• For our matrix multiplication example, if we assign:
• Task 1 computes c11
• Task 2 computes c12
• Task 3 computes c21
• Task 4 computes c22
• The interaction graph shows they all need access to elements of A and B:
General Rules for Task Interaction Graphs
1. Node Representation:
• Nodes represent tasks or processes
• Size/weight can indicate computation load
2. Edge Representation:
• Undirected edges show communication requirements
• Edge weight can indicate communication volume/frequency
3.Partitioning Goals:
• Minimize edge cuts between partitions (reduce communication)
• Balance computational load across partitions
• Group highly interacting tasks together
4. Mapping Considerations:
• Tasks with heavy interaction should be mapped to nearby processors
• Consider physical network topology when mapping
5. Communication Patterns:
• Identify broadcast, reduction, or point-to-point patterns
• Optimize for the dominant communication pattern
3. Processing and Mapping
• Processing refers to how tasks are executed on processors, while
mapping is the assignment of tasks to processors.

Example code: eg0.c


How This Example Demonstrates the Concepts:
Task Dependency Graph:
• Each C[i][j] computation is independent and can be computed in
parallel
• No edges between tasks (fully parallelizable)
Task Interaction Graph:
• All tasks need read access to matrices A and B
• Each task writes to different location in C (no write conflicts)
Processing and Mapping:
• 4 tasks mapped to 4 threads (1:1 mapping)
• Each thread computes one element of the result matrix
• No communication needed between threads during computation
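The file eg0.c itself is not included in these notes; the following minimal OpenMP sketch is consistent with the description above (matrix values and names are illustrative): four threads each compute one element of the 2x2 result.

#include <stdio.h>
#include <omp.h>

int main(void) {
    double A[2][2] = {{1, 2}, {3, 4}};
    double B[2][2] = {{5, 6}, {7, 8}};
    double C[2][2];

    #pragma omp parallel num_threads(4)
    {
        int t = omp_get_thread_num();   /* task id 0..3 */
        int i = t / 2, j = t % 2;       /* task t computes C[i][j] */
        /* All tasks read A and B (task interaction), but each writes a
           distinct C[i][j], so there are no write conflicts. */
        C[i][j] = A[i][0] * B[0][j] + A[i][1] * B[1][j];
    }

    printf("C = [[%g, %g], [%g, %g]]\n", C[0][0], C[0][1], C[1][0], C[1][1]);
    return 0;
}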
Key Differences Between the Graphs:
Feature | Task Dependency Graph | Task Interaction Graph
Purpose | Show execution order | Show communication needs
Edge Meaning | "Must complete before" | "Needs to communicate with"
Used for | Scheduling | Load balancing, partitioning
Edges in our example | No edges (independent) | All connected to A, B
Mapping Techniques in Parallel Computing
• Mapping refers to the process of assigning tasks to processors in a
parallel system. Here are the main techniques:
1. Static Mapping
• Assignment is determined before execution begins
• Best for regular, predictable workloads
Types:
• Block Mapping: Contiguous chunks of data/tasks to processors
• Cyclic Mapping: Tasks assigned to processors in round-robin
fashion
• Block-Cyclic Mapping: Combination of block and cyclic
• Block Mapping: Contiguous chunks of data/tasks to processors

Eg1.c

• Cyclic Mapping: Tasks assigned to processors in round-robin fashion

Eg2.c

• Block-Cyclic Mapping: Combination of block and cyclic

Eg3.c
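The files Eg1.c-Eg3.c are not included here; the small sketch below (task count, processor count, and block size are assumptions) simply prints which processor each task index would be assigned to under the three static mappings.

#include <stdio.h>

#define N 16   /* number of tasks/iterations */
#define P 4    /* number of processors */
#define B 2    /* block size for block-cyclic mapping */

int main(void) {
    for (int i = 0; i < N; i++) {
        int block_owner        = i / (N / P);   /* block: contiguous chunks of N/P */
        int cyclic_owner       = i % P;         /* cyclic: round-robin, one at a time */
        int block_cyclic_owner = (i / B) % P;   /* block-cyclic: round-robin in chunks of B */
        printf("task %2d -> block: P%d  cyclic: P%d  block-cyclic: P%d\n",
               i, block_owner, cyclic_owner, block_cyclic_owner);
    }
    return 0;
}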
Comparison Table

Mapping | Pros | Cons | Best Use Cases
Block | Good locality, low overhead | Poor load balancing if uneven | Regular, uniform computations
Cyclic | Excellent load balancing | Poor locality | Irregular computations
Block-Cyclic | Balances locality and load balancing | Slightly more complex | Moderately irregular workloads
2. Dynamic Mapping
Assignment occurs during runtime

• Better for irregular or unpredictable workloads

Types:

• Centralized Task Queue: Master maintains task queue, workers request tasks

• Distributed Task Queue: Each processor maintains its own task queue

• Work Stealing: Idle processors steal work from busy ones


Decomposition Techniques
• So how does one decompose a task into various subtasks?

• While there is no single recipe that works for all


problems, we present a set of commonly used techniques
that apply to broad classes of problems. These include:
• recursive decomposition

• data decomposition

• exploratory decomposition

• speculative decomposition
Recursive Decomposition: Example
• The problem of finding the minimum number in a given list (or
indeed any other associative operation such as sum, AND, etc.) can be
fashioned as a divide-and-conquer algorithm. The following algorithm
illustrates this.

• We first start with a simple serial loop for computing the minimum entry
in a given list; this loop can be rewritten as the following recursive
procedure:
procedure RECURSIVE MIN (A, n)
begin
  if (n = 1) then
    min := A[0];
  else
    lmin := RECURSIVE MIN (A, n/2);
    rmin := RECURSIVE MIN (&(A[n/2]), n − n/2);
    if (lmin < rmin) then
      min := lmin;
    else
      min := rmin;
    endelse;
  endelse;
  return min;
end RECURSIVE MIN
The code in the previous foil can be decomposed naturally using
a recursive decomposition strategy. We illustrate this with the
following example of finding the minimum number in the set {4,
9, 1, 7, 8, 11, 2, 12}. The task dependency graph associated with
this computation is as follows:

min(1,2)

min(4,1) min(8,2)

min(4,9) min(1,7) min(8,11) min(2,12)


Data Decomposition
• Data decomposition is a fundamental technique in parallel computing
where a large dataset or problem is divided into smaller parts that can be
processed concurrently by different processors or threads.
• The goal is to distribute the computational workload evenly across
available resources to achieve speedup.

• Types of Data Decomposition


• Domain Decomposition: Dividing the data space into subdomains

• Functional Decomposition: Dividing based on operations to be


performed

• Recursive Decomposition: Dividing problems into subproblems


recursively
Example: Parallel Vector Addition in C
• Eg4.c
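Eg4.c is not reproduced in these notes; a minimal OpenMP sketch of data-decomposed vector addition (array size and names are assumptions), in which each thread receives a contiguous block of the arrays, might look like this:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1000000

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Data decomposition: the iteration space (and hence the arrays) is split
       into contiguous blocks, one per thread, by the static schedule. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);   /* expect 3*(N-1) */
    free(a); free(b); free(c);
    return 0;
}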
• Benefits of This Approach
• Load Balancing: Work is evenly distributed among threads
• Memory Locality: Each thread works on contiguous memory locations
• Scalability: Can take advantage of multiple cores/processors
• Efficiency: Reduces communication overhead as each thread has its own
data segment
Metrics for Parallel Algorithm Design
1. Speedup (S)
• Definition: Ratio of sequential execution time to parallel execution
time.
Formula: S = T_sequential / T_parallel

Demo eg5.c
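eg5.c is not included here; the sketch below (workload and sizes are illustrative) shows one way to measure speedup by timing the same loop serially and with OpenMP:

#include <stdio.h>
#include <math.h>
#include <omp.h>

#define N 50000000

int main(void) {
    double sum;

    /* Sequential timing */
    double t0 = omp_get_wtime();
    sum = 0.0;
    for (int i = 0; i < N; i++) sum += sin(i * 1e-6);
    double t_seq = omp_get_wtime() - t0;

    /* Parallel timing of the same computation */
    t0 = omp_get_wtime();
    sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) sum += sin(i * 1e-6);
    double t_par = omp_get_wtime() - t0;

    printf("T_seq = %.3f s, T_par = %.3f s, speedup = %.2f\n",
           t_seq, t_par, t_seq / t_par);
    return 0;
}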
Continue…
2. Efficiency (E)
• Definition: Speedup per processor (how well resources are
utilized).

Formula: E = S / P (where P is number of processors)

Demo :eg6.c
Continue…
3. Scalability
Definition: How performance changes with increasing processors.

Strong Scaling (fixed problem size)

Demo : eg7.c
Continue…
4. Load Balance
Definition: How evenly work is distributed among processors.

Demo : eg8.c
Continue…
5. Communication Overhead
Definition: Time spent transferring data between processes.

MPI Example (Matrix-Vector Multiplication)

Demo : eg19.c
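eg19.c is not reproduced here; below is a minimal MPI sketch of a row-block matrix-vector multiplication (matrix size, data, and names are assumptions) in which the communication (broadcast of the vector and gather of the result) is timed separately from the local computation:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 512   /* matrix is N x N; assumed divisible by the number of processes */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int rows = N / p;                                   /* row block per process */
    double *x = malloc(N * sizeof(double));
    double *Ablock = malloc((size_t)rows * N * sizeof(double));
    double *yblock = malloc(rows * sizeof(double));
    double *y = (rank == 0) ? malloc(N * sizeof(double)) : NULL;

    if (rank == 0)
        for (int i = 0; i < N; i++) x[i] = 1.0;         /* sample vector */
    for (int i = 0; i < rows * N; i++) Ablock[i] = 1.0; /* each rank fills its rows */

    double t_start = MPI_Wtime(), t_comm = 0.0, t0;

    t0 = MPI_Wtime();
    MPI_Bcast(x, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);     /* communication: vector */
    t_comm += MPI_Wtime() - t0;

    for (int i = 0; i < rows; i++) {                    /* local computation */
        yblock[i] = 0.0;
        for (int j = 0; j < N; j++)
            yblock[i] += Ablock[i * N + j] * x[j];
    }

    t0 = MPI_Wtime();
    MPI_Gather(yblock, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    t_comm += MPI_Wtime() - t0;                         /* communication: result */

    if (rank == 0)
        printf("y[0] = %f, total = %f s, communication = %f s\n",
               y[0], MPI_Wtime() - t_start, t_comm);

    free(x); free(Ablock); free(yblock); free(y);       /* free(NULL) is a no-op */
    MPI_Finalize();
    return 0;
}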
6. Granularity
Definition: Ratio of computation to communication.
Fine-grained vs. coarse-grained examples:

Demo eg9.c
Continue…
7. Overhead
Definition: Extra work beyond sequential algorithm.

Demo: eg10.c
Metric Summary Table

METRIC | FORMULA | IDEAL VALUE | MEASUREMENT METHOD
Speedup | T_seq/T_par | P (linear) | Time both versions
Efficiency | S/P | 100% | Speedup/processors
Scalability | - | Maintain efficiency | Vary processors
Load Balance | - | Equal work/time | Measure per-thread work
Comm. Overhead | T_comm/T_total | Near 0% | Isolate communication
Granularity | T_comp/T_sync | High ratio | Profile computation vs sync
Overhead | (T_par*P - T_seq)/T_seq | Near 0% | Compare scaled times


Problem 1: Speedup & Efficiency
You parallelize a matrix multiplication algorithm (C = A × B) using
OpenMP. The sequential version takes 120 sec for N=2000. When
running on 4 cores, the parallel version takes 40 sec.

Calculate the speedup and efficiency.

If the efficiency drops to 60% when using 8 cores, what is the new
parallel runtime?
Solution
• Speedup (S) = T_seq / T_par = 120 / 40 = 3
• Efficiency (E) = S / P = 3 / 4 = 0.75 (75%)
• If efficiency is 60% at 8 cores:
• E = S / P ⇒ 0.6 = S / 8 ⇒ S = 4.8
• S = T_seq / T_par ⇒ 4.8 = 120 / T_par ⇒ T_par = 25 sec
Load Balancing
• An OpenMP parallel loop processes an array of size N=100 with 4
threads. The work per element is uneven:
• Elements 0-49: 1 ms each
• Elements 50-99: 10 ms each

• Compare static (default) and dynamic scheduling:

• What is the worst-case runtime for static scheduling?

• How does dynamic scheduling improve this?


Solution
Static scheduling (default block distribution):
• Thread 0: 0-24 (25 elements × 1 ms = 25 ms)
• Thread 1: 25-49 (25 elements × 1 ms = 25 ms)
• Thread 2: 50-74 (25 elements × 10 ms = 250 ms)
• Thread 3: 75-99 (25 elements × 10 ms = 250 ms)
Worst-case runtime = 250 ms (due to imbalance).
• Dynamic scheduling (chunk size=1):
• Threads grab elements one-by-one.
• Fast threads process more 1 ms tasks; slow threads process fewer 10 ms
tasks.
• Runtime ≈ (50×1 + 50×10)/4 = 137.5 ms (better balance).
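A small OpenMP sketch of this scenario (work simulated with usleep; the per-element times are taken from the problem statement): changing the schedule clause from static to dynamic is the only modification needed to compare the two runtimes.

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

#define N 100

int main(void) {
    double t0 = omp_get_wtime();

    /* Replace schedule(static) with schedule(dynamic, 1) to compare runtimes. */
    #pragma omp parallel for num_threads(4) schedule(static)
    for (int i = 0; i < N; i++) {
        /* elements 0-49 take ~1 ms each, elements 50-99 take ~10 ms each */
        usleep(i < 50 ? 1000 : 10000);
    }

    printf("elapsed = %.1f ms\n", (omp_get_wtime() - t0) * 1000.0);
    return 0;
}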
Communication Overhead (MPI)
• An MPI program computes a distributed vector dot product. Each
of 4 processes holds 250,000 elements. The computation time
per element is 1 µs, and each process sends 1 KB of data to rank 0
for reduction. The network latency is 50 µs, and bandwidth is 100
MB/s.

• Calculate the total runtime (computation + communication).

• What % of time is spent on communication?


Solution
• Computation time = 250,000 × 1 µs = 0.25 sec
• Communication time:
• Latency: 50 µs
• Transfer time: 1 KB / 100 MB/s = 0.01 ms
• Total per process: 50 µs + 10 µs = 60 µs
• Total runtime ≈ 0.25 sec + 60 µs ≈ 0.25006 sec
• Communication % = (60 µs / 0.25 sec) × 100 ≈ 0.024%
Scalability
• A parallel algorithm shows the following strong scaling results:
Cores Runtime (sec)
1 100
2 55
4 32
8 20

• Calculate speedup and efficiency for each case.

• Is the system scaling well? Justify.


• Calculations:
Cores Speedup (S) Efficiency (E)
1 1.0 100%
2 100/55=1.82 1.82/2=91%
4 100/32=3.13 3.13/4=78%
8 100/20=5.0 5.0/8=62.5%

• Scaling analysis:

• Efficiency drops as cores increase (due to overhead).

• Sub-linear scaling but reasonable up to 4 cores.


Granularity & Overhead
• A task-parallel program processes 10,000 tasks on 8 cores. Each
task takes 50 µs to compute. The scheduler assigns tasks in
chunks of K (varied below). Task scheduling overhead is 10 µs per
chunk.
Chunk Size (K) Avg. Runtime (sec)
1 0.75
10 0.15
100 0.12
• Derive the theoretical runtime formula in terms of K.
• Why does K=100 not improve over K=10 significantly?
• What is the optimal chunk size?
Theoretical runtime:
• Total chunks = 10,000 / K
• Scheduling overhead = (10,000 / K) × 10 µs
• Computation time = 10,000 × 50 µs / 8 cores = 0.0625 sec
• Runtime = max(overhead, computation)
K=100 vs K=10:
• At K=10: Overhead = (10,000/10)×10 µs = 0.01 sec (dominated by
computation).
• At K=100: Overhead = 0.001 sec (negligible).
• Further increasing K provides diminishing returns.
Optimal chunk size:
• Balance overhead and load balancing.
• K=10 is optimal here (low overhead, good balance).
Summary

Question | Metric Covered | Key Insight
1 | Speedup/Efficiency | Tradeoff between cores and efficiency
2 | Load Balancing | Dynamic scheduling improves balance
3 | Communication Overhead | Latency vs bandwidth impact
4 | Scalability | Sub-linear scaling trends
5 | Granularity & Overhead | Optimal chunk size selection
Resources
• ISPC: Intel® Implicit SPMD Program Compiler Example
• A Grama, A Gupta, G Karypis, V Kumar. Introduction to Parallel
Computing, Addison Wesley (2003). Chapter 3—3.2.2, 3.3 and 3.4
until block-cyclic distribution.

• https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial
Thank You
