
Parallel and Distributed Computing

UCS645
Module 3
Saif Nalband
Contents
Parallel Decomposition and Parallel Performance:
Principles of Parallel Algorithm Design: Decomposition Techniques,
Characteristics of Tasks and Interactions, Mapping Techniques for Load
Balancing. Critical Paths, Sources of Overhead in Parallel Programs,
Performance metrics for parallel algorithm implementations,
Performance measurement, The Effect of Granularity on Performance.
Basic Terminology
● Fragment: sequence of executed instructions.
● Two executing fragments may proceed in parallel in:
○ Complete synchrony
○ Complete independence
○ Occasional synchrony
● Task: a sequence of execution, or more accurately, instructions retired
sequentially
● Granularity: the number of steps in a task relative to those in the
complete parallel program.
Basics
● Fine-grained: a decomposition into a large number of small tasks
● Coarse-grained: a decomposition into a small number of large tasks
● Coarse-grained tasks are relatively longer; fine-grained tasks are shorter
Computational model
Efficiency: two metrics
● Asymptotic analysis
● Concrete: how well the algorithm implementation behaves on the
available hardware and data size

● Asymptotic analysis oversimplifies several complex dynamics (such as
caches, out-of-order execution on multiple engines, and instruction
dependencies).

● Abstract part of the analysis: we employ big-O notation to describe the
number of steps an algorithm takes as a function of the input size n and
the number of processors p.
SIMPLE Parallel Model
RAM: sequential model; each step takes unit time.
● A parallel system consists of p sequential processors; p is variable and
may be chosen as a function of n.
● Each processor has access to an unbounded number of constant-sized
local memory locations, which are not accessible to other processors.
● Each processor can read from or write to any local memory location
in unit time.
● Communicating a constant-sized message from processor i to
processor j takes unit time.
● Each processor takes unit time to perform simple arithmetic and
logical operations on constant-sized operands.
Shortcomings of the SIMPLE Parallel Model
● The time taken by the network for message transmission is not modeled,
and the cost of synchronization is ignored.
● Instead it assumes that if a message addressed to processor i is sent
by some other processor, it arrives instantaneously and
processor i spends one time unit reading it.
● Only the time spent reading is counted.
● It corresponds to a distributed-memory model.
Bulk Synchronous Parallel (BSP) Model
It provides a structured approach to parallel computation by dividing
execution into supersteps, each followed by a synchronization barrier.
Key Components of BSP
● Processors (or Workers)

○ Each processor performs computations independently on local data.


● Communication Network
○ Processors exchange messages during computation.
● Barrier Synchronization
○ After each superstep, all processors synchronize before proceeding.
BSP Execution Model
A BSP computation consists of a sequence of supersteps, each with three
phases:
1. Computation Phase
1. Each processor performs local computations on its data.
2. Can send/receive messages to/from other processors.
2. Communication Phase
1. Messages are exchanged between processors.
2. Messages sent in superstep S are available only in superstep S+1.
3. Barrier Synchronization
1. All processors wait until every other processor finishes its computation and
communication.
2. Ensures no processor moves ahead before others complete the superstep.
BSP Cost Model Formula
The total time for a BSP computation with S supersteps is:
T = Σ_{i=1..S} (w_i + g·h_i + l)
For a single superstep, the time is calculated as:
T_superstep = w + g·h + l
where w is the maximum local computation performed by any processor, h is the
maximum number of messages sent or received by any processor, g is the cost of
communicating one message, and l is the barrier synchronization cost (the
standard BSP cost parameters).
Example: Dot Product of Two Vectors
● Problem: Compute the dot product of two vectors a and b of
size N using p processors.
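The accompanying example file is not reproduced here. As an illustration, a minimal MPI sketch of this computation (vector length, data values, and variable names are assumptions) follows the BSP pattern: a local-computation superstep producing partial sums, then a communication-and-synchronization step that combines them at rank 0.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000  /* total vector length (assumed divisible by p) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local_n = N / p;                 /* each processor owns N/p elements */
    double *a = malloc(local_n * sizeof(double));
    double *b = malloc(local_n * sizeof(double));
    for (int i = 0; i < local_n; i++) {  /* fill local blocks with sample data */
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* Superstep 1 (computation): each processor computes its local partial sum */
    double local_sum = 0.0;
    for (int i = 0; i < local_n; i++)
        local_sum += a[i] * b[i];

    /* Superstep 2 (communication + barrier): partial sums are combined at rank 0 */
    double dot = 0.0;
    MPI_Reduce(&local_sum, &dot, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product = %f\n", dot);

    free(a); free(b);
    MPI_Finalize();
    return 0;
}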
Parallel RAM (PRAM)
● The Parallel Random Access Machine (PRAM) model is a theoretical
framework for designing and analyzing parallel algorithms.
● It generalizes the Random Access Machine (RAM) model by introducing
multiple processors that can synchronously access a shared memory.
Below is an explanation of the PRAM model.
Key Features of PRAM
1. Processors: The PRAM consists of an unbounded collection of
processors (P0, P1, P2, …). Each processor has its own
local memory and knows its unique index.
2. Shared Memory: All processors can access a global shared memory in
unit time. Inputs and outputs are stored in shared memory cells.
3. Execution Model: Each processor executes instructions synchronously
in three phases:
■ Read: Read data from shared memory.
■ Compute: Perform local computations.
■ Write: Write results to shared memory.
4. Idealization: The PRAM assumes unlimited processors and ignores
practical concerns like communication delays, synchronization
overhead, or memory access latency.
Memory Access Conflict Models
To address simultaneous access to the same memory location, PRAM
defines four submodels:
● Exclusive Read Exclusive Write (EREW): No two processors can read or
write to the same memory cell simultaneously.
● Concurrent Read Exclusive Write (CREW): Multiple processors can read
the same memory cell simultaneously, but only one processor can write
at a time.
● Exclusive Read Concurrent Write (ERCW): Multiple processors can write
to the same memory cell simultaneously, but no two processors can read
it at the same time.
Concurrent Read Concurrent Write (CRCW):
• Both concurrent reads and writes are allowed
• Variants include:
• Common: All processors writing must write the same value.
• Arbitrary: One processor’s value is arbitrarily chosen.
• Priority: The processor with the lowest index writes its value
Example: Dot product of two arrays
● The dot product of two arrays in the PRAM model leverages parallel
computation and shared memory to achieve efficient results. Below is a
step-by-step example using the EREW PRAM model (Exclusive
Read/Exclusive Write), where no two processors access the same
memory cell simultaneously.
● Problem Setup
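The problem-setup figure is not reproduced here. As an illustration of the EREW-style algorithm, the sketch below (array contents and names are assumptions) computes the pairwise products in parallel and then combines them in log2(N) rounds; within each round every active processor reads and writes distinct cells, so no concurrent access to the same location occurs.

#include <stdio.h>
#include <omp.h>

#define N 8  /* assume N is a power of two */

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    double partial[N];

    /* Step 1: processor i computes a[i]*b[i] into its own cell (exclusive writes) */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        partial[i] = a[i] * b[i];

    /* Step 2: log2(N) combining rounds; in each round processor i reads cells
       i and i+stride and writes cell i, so reads and writes never collide */
    for (int stride = 1; stride < N; stride *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < N; i += 2 * stride)
            partial[i] += partial[i + stride];
    }

    printf("dot product = %f\n", partial[0]);  /* result ends up in cell 0 */
    return 0;
}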
BSP VS PRAM
Creating a parallel program
▪ Your thought process:
1. Identify work that can be performed in parallel
2. Partition work (and also data associated with the work)
3. Manage data access, communication, and synchronization

▪ A common goal is maximizing speedup*

For a fixed computation:

Speedup(P processors) = Time(1 processor) / Time(P processors)

* Other goals include achieving high efficiency (cost, area, power, etc.) or
working on bigger problems than can fit on one machine
Problem decomposition
▪ Break up problem into tasks that can be carried out in parallel
▪ In general: create at least enough tasks to keep all execution
units on a machine busy

Key challenge of decomposition: identifying dependencies (or... a lack of dependencies)
Amdahl’s Law: dependencies limit maximum
speedup due to parallelism

▪ You run your favorite sequential program...

▪ Let S = the fraction of sequential execution that is inherently


sequential (dependencies prevent parallel execution)

▪ Then maximum speedup due to parallel execution ≤ 1/S


A simple example
▪ Consider a two-step computation on an N x N image
- Step 1: multiply brightness of all pixels by two
(independent computation on each pixel)
- Step 2: compute average of all pixel values

▪ Sequential implementation of program
- Both steps take ~N² time, so total time is ~2N²

(Diagram: execution time vs. parallelism for the sequential program; each step
runs at parallelism 1 and takes N² time.)
First attempt at parallelism (P processors)
▪ Strategy:
- Step 1: execute in parallel
- time for phase 1: N²/P
- Step 2: execute serially
- time for phase 2: N²

▪ Overall performance:
- Speedup ≤ 2

(Diagram: the parallel program runs step 1 at parallelism P for N²/P time,
followed by step 2 at parallelism 1 for N² time.)
Parallelizing step 2
▪ Strategy:
- Step 1: execute in parallel
- time for phase 1: N²/P
- Step 2: compute partial sums in parallel, combine results serially
- time for phase 2: N²/P + P

▪ Overall performance:
- Speedup = 2N² / (2N²/P + P)
- Overhead of parallel algorithm: combining the partial sums
- Note: speedup → P when N >> P

(Diagram: the parallel program runs both phases at parallelism P, with a short
serial combine at the end.)
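A minimal OpenMP sketch of this two-phase strategy (image size, data, and variable names are assumptions, not taken from the slides): each thread doubles its share of the pixels, accumulates a private partial sum, and the P partial sums are combined serially at the end.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1024  /* image is N x N */

int main(void) {
    double *img = malloc((size_t)N * N * sizeof(double));
    for (long i = 0; i < (long)N * N; i++)
        img[i] = 1.0;

    int P = omp_get_max_threads();
    double *partial = calloc(P, sizeof(double));

    #pragma omp parallel
    {
        int t = omp_get_thread_num();

        /* Phase 1: brighten pixels in parallel (time ~ N^2 / P) */
        #pragma omp for
        for (long i = 0; i < (long)N * N; i++)
            img[i] *= 2.0;

        /* Phase 2a: each thread computes a partial sum (time ~ N^2 / P) */
        double s = 0.0;
        #pragma omp for
        for (long i = 0; i < (long)N * N; i++)
            s += img[i];
        partial[t] = s;
    }

    /* Phase 2b: combine the P partial sums serially (time ~ P) */
    double sum = 0.0;
    for (int t = 0; t < P; t++)
        sum += partial[t];

    printf("average = %f\n", sum / ((double)N * N));
    free(img); free(partial);
    return 0;
}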
Amdahl’s law
▪ Let S = the fraction of total work that is inherently
sequential
▪ Max speedup on P processors given by:
speedup ≤ 1 / (S + (1 − S)/P)

(Plot: max speedup vs. number of processors for S = 0.01, 0.05, and 0.1; each
curve saturates at 1/S.)
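Worked example (illustrative numbers): with S = 0.1 and P = 8, max speedup = 1 / (0.1 + 0.9/8) ≈ 4.7, and even with unlimited processors the speedup cannot exceed 1/S = 10.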
Decomposition
▪ Who is responsible for decomposing a program into independent
tasks?
- In most cases: the programmer

▪ Automatic decomposition of sequential programs continues to be a


challenging research problem (very difficult in the general case)
- What if dependencies are data dependent (not known at compile time)?
- Compiler must analyze program, identify dependencies
- Researchers have had modest success with simple loop nests
- The “magic parallelizing compiler” for complex, general-purpose code has not
yet been achieved
Assignment
▪ Assigning tasks to workers
- Think of “tasks” as things to do
- What are “workers”? (Might be threads, program instances, vector
lanes, etc.)

▪ Goals: achieve good workload balance, reduce communication costs

▪ Can be performed statically (before application is run), or


dynamically as program executes

▪ Although programmer is often responsible for decomposition, many


languages/runtimes take responsibility for assignment.
Orchestration
▪ Involves:
- Structuring communication
- Adding synchronization to preserve dependencies if necessary
- Organizing data structures in memory
- Scheduling tasks

▪ Goals: reduce costs of communication/sync, preserve locality of


data reference, reduce overhead, etc.

▪ Machine details impact many of these decisions


- If synchronization is expensive, programmer might use it more sparsely
Mapping to hardware
▪ Mapping “threads” (“workers”) to hardware execution units
▪ Example 1: mapping by the operating system
- e.g., map a thread to HW execution context on a CPU core

▪ Example 2: mapping by the compiler


- Map ISPC program instances to vector instruction lanes

▪ Example 3: mapping by the hardware


- Map CUDA thread blocks to GPU cores (discussed in a future lecture)

▪ Many interesting mapping decisions:


- Place related threads (cooperating threads) on the
same core (maximize locality, data sharing, minimize
costs of comm/sync)
- Place unrelated threads on the same core (one might be bandwidth limited and another might
be compute limited) to use machine more efficiently
Creating a parallel program
● The first step in developing a parallel algorithm is to decompose the
problem into tasks that can be executed concurrently.
● A given problem may be decomposed into tasks in many different
ways.
● Tasks may be of the same, different, or even indeterminate sizes.
● A decomposition can be illustrated in the form of a directed graph
with nodes corresponding to tasks and edges indicating that the
result of one task is required for processing the next. Such a graph is
called a task dependency graph
Task Dependency Graph
Example: Multiplying a Dense Matrix with a Vector
Example: Database Query Processing
Different task decompositions may lead to significant differences with respect to their eventual parallel performance.
Granularity of Task Decompositions
● The number of tasks into which a problem is decomposed
determines its granularity.
● Decomposition into a large number of tasks results in fine grained
decomposition and that into a small number of tasks results in a
coarse grained decomposition
Abstraction of the task graph of query
Degree of Concurrency
● The number of tasks that can be executed in parallel is the degree of
concurrency of a decomposition.
● Since the number of tasks that can be executed in parallel may
change over program execution, the maximum degree of concurrency
is the maximum number of such tasks at any point during execution.
● What is the maximum degree of concurrency of the database
query examples?
● The average degree of concurrency is the average number of tasks
that can be processed in parallel over the execution of the program.
Assuming that each task in the database example takes identical
processing time, what is the average degree of concurrency in each
decomposition?
● The degree of concurrency increases as the decomposition becomes
finer in granularity and vice versa
What are the critical path lengths for the two task dependency graphs? If each
task takes 10 time units, what is the shortest parallel execution time for each
decomposition? How many processors are needed in each case to achieve
this minimum parallel execution time? What is the maximum degree of
concurrency?
Critical Path Length
● A directed path in the task dependency graph represents a sequence
of tasks that must be processed one after the other.
● The longest such path determines the shortest time in which the
program can be executed in parallel.
● The length of the longest path in a task dependency graph is called
the critical path length
Limits on Parallel Performance
● It would appear that the parallel time can be made arbitrarily small by
making the decomposition finer in granularity.
● There is an inherent bound on how fine the granularity of a
computation can be. For example, in the case of multiplying a dense
matrix with a vector, there can be no more than (n^2) concurrent
tasks.
● Concurrent tasks may also have to exchange data with other tasks.
This results in communication overhead. The tradeoff between the
granularity of a decomposition and associated overheads often
determines performance bounds.
Task Interaction Graph
● Subtasks generally exchange data with others in a decomposition.
For example, even in the trivial decomposition of the dense matrix-
vector product, if the vector is not replicated across all tasks, they
will have to communicate elements of the vector.
● The graph of tasks (nodes) and their interactions/data exchange
(edges) is referred to as a task interaction graph.
● Note that task interaction graphs represent data dependencies,
whereas task dependency graphs represent control dependencies.
Task Interaction Graphs: An Example
Task Interaction Graphs, Granularity, and Communication
● In general, if the granularity of a decomposition is finer, the
associated overhead (as a ratio of useful work associated with a task)
increases.

y[0] = A[0, 0].b[0] + A[0, 1].b[1] + A[0, 4].b[4] + A[0, 8].b[8].


Creating a parallel program
Processes and Mapping
• In general, the number of tasks in a decomposition exceeds
the number of processing elements available.

• For this reason, a parallel algorithm must also provide a


mapping of tasks to processes.

Note: We refer to the mapping as being from tasks to processes, as opposed


to processors. This is because typical programming APIs, as we shall see, do
not allow easy binding of tasks to physical processors. Rather, we aggregate
tasks into processes and rely on the system to map these processes to physical
processors. We use processes, not in the UNIX sense of a process, rather, simply
as a collection of tasks and associated data.
Processes and Mapping
• Appropriate mapping of tasks to processes is critical to the
parallel performance of an algorithm.

• Mappings are determined by both the task dependency and


task interaction graphs.

• Task dependency graphs can be used to ensure that work


is equally spread across all processes at any point (minimum
idling and optimal load balance).

• Task interaction graphs can be used to make sure that


processes need minimum interaction with other processes
(minimum communication).
Processes and Mapping
An appropriate mapping must minimize parallel execution time by:

• Mapping independent tasks to different processes.

• Assigning tasks on critical path to processes as soon as they become


available.

• Minimizing interaction between processes by mapping tasks


with dense interactions to the same process.

Note: These criteria often conflict with each other. For example, a
decomposition into one task (or no decomposition at all) minimizes
interaction but does not result in a speedup at all! Can you think of other
such conflicting cases?
(Figure: two mappings, (a) and (b), of the database-query task dependency
graph onto four processes P0-P3. In both, Tasks 1-4, of 10 time units each,
are assigned one per process; the remaining Tasks 5-7 are then assigned as
the levels of the graph complete.)

Mapping tasks in the database query decomposition to


processes. These mappings were arrived at by viewing the
dependency graph in terms of levels (no two nodes in a level
have dependencies). Tasks within a single level are then assigned
to different processes.
Eg: Task Dependency Graph, Task Interaction
Graph, Processing, and Mapping
1. Task Dependency Graph
• A task dependency graph shows the order in which tasks must be
executed based on their dependencies. It's a directed acyclic
graph (DAG) where:
• Nodes represent tasks
• Edges represent dependencies (task A must complete before task
B can start)
General Rules for Task Dependency Graphs
1. Node Representation:
• Each node should represent a meaningful unit of computation
• Nodes should be roughly equal in computational weight (for load balancing)
2. Edge Representation:
• Directed edges must show true dependencies only
• Edge A→B means "A must complete before B can start"
3. Critical Path:
• Identify the longest path through the graph (determines minimum execution time)
• Focus optimization efforts on critical path tasks
4. Granularity:
• Too fine-grained: High overhead from dependency management
• Too coarse-grained: Poor load balancing and processor utilization
5. Parallelism Potential:
• Independent tasks (no connecting edges) can execute in parallel
• The more independent subgraphs, the more parallelism available
Example: Matrix Multiplication (C = A × B)
• For multiplying two 2x2 matrices, the computation involves:
c11 = a11b11 + a12b21
c12 = a11b12 + a12b22
c21 = a21b11 + a22b21
c22 = a21b12 + a22b22
• The dependency graph would look like the following (figure).
2. Task Interaction Graph
• A task interaction graph shows which tasks need to communicate or
share data during execution. Unlike the dependency graph, edges here
represent communication requirements rather than execution order.
• For our matrix multiplication example, if we assign:
• Task 1 computes c11
• Task 2 computes c12
• Task 3 computes c21
• Task 4 computes c22
• The interaction graph shows they all need access to elements of A and B:
General Rules for Task Interaction Graphs
1. Node Representation:
• Nodes represent tasks or processes
• Size/weight can indicate computation load
2. Edge Representation:
• Undirected edges show communication requirements
• Edge weight can indicate communication volume/frequency
3.Partitioning Goals:
• Minimize edge cuts between partitions (reduce communication)
• Balance computational load across partitions
• Group highly interacting tasks together
4. Mapping Considerations:
• Tasks with heavy interaction should be mapped to nearby processors
• Consider physical network topology when mapping
5. Communication Patterns:
• Identify broadcast, reduction, or point-to-point patterns
• Optimize for the dominant communication pattern
3. Processing and Mapping
• Processing refers to how tasks are executed on processors, while
mapping is the assignment of tasks to processors.

Example code: eg0.c


How This Example Demonstrates the Concepts:
Task Dependency Graph:
• Each C[i][j] computation is independent and can be computed in
parallel
• No edges between tasks (fully parallelizable)
Task Interaction Graph:
• All tasks need read access to matrices A and B
• Each task writes to different location in C (no write conflicts)
Processing and Mapping:
• 4 tasks mapped to 4 threads (1:1 mapping)
• Each thread computes one element of the result matrix
• No communication needed between threads during computation
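The file eg0.c itself is not included in these notes; the following minimal OpenMP sketch is consistent with the description above (matrix values and names are illustrative): four threads each compute one element of the 2x2 result.

#include <stdio.h>
#include <omp.h>

int main(void) {
    double A[2][2] = {{1, 2}, {3, 4}};
    double B[2][2] = {{5, 6}, {7, 8}};
    double C[2][2];

    #pragma omp parallel num_threads(4)
    {
        int t = omp_get_thread_num();   /* task id 0..3 */
        int i = t / 2, j = t % 2;       /* task t computes C[i][j] */
        /* All tasks read A and B (task interaction), but each writes a
           distinct C[i][j], so there are no write conflicts. */
        C[i][j] = A[i][0] * B[0][j] + A[i][1] * B[1][j];
    }

    printf("C = [[%g, %g], [%g, %g]]\n", C[0][0], C[0][1], C[1][0], C[1][1]);
    return 0;
}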
Key Differences Between the Graphs:
Feature | Task Dependency Graph | Task Interaction Graph
Purpose | Show execution order | Show communication needs
Edge Meaning | "Must complete before" | "Needs to communicate with"
Used for | Scheduling | Load balancing, partitioning
Edges in our example | No edges (independent) | All connected to A, B
Mapping Techniques in Parallel Computing
• Mapping refers to the process of assigning tasks to processors in a
parallel system. Here are the main techniques:
1. Static Mapping
• Assignment is determined before execution begins
• Best for regular, predictable workloads
Types:
• Block Mapping: Contiguous chunks of data/tasks to processors
• Cyclic Mapping: Tasks assigned to processors in round-robin
fashion
• Block-Cyclic Mapping: Combination of block and cyclic
• Block Mapping: Contiguous chunks of data/tasks to processors

Eg1.c

• Cyclic Mapping: Tasks assigned to processors in round-robin fashion

Eg2.c

• Block-Cyclic Mapping: Combination of block and cyclic

Eg3.c
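The files Eg1.c-Eg3.c are not included here; the small sketch below (task count, processor count, and block size are assumptions) simply prints which processor each task index would be assigned to under the three static mappings.

#include <stdio.h>

#define N 16   /* number of tasks/iterations */
#define P 4    /* number of processors */
#define B 2    /* block size for block-cyclic mapping */

int main(void) {
    for (int i = 0; i < N; i++) {
        int block_owner        = i / (N / P);   /* block: contiguous chunks of N/P */
        int cyclic_owner       = i % P;         /* cyclic: round-robin, one at a time */
        int block_cyclic_owner = (i / B) % P;   /* block-cyclic: round-robin in chunks of B */
        printf("task %2d -> block: P%d  cyclic: P%d  block-cyclic: P%d\n",
               i, block_owner, cyclic_owner, block_cyclic_owner);
    }
    return 0;
}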
Comparison Table

Mapping | Pros | Cons | Best Use Cases
Block | Good locality, low overhead | Poor load balancing if uneven | Regular, uniform computations
Cyclic | Excellent load balancing | Poor locality | Irregular computations
Block-Cyclic | Balances locality and load balancing | Slightly more complex | Moderately irregular workloads
2. Dynamic Mapping
Assignment occurs during runtime

• Better for irregular or unpredictable workloads

Types:

• Centralized Task Queue: Master maintains task queue, workers request tasks

• Distributed Task Queue: Each processor maintains its own task queue

• Work Stealing: Idle processors steal work from busy ones


Decomposition Techniques
• So how does one decompose a task into various subtasks?

• While there is no single recipe that works for all


problems, we present a set of commonly used techniques
that apply to broad classes of problems. These include:
• recursive decomposition

• data decomposition

• exploratory decomposition

• speculative decomposition
Recursive Decomposition: Example
• The problem of finding the minimum number in a given list (or
indeed any other associative operation such as sum, AND, etc.) can be
fashioned as a divide-and-conquer algorithm. The following algorithm
illustrates this.

• We first start with a simple serial loop for computing the minimum entry
in a given list; this loop can be rewritten as the following recursive
procedure:
procedure RECURSIVE MIN (A, n)
begin
  if (n = 1) then
    min := A[0];
  else
    lmin := RECURSIVE MIN (A, n/2);
    rmin := RECURSIVE MIN (&(A[n/2]), n − n/2);
    if (lmin < rmin) then
      min := lmin;
    else
      min := rmin;
    endelse;
  endelse;
  return min;
end RECURSIVE MIN
The code in the previous foil can be decomposed naturally using
a recursive decomposition strategy. We illustrate this with the
following example of finding the minimum number in the set {4,
9, 1, 7, 8, 11, 2, 12}. The task dependency graph associated with
this computation is as follows:

min(1,2)

min(4,1) min(8,2)

min(4,9) min(1,7) min(8,11) min(2,12)


Data Decomposition
• Data decomposition is a fundamental technique in parallel computing
where a large dataset or problem is divided into smaller parts that can be
processed concurrently by different processors or threads.
• The goal is to distribute the computational workload evenly across
available resources to achieve speedup.

• Types of Data Decomposition


• Domain Decomposition: Dividing the data space into subdomains

• Functional Decomposition: Dividing based on operations to be


performed

• Recursive Decomposition: Dividing problems into subproblems


recursively
Example: Parallel Vector Addition in C
• Eg4.c
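Eg4.c is not reproduced in these notes; a minimal OpenMP sketch of data-decomposed vector addition (array size and names are assumptions), in which each thread receives a contiguous block of the arrays, might look like this:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1000000

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Data decomposition: the iteration space (and hence the arrays) is split
       into contiguous blocks, one per thread, by the static schedule. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);   /* expect 3*(N-1) */
    free(a); free(b); free(c);
    return 0;
}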
• Benefits of This Approach
• Load Balancing: Work is evenly distributed among threads
• Memory Locality: Each thread works on contiguous memory locations
• Scalability: Can take advantage of multiple cores/processors
• Efficiency: Reduces communication overhead as each thread has its own
data segment
Metrics for Parallel Algorithm Design
1. Speedup (S)
• Definition: Ratio of sequential execution time to parallel execution
time.
Formula: S = T_sequential / T_parallel

Demo eg5.c
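eg5.c is not included here; the sketch below (workload and sizes are illustrative) shows one way to measure speedup by timing the same loop serially and with OpenMP:

#include <stdio.h>
#include <math.h>
#include <omp.h>

#define N 50000000

int main(void) {
    double sum;

    /* Sequential timing */
    double t0 = omp_get_wtime();
    sum = 0.0;
    for (int i = 0; i < N; i++) sum += sin(i * 1e-6);
    double t_seq = omp_get_wtime() - t0;

    /* Parallel timing of the same computation */
    t0 = omp_get_wtime();
    sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) sum += sin(i * 1e-6);
    double t_par = omp_get_wtime() - t0;

    printf("T_seq = %.3f s, T_par = %.3f s, speedup = %.2f\n",
           t_seq, t_par, t_seq / t_par);
    return 0;
}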
Continue…
2. Efficiency (E)
• Definition: Speedup per processor (how well resources are
utilized).

Formula: E = S / P (where P is number of processors)

Demo :eg6.c
Continue…
3. Scalability
Definition: How performance changes with increasing processors.

Strong Scaling (fixed problem size)

Demo : eg7.c
Continue…
4. Load Balance
Definition: How evenly work is distributed among processors.

Demo : eg8.c
Continue…
5. Communication Overhead
Definition: Time spent transferring data between processes.

MPI Example (Matrix-Vector Multiplication)

Demo : eg19.c
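eg19.c is not reproduced here; below is a minimal MPI sketch of a row-block matrix-vector multiplication (matrix size, data, and names are assumptions) in which the communication (broadcast of the vector and gather of the result) is timed separately from the local computation:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 512   /* matrix is N x N; assumed divisible by the number of processes */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int rows = N / p;                                   /* row block per process */
    double *x = malloc(N * sizeof(double));
    double *Ablock = malloc((size_t)rows * N * sizeof(double));
    double *yblock = malloc(rows * sizeof(double));
    double *y = (rank == 0) ? malloc(N * sizeof(double)) : NULL;

    if (rank == 0)
        for (int i = 0; i < N; i++) x[i] = 1.0;         /* sample vector */
    for (int i = 0; i < rows * N; i++) Ablock[i] = 1.0; /* each rank fills its rows */

    double t_start = MPI_Wtime(), t_comm = 0.0, t0;

    t0 = MPI_Wtime();
    MPI_Bcast(x, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);     /* communication: vector */
    t_comm += MPI_Wtime() - t0;

    for (int i = 0; i < rows; i++) {                    /* local computation */
        yblock[i] = 0.0;
        for (int j = 0; j < N; j++)
            yblock[i] += Ablock[i * N + j] * x[j];
    }

    t0 = MPI_Wtime();
    MPI_Gather(yblock, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    t_comm += MPI_Wtime() - t0;                         /* communication: result */

    if (rank == 0)
        printf("y[0] = %f, total = %f s, communication = %f s\n",
               y[0], MPI_Wtime() - t_start, t_comm);

    free(x); free(Ablock); free(yblock); free(y);       /* free(NULL) is a no-op */
    MPI_Finalize();
    return 0;
}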
6. Granularity
Definition: Ratio of computation to communication.
Fine-grained vs. coarse-grained examples:

Demo eg9.c
Continue…
7. Overhead
Definition: Extra work beyond sequential algorithm.

Demo: eg10.c
Metric Summary Table

METRIC | FORMULA | IDEAL VALUE | MEASUREMENT METHOD
Speedup | T_seq/T_par | P (linear) | Time both versions
Efficiency | S/P | 100% | Speedup/processors
Scalability | - | Maintain efficiency | Vary processors
Load Balance | - | Equal work/time | Measure per-thread work
Comm. Overhead | T_comm/T_total | Near 0% | Isolate communication
Granularity | T_comp/T_sync | High ratio | Profile computation vs sync
Overhead | (T_par*P - T_seq)/T_seq | Near 0% | Compare scaled times


Problem 1: Speedup & Efficiency
You parallelize a matrix multiplication algorithm (C = A × B) using
OpenMP. The sequential version takes 120 sec for N=2000. When
running on 4 cores, the parallel version takes 40 sec.

Calculate the speedup and efficiency.

If the efficiency drops to 60% when using 8 cores, what is the new
parallel runtime?
Solution
• Speedup (S) = T_seq / T_par = 120 / 40 = 3
• Efficiency (E) = S / P = 3 / 4 = 0.75 (75%)
• If efficiency is 60% at 8 cores:
• E = S / P ⇒ 0.6 = S / 8 ⇒ S = 4.8
• S = T_seq / T_par ⇒ 4.8 = 120 / T_par ⇒ T_par = 25 sec
Load Balancing
• An OpenMP parallel loop processes an array of size N=100 with 4
threads. The work per element is uneven:
• Elements 0-49: 1 ms each
• Elements 50-99: 10 ms each

• Compare static (default) and dynamic scheduling:

• What is the worst-case runtime for static scheduling?

• How does dynamic scheduling improve this?


Solution
Static scheduling (default block distribution):
• Thread 0: 0-24 (25 elements × 1 ms = 25 ms)
• Thread 1: 25-49 (25 elements × 1 ms = 25 ms)
• Thread 2: 50-74 (25 elements × 10 ms = 250 ms)
• Thread 3: 75-99 (25 elements × 10 ms = 250 ms)
Worst-case runtime = 250 ms (due to imbalance).
• Dynamic scheduling (chunk size=1):
• Threads grab elements one-by-one.
• Fast threads process more 1 ms tasks; slow threads process fewer 10 ms
tasks.
• Runtime ≈ (50×1 + 50×10)/4 = 137.5 ms (better balance).
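A small OpenMP sketch of this scenario (work simulated with usleep; the per-element times are taken from the problem statement): changing the schedule clause from static to dynamic is the only modification needed to compare the two runtimes.

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

#define N 100

int main(void) {
    double t0 = omp_get_wtime();

    /* Replace schedule(static) with schedule(dynamic, 1) to compare runtimes. */
    #pragma omp parallel for num_threads(4) schedule(static)
    for (int i = 0; i < N; i++) {
        /* elements 0-49 take ~1 ms each, elements 50-99 take ~10 ms each */
        usleep(i < 50 ? 1000 : 10000);
    }

    printf("elapsed = %.1f ms\n", (omp_get_wtime() - t0) * 1000.0);
    return 0;
}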
Communication Overhead (MPI)
• An MPI program computes a distributed vector dot product. Each
of 4 processes holds 250,000 elements. The computation time
per element is 1 µs, and each process sends 1 KB of data to rank 0
for reduction. The network latency is 50 µs, and bandwidth is 100
MB/s.

• Calculate the total runtime (computation + communication).

• What % of time is spent on communication?


Solution
• Computation time = 250,000 × 1 µs = 0.25 sec
• Communication time:
• Latency: 50 µs
• Transfer time: 1 KB / 100 MB/s = 0.01 ms
• Total per process: 50 µs + 10 µs = 60 µs
• Total runtime ≈ 0.25 sec + 60 µs ≈ 0.25006 sec
• Communication % = (60 µs / 0.25 sec) × 100 ≈ 0.024%
Scalability
• A parallel algorithm shows the following strong scaling results:
Cores Runtime (sec)
1 100
2 55
4 32
8 20

• Calculate speedup and efficiency for each case.

• Is the system scaling well? Justify.


• Calculations:
Cores Speedup (S) Efficiency (E)
1 1.0 100%
2 100/55=1.82 1.82/2=91%
4 100/32=3.13 3.13/4=78%
8 100/20=5.0 5.0/8=62.5%

• Scaling analysis:

• Efficiency drops as cores increase (due to overhead).

• Sub-linear scaling but reasonable up to 4 cores.


Granularity & Overhead
• A task-parallel program processes 10,000 tasks on 8 cores. Each
task takes 50 µs to compute. The scheduler assigns tasks in
chunks of K (varied below). Task scheduling overhead is 10 µs per
chunk.
Chunk Size (K) Avg. Runtime (sec)
1 0.75
10 0.15
100 0.12
• Derive the theoretical runtime formula in terms of K.
• Why does K=100 not improve over K=10 significantly?
• What is the optimal chunk size?
Theoretical runtime:
• Total chunks = 10,000 / K
• Scheduling overhead = (10,000 / K) × 10 µs
• Computation time = 10,000 × 50 µs / 8 cores = 0.0625 sec
• Runtime = max(overhead, computation)
K=100 vs K=10:
• At K=10: Overhead = (10,000/10)×10 µs = 0.01 sec (dominated by
computation).
• At K=100: Overhead = 0.001 sec (negligible).
• Further increasing K provides diminishing returns.
Optimal chunk size:
• Balance overhead and load balancing.
• K=10 is optimal here (low overhead, good balance).
Summary

Question | Metric Covered | Key Insight
1 | Speedup/Efficiency | Tradeoff between cores and efficiency
2 | Load Balancing | Dynamic scheduling improves balance
3 | Communication Overhead | Latency vs bandwidth impact
4 | Scalability | Sub-linear scaling trends
5 | Granularity & Overhead | Optimal chunk size selection
Resources
• ISPC: Intel® Implicit SPMD Program Compiler Example
• A Grama, A Gupta, G Karypis, V Kumar. Introduction to Parallel
Computing, Addison Wesley (2003). Chapter 3—3.2.2, 3.3 and 3.4
until block-cyclic distribution.

• https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial
Thank You
