
Lecture 4: Principles of Parallel Algorithm Design (Part 4)

Mapping Techniques for Load Balancing

• Sources of overheads:
– Inter-process interaction
– Idling
• Goals to achieve:
– To reduce interaction time
– To reduce the total amount of time processes spend idle
– Remark: these two goals often conflict
• Classes of mapping:
– Static
– Dynamic
Schemes for Static Mapping

• Mapping Based on Data Partitioning
• Task Graph Partitioning
• Hybrid Strategies

Mapping Based on Data Partitioning

• By the owner-computes rule, mapping the relevant data onto processes is equivalent to mapping tasks onto processes
• Arrays or matrices
– Block distributions
– Cyclic and block-cyclic distributions
• Irregular Data
– Example: data associated with an unstructured mesh
– Graph partitioning

1D Block Distribution
Example: distribute rows or columns of the matrix to different processes

Multi-D Block Distribution
Example: distribute blocks of the matrix to different processes

Load-Balance for Block Distribution

Example: 𝑛 × 𝑛 dense matrix multiplication 𝐶 = 𝐴 × 𝐵 using 𝑝 processes
– Decomposition based on output data.
– Each entry of 𝐶 uses the same amount of computation.
– Either a 1D or a 2D block distribution can be used (see the sketch below):
• 1D distribution: 𝑛/𝑝 rows are assigned to each process
• 2D distribution: a block of size 𝑛/√𝑝 × 𝑛/√𝑝 is assigned to each process
– A multi-dimensional distribution allows a higher degree of concurrency.
– A multi-dimensional distribution can also help to reduce interactions.

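A minimal sketch, assuming 𝑝 divides 𝑛 and (for the 2D case) that 𝑝 is a perfect square, of which process owns entry 𝐶[𝑖][𝑗] under each distribution; the helper names owner_1d and owner_2d are illustrative, not from the lecture.

```python
import math

def owner_1d(i, j, n, p):
    # 1D block distribution: each process owns n/p consecutive rows of C.
    return i // (n // p)

def owner_2d(i, j, n, p):
    # 2D block distribution: each process owns one (n/sqrt(p)) x (n/sqrt(p)) block of C.
    q = math.isqrt(p)              # processes arranged as a q x q grid
    block = n // q
    return (i // block) * q + (j // block)

if __name__ == "__main__":
    n, p = 8, 4
    for name, owner in (("1D", owner_1d), ("2D", owner_2d)):
        print(name, "distribution:")
        for i in range(n):
            print(" ".join(str(owner(i, j, n, p)) for j in range(n)))
```

The 1D scheme can keep at most 𝑛 processes busy (one per row of 𝐶), whereas the 2D scheme can use up to 𝑛² processes (one per entry), which is the higher degree of concurrency noted above.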
Cyclic and Block Cyclic Distributions

• If the amount of work differs for different entries of a matrix, a block distribution can lead to load imbalance.
• Example: Doolittle's method of LU factorization of a dense matrix

Doolittle’s method of LU factorization

$$
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
= LU =
\begin{pmatrix}
1 & 0 & \cdots & 0 \\
l_{21} & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
l_{n1} & l_{n2} & \cdots & 1
\end{pmatrix}
\begin{pmatrix}
u_{11} & u_{12} & \cdots & u_{1n} \\
0 & u_{22} & \cdots & u_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & u_{nn}
\end{pmatrix}
$$

By matrix-matrix multiplication:

$u_{1j} = a_{1j}$, $j = 1, 2, \ldots, n$ (1st row of $U$)
$l_{j1} = a_{j1} / u_{11}$, $j = 1, 2, \ldots, n$ (1st column of $L$)

For $i = 2, 3, \ldots, n-1$ do
  $u_{ii} = a_{ii} - \sum_{t=1}^{i-1} l_{it} u_{ti}$
  $u_{ij} = a_{ij} - \sum_{t=1}^{i-1} l_{it} u_{tj}$ for $j = i+1, \ldots, n$ ($i$th row of $U$)
  $l_{ji} = \dfrac{a_{ji} - \sum_{t=1}^{i-1} l_{jt} u_{ti}}{u_{ii}}$ for $j = i+1, \ldots, n$ ($i$th column of $L$)
End

$u_{nn} = a_{nn} - \sum_{t=1}^{n-1} l_{nt} u_{tn}$

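These recurrences translate directly into a short serial routine. This is only a minimal sketch: it assumes a square matrix whose pivots $u_{ii}$ are nonzero (no pivoting), it folds the first row/column and the final $u_{nn}$ into one loop, and the name doolittle_lu is illustrative.

```python
def doolittle_lu(a):
    # Doolittle factorization A = L U with a unit diagonal on L (no pivoting).
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1.0
        # i-th row of U
        for j in range(i, n):
            U[i][j] = a[i][j] - sum(L[i][t] * U[t][j] for t in range(i))
        # i-th column of L
        for j in range(i + 1, n):
            L[j][i] = (a[j][i] - sum(L[j][t] * U[t][i] for t in range(i))) / U[i][i]
    return L, U

if __name__ == "__main__":
    L, U = doolittle_lu([[4.0, 3.0], [6.0, 3.0]])
    print(L)   # [[1.0, 0.0], [1.5, 1.0]]
    print(U)   # [[4.0, 3.0], [0.0, -1.5]]
```

Note how the sums grow with 𝑖: entries in later rows and columns of 𝐿 and 𝑈 require more work, which is exactly why a plain block distribution of this computation is load-imbalanced, as the following slides show.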
Serial Column-Based LU

• Remark: Matrices L and U share space with A


Work used to compute Entries of L and U

• Block distribution of LU factorization tasks
leads to load imbalance.

Block-Cyclic Distribution

• A variation of the block distribution that can be used to alleviate load imbalance.

• Steps
1. Partition the array into many more blocks than the number of available processes
2. Assign blocks to processes in a round-robin manner so that each process gets several non-adjacent blocks

(a) The rows of the array are grouped into blocks each consisting of two rows,
resulting in eight blocks of rows. These blocks are distributed to four processes
in a wraparound fashion.
(b) The matrix is blocked into 16 blocks each of size 4×4, and it is mapped onto a
2×2 grid of processes in a wraparound fashion.
• Cyclic distribution: the special case in which the block size is 1 (see the sketch below)
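A minimal sketch of the round-robin assignment in steps 1–2 above, for a 1D block-cyclic distribution of rows; the block size b and the helper name are illustrative. Setting b = 1 reproduces the cyclic distribution.

```python
def block_cyclic_owner(row, b, p):
    # Rows are grouped into blocks of b consecutive rows; the blocks are
    # dealt out to the p processes in round-robin (wraparound) order.
    return (row // b) % p

if __name__ == "__main__":
    n, p = 16, 4
    print([block_cyclic_owner(r, 2, p) for r in range(n)])
    # [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]   (block-cyclic, b = 2)
    print([block_cyclic_owner(r, 1, p) for r in range(n)])
    # [0, 1, 2, 3, 0, 1, 2, 3, ...]                      (cyclic, b = 1)
```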
Graph Partitioning
• Assign an equal number of nodes (or cells) to each process
• Minimize the number of edges cut by the partition (a minimal edge-cut check is sketched below)

Figure: random partitioning vs. partitioning that minimizes edge count

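A minimal check, not from the lecture, of the two objectives above for a given partition: it counts nodes per process and the number of edges whose endpoints land on different processes (the edge cut). The toy 2 × 3 grid graph and the function name are illustrative.

```python
from collections import Counter

def partition_quality(edges, part):
    # part[v] is the process assigned to node v; edges is a list of (u, v) pairs.
    edge_cut = sum(1 for u, v in edges if part[u] != part[v])
    nodes_per_process = dict(Counter(part.values()))
    return edge_cut, nodes_per_process

if __name__ == "__main__":
    # A 2 x 3 grid graph: nodes 0, 1, 2 in the top row, 3, 4, 5 below.
    edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
    random_part = {0: 0, 4: 0, 2: 0, 1: 1, 3: 1, 5: 1}   # a "random" assignment
    row_part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}      # keep each row together
    print(partition_quality(edges, random_part))   # (7, {0: 3, 1: 3})
    print(partition_quality(edges, row_part))      # (3, {0: 3, 1: 3})
```

Both partitions are balanced (3 nodes per process), but the second cuts far fewer edges, mirroring the random vs. minimized-edge-count comparison in the figure.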
Mappings Based on Task Partitioning

• Mapping based on task partitioning can be used when the computation is naturally expressed in the form of a static task-dependency graph with tasks of known sizes
• Finding an optimal mapping that minimizes both idle time and interaction time is NP-complete
• Heuristic solutions exist for many structured
graphs

Mapping a Binary Tree Task-Dependency Graph

• Example: the task-dependency graph for finding the minimum of a list of numbers

• Mapping the tree graph onto 8 processes


• The mapping minimizes interaction overhead by placing many interdependent tasks on the same process (i.e., process 0) and the others on processes only one communication link away from each other
• Idling exists; this is inherent in the task-dependency graph
Mapping a Sparse Graph
Example: sparse matrix-vector multiplication using 3 processes
• Row distribution of the matrix (see the sketch below)

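A minimal sketch under stated assumptions: a toy 6 × 6 sparse matrix (not the lecture's example), with rows of A and entries of b distributed block-wise over 3 processes; for each process it lists the entries of b that must be obtained from other processes, i.e., the interactions that a good partition tries to minimize.

```python
# Toy sparse matrix stored as {row: {col: value}}.
A = {
    0: {0: 2.0, 4: 1.0},
    1: {1: 3.0, 2: 1.0},
    2: {2: 4.0},
    3: {3: 1.0, 5: 2.0},
    4: {0: 1.0, 4: 5.0},
    5: {1: 2.0, 5: 3.0},
}
n, p = 6, 3

def owner(i):
    # Row i of A and entry i of b live on the same process (block distribution).
    return i // (n // p)

def remote_b_entries(proc):
    rows = [i for i in range(n) if owner(i) == proc]
    cols = {j for i in rows for j in A[i]}
    return sorted(j for j in cols if owner(j) != proc)

for proc in range(p):
    print(proc, remote_b_entries(proc))
# 0 [2, 4]
# 1 [5]
# 2 [0, 1]
```

Shrinking these lists by grouping strongly connected rows on the same process is exactly what the partitioning in the next bullet aims for.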
• Partitioning the task-interaction graph to reduce interaction overhead

Schemes for Dynamic Mapping

• When static mapping results in a highly imbalanced distribution of work among processes, or when the task-dependency graph is dynamic, use dynamic mapping
• The primary goal is to balance load – dynamic load balancing
– Example: dynamic load balancing for adaptive mesh refinement (AMR)
• Types
– Centralized
– Distributed

Centralized Dynamic Mapping

• Processes
– Master: manages the pool of available tasks
– Slave: depends on the master to obtain work
• Idea
– When a slave process has no work, it takes a portion of the available work from the master
– When a new task is generated, it is added to the pool of tasks in
the master process
• Potential problem
– When many processes are used, the master process may become a bottleneck
• Solution
– Chunk scheduling: every time a process runs out of work, it gets a chunk (a group of tasks) rather than a single task (see the sketch below)

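A minimal shared-memory sketch of the scheme above, with threads standing in for slave processes and a shared queue playing the master's task pool; the chunk size and names are illustrative.

```python
import queue
import threading

CHUNK = 4                         # illustrative chunk size
tasks = queue.Queue()
for t in range(20):               # 20 toy tasks; each task is just an integer
    tasks.put(t)

results, lock = [], threading.Lock()

def slave(pid):
    while True:
        chunk = []
        try:
            # Grab up to CHUNK tasks per visit to the master, so the central
            # pool is contended for less often than with one task at a time.
            for _ in range(CHUNK):
                chunk.append(tasks.get_nowait())
        except queue.Empty:
            pass
        if not chunk:
            return                # no work left: the slave terminates
        with lock:
            results.extend((pid, t * t) for t in chunk)   # "compute" each task

workers = [threading.Thread(target=slave, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(results), "tasks completed")   # 20 tasks completed
```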
Distributed Dynamic Mapping

• All processes are peers. Tasks are distributed among processes, which exchange tasks at run time to balance work
• Each process can send work to or receive work from other processes (a receiver-initiated work-stealing sketch follows this list)
– How are sending and receiving processes paired together?
– Is the work transfer initiated by the sender or the
receiver?
– How much work is transferred?
– When is the work transfer performed?

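A toy, round-based simulation (assumed, not from the lecture) of one set of answers to these questions: receiver-initiated transfer, a random victim, and a steal-half policy for how much work moves.

```python
import random

random.seed(0)
work = [40, 0, 0, 8]            # imbalanced initial work units per process
p = len(work)

for step in range(12):
    work = [max(w - 1, 0) for w in work]      # each process does one unit per step
    for i in range(p):
        if work[i] == 0:                      # receiver-initiated: an idle process acts
            victim = random.randrange(p)      # pairing: pick a random peer
            if victim != i and work[victim] > 1:
                stolen = work[victim] // 2    # amount: take half of the victim's work
                work[victim] -= stolen
                work[i] += stolen
    print(step, work)
```

Watching the printed lists converge toward similar values shows the load balancing out without any central coordinator.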
Techniques to Minimize Interaction Overheads

• Maximize data locality
– Maximize the reuse of recently accessed data
– Minimize the volume of data exchange
• Use a higher-dimensional distribution. Example: 2D block distribution for matrix multiplication
– Minimize the frequency of interactions
• Restructure the algorithm so that shared data are accessed and used in large chunks
• Combine messages between the same source-destination pair

Techniques to Minimize Interaction Overheads
• Minimize contention and hot spots
– Contention occurs when multiple tasks try to access the same resource concurrently: e.g., multiple processes sending messages to the same process, or multiple simultaneous accesses to the same memory block

• Using $C_{i,j} = \sum_{k=0}^{\sqrt{p}-1} A_{i,k} B_{k,j}$ causes contention. For example, $C_{0,0}, C_{0,1}, \ldots, C_{0,\sqrt{p}-1}$ all attempt to read $A_{0,0}$ at once.
• A contention-free formulation is
$$C_{i,j} = \sum_{k=0}^{\sqrt{p}-1} A_{i,\,(i+j+k)\bmod\sqrt{p}} \; B_{(i+j+k)\bmod\sqrt{p},\,j}$$
All tasks $P_{i,*}$ that work on the same row of $C$ access a different block $A_{i,\,(i+j+k)\bmod\sqrt{p}}$ at each step $k$ (a small check is sketched below).
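A small check, offered only as a sketch, that the staggered formulation is contention-free: at every step k, task (i, j) on a √p × √p process grid reads A block (i, (i+j+k) mod √p) and B block ((i+j+k) mod √p, j), and no block is read by two tasks in the same step.

```python
q = 4                                    # process grid is q x q, i.e. p = q*q = 16
for k in range(q):
    a_readers = {}                       # A block -> tasks reading it at this step
    b_readers = {}                       # B block -> tasks reading it at this step
    for i in range(q):
        for j in range(q):
            s = (i + j + k) % q
            a_readers.setdefault((i, s), []).append((i, j))
            b_readers.setdefault((s, j), []).append((i, j))
    assert all(len(readers) == 1 for readers in a_readers.values())
    assert all(len(readers) == 1 for readers in b_readers.values())
print("no block of A or B is read by two tasks in the same step")
```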
Techniques to Minimize Interaction Overheads
• Overlap computations with interactions
– Use non-blocking communication (a minimal overlap sketch follows this list)
• Replicate data or computations
– Keep a copy of shared data on each process if possible, so that the only interaction needed is the initial replication
• Use collective interaction operations
• Overlap interactions with other interactions
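A minimal shared-memory sketch of the overlap idea, with a worker thread standing in for a non-blocking receive: post the fetch of the next block, compute on the block already in hand, then wait for the fetch to complete. The fetch and compute functions are illustrative stand-ins.

```python
import concurrent.futures
import time

def fetch_remote_block(step):
    time.sleep(0.2)                        # stands in for communication latency
    return list(range(step * 4, step * 4 + 4))

def compute(block):
    time.sleep(0.2)                        # local work on data already in hand
    return sum(block)

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    block = fetch_remote_block(0)          # first block must be fetched up front
    total = 0
    for step in range(1, 4):
        pending = pool.submit(fetch_remote_block, step)   # "post" the next receive
        total += compute(block)            # overlap: compute while the fetch runs
        block = pending.result()           # "wait" for the fetch to complete
    total += compute(block)
print(total)                               # 120
```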

Parallel Algorithm Models
• Data parallel
– Each task performs similar operations on different data
– Typically statically map tasks to processes
• Task graph
– Use task dependency graph to promote locality or reduce
interactions
• Master-slave
– One or more master processes generate tasks
– Tasks are allocated to slave processes
– Allocation may be static or dynamic
• Pipeline/producer-consumer
– Pass a stream of data through a sequence of processes
– Each process performs some operation on the stream (see the sketch after this list)
• Hybrid
– Apply multiple models hierarchically, or apply multiple models
in sequence to different phases
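A minimal sketch of the pipeline/producer-consumer model with two stages connected by queues; each stage is a thread that transforms items from the incoming stream and passes them on, with None as an assumed end-of-stream sentinel.

```python
import queue
import threading

q01, q12, out = queue.Queue(), queue.Queue(), queue.Queue()

def stage(inp, outp, op):
    # Repeatedly take an item, apply this stage's operation, pass it downstream.
    while True:
        item = inp.get()
        if item is None:          # end of stream: forward the sentinel and stop
            outp.put(None)
            return
        outp.put(op(item))

s1 = threading.Thread(target=stage, args=(q01, q12, lambda x: x * x))
s2 = threading.Thread(target=stage, args=(q12, out, lambda x: x + 1))
s1.start()
s2.start()

for x in range(5):                # producer: feed a stream of data into stage 1
    q01.put(x)
q01.put(None)

results = []
while (item := out.get()) is not None:
    results.append(item)
print(results)                    # [1, 2, 5, 10, 17]
s1.join()
s2.join()
```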
