Chapter 3 - Principles of Parallel Algorithm Design
Outline
Overview of some Serial Algorithms
Parallel Algorithm vs. Parallel Formulation
Elements of a Parallel Algorithm/Formulation
Common Decomposition Methods (the concurrency extractors!)
Common Mapping Methods (the overhead reducers!)
Gaussian Elimination
Quicksort
Minimum Finding
15-Puzzle Problem
Parallel Formulation
Refers to a parallelization of a serial algorithm.
Parallel Algorithm
May represent an entirely different algorithm than the one used serially.
Our goal today is primarily to discuss how to develop such parallel formulations. Of course, there will always be examples of parallel algorithms that were not derived from serial algorithms.
Mapping of the tasks onto multiple processors
Distribution of input/output & intermediate data across the different processors
Management of the access to shared data
Holy Grail: Maximize concurrency and reduce overheads due to parallelization! Maximize potential speedup!
Decomposition: the process of dividing the computation into smaller pieces of work, i.e., tasks.
Task-Dependency Graph
A task(s) can only start once some other task(s) have finished (e.g., producer-consumer relationships).
Degree of Concurrency: the number of tasks that can be executed in parallel at any given time.
Critical Path: the longest (weighted) path in the task-dependency graph; its length is a lower bound on the parallel execution time.
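These two graph metrics can be computed mechanically. The sketch below uses a hypothetical task graph (the tasks, dependencies, and weights are made up for illustration, not taken from the text) and level-by-level scheduling to estimate the maximum degree of concurrency:

```python
# Hypothetical task-dependency graph: task -> set of tasks it depends on.
deps = {
    "a": set(), "b": set(),
    "c": {"a", "b"},
    "d": {"c"}, "e": {"c"},
}
weight = {"a": 2, "b": 3, "c": 1, "d": 4, "e": 2}  # task execution times

def critical_path_length(deps, weight):
    """Longest weighted path: a lower bound on parallel runtime."""
    memo = {}
    def finish(t):  # earliest possible finish time of task t
        if t not in memo:
            memo[t] = weight[t] + max((finish(p) for p in deps[t]), default=0)
        return memo[t]
    return max(finish(t) for t in deps)

def max_degree_of_concurrency(deps):
    """Max number of simultaneously ready tasks, level by level."""
    remaining, done, best = dict(deps), set(), 0
    while remaining:
        ready = [t for t, ps in remaining.items() if ps <= done]
        best = max(best, len(ready))
        done |= set(ready)
        for t in ready:
            del remaining[t]
    return best
```

For this toy graph the critical path is a/b -> c -> d with length 3 + 1 + 4 = 8, and at most two tasks are ever ready at once.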
Task-Interaction Graph
Captures the pattern of interactions (e.g., data sharing) among tasks.
These graphs are important in developing an effective mapping of the tasks onto the different processors.
Goal: maximize concurrency and minimize overheads.
Recursive Decomposition
Suitable for problems that can be solved using the divide-and-conquer paradigm.
Each of the subproblems generated by the divide step becomes a task.
Example: Quicksort
Note that we can obtain divide-and-conquer algorithms for problems that are traditionally solved using non-divide-and-conquer approaches.
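In the quicksort example, each divide step yields two independent tasks (the two partitions). A minimal sketch of this recursive decomposition, using plain Python threads and a made-up depth cutoff to limit task creation (not the text's implementation):

```python
import threading

def quicksort(a, depth=2):
    """Recursive decomposition: the two partitions produced by the divide
    step are independent tasks and can run concurrently."""
    if len(a) <= 1:
        return a
    pivot = a[0]
    lo = [x for x in a[1:] if x < pivot]
    hi = [x for x in a[1:] if x >= pivot]
    if depth > 0:
        result = {}
        # Spawn one subproblem as a separate task; do the other locally.
        t = threading.Thread(
            target=lambda: result.setdefault("lo", quicksort(lo, depth - 1)))
        t.start()
        result["hi"] = quicksort(hi, depth - 1)
        t.join()
        return result["lo"] + [pivot] + result["hi"]
    return quicksort(lo, 0) + [pivot] + quicksort(hi, 0)
```

The depth cutoff reflects a practical concern: near the leaves the tasks are too small to be worth spawning.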
Data Decomposition
Used to derive concurrency for problems that operate on large amounts of data.
The idea is to derive the tasks by focusing on the multiplicity of data.
Data decomposition is often performed in two steps:
Step 1: Partition the data (input, output, and/or intermediate data).
Step 2: Induce a computational partitioning from the data partitioning.
Owner-computes rule: the process assigned a particular data item is responsible for all computation associated with it.
Almost all parallel processing is applied to problems that have a lot of data; splitting the work based on this data is the natural way to extract a high degree of concurrency.
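A minimal sketch of the two steps and the owner-computes rule, assuming a 1D block partition of an array among p processes (the loop over ranks stands in for p real processes; all names are illustrative):

```python
def block_range(n, p, rank):
    """Step 1: elements owned by process `rank` in a 1D block partition.
    The first n % p processes get one extra element."""
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    return lo, lo + base + (1 if rank < extra else 0)

n, p = 10, 3
data = list(range(n))
out = [0] * n
for rank in range(p):            # each iteration stands in for one process
    lo, hi = block_range(n, p, rank)
    for i in range(lo, hi):      # Step 2 / owner computes: a process only
        out[i] = data[i] ** 2    # updates the elements it owns
```

The computational partitioning here is induced directly from the data partitioning: no process ever touches an index outside its owned range.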
Exploratory Decomposition
Used for problems whose underlying computation corresponds to a search of a space of solutions (e.g., the 15-Puzzle).
It is not as general purpose.
It can result in speedup anomalies: engineered slow-down or superlinear speedup.
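A minimal sketch of exploratory decomposition: independent tasks explore disjoint parts of a search space, and the first hit cancels the rest. The search space, goal test, and four-way split below are all made up for illustration. Speedup anomalies arise because the parallel tasks may collectively examine fewer states than the serial search order would (superlinear speedup) or more (slow-down):

```python
import threading

def explore(part, target, found, result):
    """One exploratory task: scan its share of the search space."""
    for state in part:
        if found.is_set():       # another task already found a solution
            return
        if state == target:      # toy "goal test"
            result.append(state)
            found.set()          # signal the other tasks to stop

space = list(range(1000))
parts = [space[i::4] for i in range(4)]   # 4 disjoint exploratory tasks
found, result = threading.Event(), []
threads = [threading.Thread(target=explore, args=(p, 731, found, result))
           for p in parts]
for t in threads: t.start()
for t in threads: t.join()
```

Because the parts are disjoint, exactly one task can succeed; the others terminate early once the event is set.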
Speculative Decomposition
Used to extract concurrency in problems in which the next step is one of many possible actions that can only be determined when the current task finishes. This decomposition assumes a certain outcome of the currently executed task and executes some of the next steps.
Similar to speculative execution in microprocessors.
If the speculation is wrong, there is a state-restoring overhead, and the speculative memory/computations are wasted.
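A minimal sketch of the idea: while the "current task" (here a hypothetical slow predicate) is still running, both possible next steps are executed speculatively, and the losing branch's result is simply discarded (that is the wasted computation). All function names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def slow_predicate(x):          # the current task, whose outcome we await
    return x % 2 == 0

def step_if_true(x):            # next step if the predicate holds
    return x // 2

def step_if_false(x):           # next step otherwise
    return 3 * x + 1

def speculative_step(x):
    with ThreadPoolExecutor(max_workers=3) as pool:
        pred = pool.submit(slow_predicate, x)
        spec_t = pool.submit(step_if_true, x)    # speculative work
        spec_f = pool.submit(step_if_false, x)   # speculative work
        # Keep the branch matching the actual outcome; the other is wasted.
        return spec_t.result() if pred.result() else spec_f.result()
```

Here discarding a result is free; in a real system, a wrong guess may also require undoing side effects (the state-restoring overhead above).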
If Tp is the parallel runtime on p processors and Ts is the serial runtime, then the total overhead To is To = p*Tp - Ts.
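A quick numeric check of this formula, with made-up timings (Ts = 100, p = 4, Tp = 30):

```python
# Hypothetical timings, chosen only to illustrate the overhead formula.
Ts, p, Tp = 100.0, 4, 30.0

To = p * Tp - Ts          # total overhead: 4*30 - 100 = 20 time units
speedup = Ts / Tp         # 100/30 = 3.33x on 4 processors
efficiency = speedup / p  # 0.83: the shortfall from 1.0 reflects To
```

Note that To = 0 exactly when the speedup is the ideal p.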
Overhead sources: the work done by the parallel system beyond that required by the serial system:
Load imbalance
Inter-process communication (coordination/synchronization/data-sharing)
Proper mapping needs to take into account the task-dependency and interaction graphs
Task characteristics: Static vs. dynamic task generation? Are the tasks uniform or non-uniform? Do we know them a priori? How much data is associated with each task?
Interaction characteristics: What are the interaction patterns between the tasks? Are they static or dynamic? Do we know them a priori? Are they data-instance dependent? Are they regular or irregular? Are they read-only or read-write?
Depending on the above characteristics, different mapping techniques of different complexity and cost are required.
Be aware: the assignment of tasks whose aggregate computational requirements are the same does not automatically ensure load balance.
Each processor is assigned three tasks, but (a) is better than (b)!
Static Mapping
The tasks are distributed among the processors prior to the execution.
Applicable for tasks that are generated statically and whose sizes are known.
Dynamic Mapping
The tasks are distributed among the processors during the execution of the algorithm (i.e., tasks & data are migrated).
Applicable for tasks that are generated dynamically or whose sizes are not known a priori.
Mappings based on data decomposition: the underlying input/output/intermediate data are in the form of arrays, partitioned in a 1D/2D/3D fashion.
Gaussian Elimination
The active portion of the array shrinks as the computations progress
Another example: matrix-matrix multiplication.
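The Gaussian elimination point can be made concrete by comparing a block distribution of rows with a cyclic one. In the sketch below (sizes and helper names are made up), after elimination step k only rows k..n-1 are active; a block distribution leaves the low-rank processes idle, while a cyclic distribution keeps the remaining work spread across all of them:

```python
def block_owner(row, n, p):
    """Owner of `row` under a 1D block distribution (ceil(n/p)-sized blocks)."""
    return min(row // -(-n // p), p - 1)

def cyclic_owner(row, n, p):
    """Owner of `row` under a 1D cyclic distribution."""
    return row % p

def active_rows_per_proc(owner, n, p, k):
    """How many still-active rows (index >= k) each process owns."""
    counts = [0] * p
    for r in range(k, n):
        counts[owner(r, n, p)] += 1
    return counts
```

With n = 8 rows, p = 4 processes, and the first k = 4 rows retired, the block distribution gives [0, 0, 2, 2] active rows per process while the cyclic one gives [1, 1, 1, 1].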
Graph Partitioning
Centralized Schemes
Issue: the central pool of work can become a bottleneck as the number of processors increases.
Distributed Schemes
How do the processors get paired?
Who initiates the work transfer? (push vs. pull)
How much work is transferred?
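One common set of answers to these three questions is random pairing, pull (receiver-initiated) transfer, and stealing half of the victim's queue. The sketch below simulates that policy with plain Python queues; the round-robin sweep stands in for processes running concurrently, and all names are illustrative:

```python
import random
from collections import deque

def run(queues, seed=0):
    """Simulate a distributed, pull-based dynamic mapping scheme:
    an idle process pairs with a random partner and steals half its work."""
    rng = random.Random(seed)
    p, done = len(queues), 0
    while any(queues):
        for me in range(p):
            if queues[me]:
                queues[me].popleft()            # execute one of our tasks
                done += 1
            else:                                # idle: initiate a pull
                victim = rng.randrange(p)        # random pairing
                if victim != me and queues[victim]:
                    n = max(1, len(queues[victim]) // 2)  # steal half
                    for _ in range(n):           # transfer n tasks
                        queues[me].append(queues[victim].pop())
    return done

queues = [deque(range(16)), deque(), deque(), deque()]  # all work on proc 0
processed = run(queues)
```

Stealing from the tail (`pop`) rather than the head is a common design choice: the owner and the thief then work on opposite ends of the queue, reducing contention.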