Chapter 3 - Principles of Parallel Algorithm Design
Outline
Overview of some Serial Algorithms
Parallel Algorithm vs. Parallel Formulation
Elements of a Parallel Algorithm/Formulation
Common Decomposition Methods (the concurrency extractors!)
Common Mapping Methods (the overhead reducers!)
Gaussian Elimination
Quicksort
Minimum Finding
15-Puzzle Problem
Parallel Formulation
Refers to a parallelization of a serial algorithm.
Parallel Algorithm
May represent an entirely different algorithm than the one used serially.
Our goal today is primarily to discuss how to develop such parallel formulations. Of course, there will always be examples of parallel algorithms that were not derived from serial algorithms.
Mapping of the tasks onto multiple processors
Distribution of input/output & intermediate data across the different processors
Management of the access to shared data
Holy Grail: Maximize concurrency and reduce overheads due to parallelization! Maximize potential speedup!
Decomposition: the process of dividing the computation into smaller pieces of work, i.e., tasks.
Task-Dependency Graph
A task(s) can only start once some other task(s) have finished (e.g., producer-consumer relationships).
Degree of Concurrency: the number of tasks that can be executed in parallel at any given time.
Critical Path: the longest (weighted) path in the task-dependency graph; its length is a lower bound on the parallel execution time.
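These two graph metrics can be computed mechanically. The sketch below uses a hypothetical task graph (the tasks, dependencies, and weights are made up for illustration, not taken from the text) and level-by-level scheduling to estimate the maximum degree of concurrency:

```python
# Hypothetical task-dependency graph: task -> set of tasks it depends on.
deps = {
    "a": set(), "b": set(),
    "c": {"a", "b"},
    "d": {"c"}, "e": {"c"},
}
weight = {"a": 2, "b": 3, "c": 1, "d": 4, "e": 2}  # task execution times

def critical_path_length(deps, weight):
    """Longest weighted path: a lower bound on parallel runtime."""
    memo = {}
    def finish(t):  # earliest possible finish time of task t
        if t not in memo:
            memo[t] = weight[t] + max((finish(p) for p in deps[t]), default=0)
        return memo[t]
    return max(finish(t) for t in deps)

def max_degree_of_concurrency(deps):
    """Max number of simultaneously ready tasks, level by level."""
    remaining, done, best = dict(deps), set(), 0
    while remaining:
        ready = [t for t, ps in remaining.items() if ps <= done]
        best = max(best, len(ready))
        done |= set(ready)
        for t in ready:
            del remaining[t]
    return best
```

For this toy graph the critical path is a/b -> c -> d with length 3 + 1 + 4 = 8, and at most two tasks are ever ready at once.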
Task-Interaction Graph
Captures the pattern of interactions (e.g., data sharing) among tasks.
These graphs are important in developing an effective mapping of the tasks onto the different processors.
Goal: maximize concurrency and minimize overheads.
Recursive Decomposition
Suitable for problems that can be solved using the divide-and-conquer paradigm.
Each of the subproblems generated by the divide step becomes a task.
Example: Quicksort
Note that we can obtain divide-and-conquer algorithms for problems that are traditionally solved using non-divide-and-conquer approaches.
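In the quicksort example, each divide step yields two independent tasks (the two partitions). A minimal sketch of this recursive decomposition, using plain Python threads and a made-up depth cutoff to limit task creation (not the text's implementation):

```python
import threading

def quicksort(a, depth=2):
    """Recursive decomposition: the two partitions produced by the divide
    step are independent tasks and can run concurrently."""
    if len(a) <= 1:
        return a
    pivot = a[0]
    lo = [x for x in a[1:] if x < pivot]
    hi = [x for x in a[1:] if x >= pivot]
    if depth > 0:
        result = {}
        # Spawn one subproblem as a separate task; do the other locally.
        t = threading.Thread(
            target=lambda: result.setdefault("lo", quicksort(lo, depth - 1)))
        t.start()
        result["hi"] = quicksort(hi, depth - 1)
        t.join()
        return result["lo"] + [pivot] + result["hi"]
    return quicksort(lo, 0) + [pivot] + quicksort(hi, 0)
```

The depth cutoff reflects a practical concern: near the leaves the tasks are too small to be worth spawning.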
Data Decomposition
Used to derive concurrency for problems that operate on large amounts of data.
The idea is to derive the tasks by focusing on the multiplicity of data.
Data decomposition is often performed in two steps:
Step 1: Partition the data (input, output, and/or intermediate data).
Step 2: Induce a computational partitioning from the data partitioning.
Owner-computes rule: the process assigned a particular data item is responsible for all computation associated with it.
Almost all parallel processing is applied to problems that have a lot of data; splitting the work based on this data is the natural way to extract a high degree of concurrency.
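A minimal sketch of the two steps and the owner-computes rule, assuming a 1D block partition of an array among p processes (the loop over ranks stands in for p real processes; all names are illustrative):

```python
def block_range(n, p, rank):
    """Step 1: elements owned by process `rank` in a 1D block partition.
    The first n % p processes get one extra element."""
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    return lo, lo + base + (1 if rank < extra else 0)

n, p = 10, 3
data = list(range(n))
out = [0] * n
for rank in range(p):            # each iteration stands in for one process
    lo, hi = block_range(n, p, rank)
    for i in range(lo, hi):      # Step 2 / owner computes: a process only
        out[i] = data[i] ** 2    # updates the elements it owns
```

The computational partitioning here is induced directly from the data partitioning: no process ever touches an index outside its owned range.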
Exploratory Decomposition
Used for problems whose underlying computation corresponds to a search of a space of solutions (e.g., the 15-Puzzle).
It is not as general purpose.
It can result in speedup anomalies: engineered slow-down or superlinear speedup.
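A minimal sketch of exploratory decomposition: independent tasks explore disjoint parts of a search space, and the first hit cancels the rest. The search space, goal test, and four-way split below are all made up for illustration. Speedup anomalies arise because the parallel tasks may collectively examine fewer states than the serial search order would (superlinear speedup) or more (slow-down):

```python
import threading

def explore(part, target, found, result):
    """One exploratory task: scan its share of the search space."""
    for state in part:
        if found.is_set():       # another task already found a solution
            return
        if state == target:      # toy "goal test"
            result.append(state)
            found.set()          # signal the other tasks to stop

space = list(range(1000))
parts = [space[i::4] for i in range(4)]   # 4 disjoint exploratory tasks
found, result = threading.Event(), []
threads = [threading.Thread(target=explore, args=(p, 731, found, result))
           for p in parts]
for t in threads: t.start()
for t in threads: t.join()
```

Because the parts are disjoint, exactly one task can succeed; the others terminate early once the event is set.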
Speculative Decomposition
Used to extract concurrency in problems in which the next step is one of many possible actions that can only be determined when the current task finishes. This decomposition assumes a certain outcome of the currently executed task and executes some of the next steps.
Similar to speculative execution in microprocessors.
If the speculation is wrong, there is a state-restoring overhead, and the speculative memory/computations are wasted.
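A minimal sketch of the idea: while the "current task" (here a hypothetical slow predicate) is still running, both possible next steps are executed speculatively, and the losing branch's result is simply discarded (that is the wasted computation). All function names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def slow_predicate(x):          # the current task, whose outcome we await
    return x % 2 == 0

def step_if_true(x):            # next step if the predicate holds
    return x // 2

def step_if_false(x):           # next step otherwise
    return 3 * x + 1

def speculative_step(x):
    with ThreadPoolExecutor(max_workers=3) as pool:
        pred = pool.submit(slow_predicate, x)
        spec_t = pool.submit(step_if_true, x)    # speculative work
        spec_f = pool.submit(step_if_false, x)   # speculative work
        # Keep the branch matching the actual outcome; the other is wasted.
        return spec_t.result() if pred.result() else spec_f.result()
```

Here discarding a result is free; in a real system, a wrong guess may also require undoing side effects (the state-restoring overhead above).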
If Tp is the parallel runtime on p processors and Ts is the serial runtime, then the total overhead To is To = p*Tp - Ts.
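A quick numeric check of this formula, with made-up timings (Ts = 100, p = 4, Tp = 30):

```python
# Hypothetical timings, chosen only to illustrate the overhead formula.
Ts, p, Tp = 100.0, 4, 30.0

To = p * Tp - Ts          # total overhead: 4*30 - 100 = 20 time units
speedup = Ts / Tp         # 100/30 = 3.33x on 4 processors
efficiency = speedup / p  # 0.83: the shortfall from 1.0 reflects To
```

Note that To = 0 exactly when the speedup is the ideal p.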
Overhead sources: the work done by the parallel system beyond that required by the serial system:
Load imbalance
Inter-process communication (coordination/synchronization/data-sharing)
Proper mapping needs to take into account the task-dependency and interaction graphs
Task characteristics: Static vs. dynamic task generation? Are the tasks uniform or non-uniform? Do we know them a priori? How much data is associated with each task?
Interaction characteristics: What are the interaction patterns between the tasks? Are they static or dynamic? Do we know them a priori? Are they data-instance dependent? Are they regular or irregular? Are they read-only or read-write?
Depending on the above characteristics, different mapping techniques of different complexity and cost are required.
Be aware: the assignment of tasks whose aggregate computational requirements are the same does not automatically ensure load balance.
Each processor is assigned three tasks, but (a) is better than (b)!
Static Mapping
The tasks are distributed among the processors prior to the execution.
Applicable for tasks that are generated statically and whose sizes are known.
Dynamic Mapping
The tasks are distributed among the processors during the execution of the algorithm (i.e., tasks & data are migrated).
Applicable for tasks that are generated dynamically or whose sizes are not known a priori.
Mappings based on data decomposition: the underlying input/output/intermediate data are in the form of arrays, partitioned in a 1D/2D/3D fashion.
Gaussian Elimination
The active portion of the array shrinks as the computations progress
Another example: matrix-matrix multiplication.
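The Gaussian elimination point can be made concrete by comparing a block distribution of rows with a cyclic one. In the sketch below (sizes and helper names are made up), after elimination step k only rows k..n-1 are active; a block distribution leaves the low-rank processes idle, while a cyclic distribution keeps the remaining work spread across all of them:

```python
def block_owner(row, n, p):
    """Owner of `row` under a 1D block distribution (ceil(n/p)-sized blocks)."""
    return min(row // -(-n // p), p - 1)

def cyclic_owner(row, n, p):
    """Owner of `row` under a 1D cyclic distribution."""
    return row % p

def active_rows_per_proc(owner, n, p, k):
    """How many still-active rows (index >= k) each process owns."""
    counts = [0] * p
    for r in range(k, n):
        counts[owner(r, n, p)] += 1
    return counts
```

With n = 8 rows, p = 4 processes, and the first k = 4 rows retired, the block distribution gives [0, 0, 2, 2] active rows per process while the cyclic one gives [1, 1, 1, 1].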
Graph Partitioning
Centralized Schemes
Issue: the central pool of work can become a bottleneck as the number of processors increases.
Distributed Schemes
How do the processors get paired?
Who initiates the work transfer? (push vs. pull)
How much work is transferred?
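One common set of answers to these three questions is random pairing, pull (receiver-initiated) transfer, and stealing half of the victim's queue. The sketch below simulates that policy with plain Python queues; the round-robin sweep stands in for processes running concurrently, and all names are illustrative:

```python
import random
from collections import deque

def run(queues, seed=0):
    """Simulate a distributed, pull-based dynamic mapping scheme:
    an idle process pairs with a random partner and steals half its work."""
    rng = random.Random(seed)
    p, done = len(queues), 0
    while any(queues):
        for me in range(p):
            if queues[me]:
                queues[me].popleft()            # execute one of our tasks
                done += 1
            else:                                # idle: initiate a pull
                victim = rng.randrange(p)        # random pairing
                if victim != me and queues[victim]:
                    n = max(1, len(queues[victim]) // 2)  # steal half
                    for _ in range(n):           # transfer n tasks
                        queues[me].append(queues[victim].pop())
    return done

queues = [deque(range(16)), deque(), deque(), deque()]  # all work on proc 0
processed = run(queues)
```

Stealing from the tail (`pop`) rather than the head is a common design choice: the owner and the thief then work on opposite ends of the queue, reducing contention.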