
Parallel Processing

Concurrency and Mapping


Samer Arandi
[email protected]

Parallel Processing 66523

Computer Engineering Department

An-Najah National University

Outline

Characteristics of tasks and interactions


task generation, granularity, and context
characteristics of task interactions

Mapping techniques for load balancing


static mappings
dynamic mappings

Methods for minimizing interaction overheads
Parallel algorithm design templates

Characteristics of Tasks
Key characteristics
generation strategy
associated work
size of associated data

Impact choice and performance of parallel algorithms


Task Generation
Static task generation
identify concurrent tasks a priori
typically, data or recursive decomposition leads to static task generation
examples:
- matrix operations
- graph algorithms
- image processing applications
- other regularly structured problems

Dynamic task generation


identify concurrent tasks as the computation unfolds (both the tasks and the dependency graph)
the rules governing the generation of tasks are known as part of the algorithm
typically a result of exploratory or speculative decompositions
examples:
- puzzle solving (15-puzzle)
- game playing

Note: recursive decomposition can also lead to dynamic task generation (quicksort), and exploratory decomposition can also lead to static task generation (15-puzzle).

Task Sizes
Task size: the amount of time required for completion
Uniform: all tasks have the same size (example?)
Non-uniform
sometimes sizes are known or can be estimated a priori, sometimes not
- example: tasks in quicksort
the size of each partition depends upon the pivot selected

Implications on mapping?

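The quicksort point can be made concrete with a small sketch (the recursive partitioning routine below is an illustration assumed for this purpose, not code from the slides): it records the sizes of the sub-array "tasks" produced by each partitioning step, which vary entirely with the pivots drawn.

```python
import random

def partition_sizes(data, sizes):
    """Partition recursively as quicksort would, recording the size of
    every sub-array 'task' that gets generated."""
    if len(data) <= 1:
        return
    pivot = data[0]                        # simple, pivot-sensitive choice
    left = [x for x in data[1:] if x < pivot]
    right = [x for x in data[1:] if x >= pivot]
    sizes.append((len(left), len(right)))  # sizes of the two new tasks
    partition_sizes(left, sizes)
    partition_sizes(right, sizes)

random.seed(0)
sizes = []
partition_sizes([random.randint(0, 99) for _ in range(32)], sizes)
print(sizes[:5])   # task sizes vary widely depending on the pivots drawn
```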

Size of Data Associated with Tasks


Data may be small or large compared to the computation
- size(input) < size(computation), e.g., 15-puzzle
- size(input) = size(computation) > size(output), e.g., finding the minimum
- size(input) = size(output) <= size(computation), e.g., sorting

Implications
small data: the task can easily migrate to another process
large data: ties the task to a process
- it may be possible to avoid communicating the task context by reconstructing/recomputing it elsewhere

Characteristics of Task Interactions


Orthogonal classification criteria

Static vs. dynamic
Regular vs. irregular
Read-only vs. read-write
One-sided vs. two-sided

Characteristics of Task Interactions


Static interactions
tasks and interactions are known a priori
simpler to code

Dynamic interactions
the timing or the set of interacting tasks cannot be determined a priori
harder to code
- especially using two-sided message-passing APIs

Characteristics of Task Interactions


Regular interactions
interactions have a pattern that can be described with a function
- e.g. mesh, ring

regular patterns can be exploited for efficient implementation


- e.g. schedule communication to avoid conflicts on network links

Irregular interactions
lack a well-defined topology
modeled by a graph

Static Regular Task Interaction Pattern


Image operations, e.g. edge detection
Nearest-neighbor interactions on a 2D mesh

Static Irregular Task Interaction Pattern


Sparse matrix-vector multiply

A task must scan its associated row(s) of A to determine which entries of the vector b it requires (this implies the tasks it needs to interact with)
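As a rough sketch (the CSR storage layout and the helper names below are illustrative assumptions, not taken from the slides), each per-row task scans its row of the sparse matrix to find the column indices, and hence the entries of b, that it must obtain from other tasks:

```python
# Sparse matrix in CSR form: row_ptr delimits each row's entries in col_idx/values.
row_ptr = [0, 2, 3, 5]          # 3 rows
col_idx = [0, 2, 1, 0, 2]       # column index of each stored nonzero
values  = [4.0, 1.0, 3.0, 2.0, 5.0]

def needed_b_entries(row):
    """Columns (entries of b) that the task owning 'row' must read."""
    return [col_idx[k] for k in range(row_ptr[row], row_ptr[row + 1])]

def row_task(row, b):
    """Compute one entry of y = A*b once the needed b entries are available."""
    return sum(values[k] * b[col_idx[k]]
               for k in range(row_ptr[row], row_ptr[row + 1]))

b = [1.0, 2.0, 3.0]
for r in range(3):
    print(r, needed_b_entries(r), row_task(r, b))
```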

Characteristics of Task Interactions


Read-only interactions
tasks only read data associated with other tasks
example: matrix multiplication (shared: A and B)

Read-write interactions
read and modify data associated with other tasks
example: a shared task priority queue

harder to code: requires synchronization


- need to avoid ordering races (read-write, write-write, etc.)
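A minimal sketch of why read-write interactions need synchronization (the heap-plus-lock structure below is an assumed illustration, not from the slides): concurrent tasks pushing to and popping from a shared priority queue must serialize their updates to avoid racing on the shared data.

```python
import heapq
import threading

class SharedPriorityQueue:
    """Task priority queue shared by several workers (a read-write interaction)."""
    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()    # serializes all read-write accesses

    def push(self, priority, task):
        with self._lock:                 # avoid write-write races on the heap
            heapq.heappush(self._heap, (priority, task))

    def pop(self):
        with self._lock:                 # avoid read-write races on the heap
            return heapq.heappop(self._heap) if self._heap else None

q = SharedPriorityQueue()
threads = [threading.Thread(target=lambda i=i: q.push(i % 3, f"task{i}"))
           for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print([q.pop() for _ in range(8)])
```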

Characteristics of Task Interactions


One-sided
initiated and completed independently by one of the two interacting tasks
- GET
- PUT

Two-sided
both tasks coordinate in an interaction
- SEND + RECV


Outline

Characteristics of tasks and interactions


task generation, granularity, and context
characteristics of task interactions

Mapping techniques for load balancing


static mappings
dynamic mappings

Methods for minimizing interaction overheads
Parallel algorithm design templates

Mapping Techniques
Map concurrent tasks to processes for execution
Goal: all tasks complete in the shortest possible time

Overheads of mappings
serialization (idling) - due to uneven load balancing or dependencies
communication

A good mapping tries to minimize both sources of overhead
Conflicting objectives: minimizing one tends to increase the other
assigning all work to one processor (the extreme case)
- minimizes communication
- causes significant idling

minimizing serialization introduces communication

Mapping Techniques for Minimum Idling


Overall load balancing alone doesn't necessarily minimize idling
The task-dependency graph determines when a task can run
Must balance computation and interactions at each stage

Mapping Techniques for Minimum Idling


Static vs. dynamic mappings

Static mapping
a priori mapping of tasks to processes
requirements:
- a good estimate of task sizes
- even so, finding an optimal mapping may be NP-complete

e.g., the multiple-knapsack problem

Dynamic mapping
map tasks to processes at runtime
why?
- tasks are generated at runtime, or

- their sizes are unknown

need to make sure the cost of moving data doesn't outweigh the benefit of dynamic mapping

Factors that influence the choice of mapping
size of data associated with a task
nature of the underlying domain

Schemes for Static Mapping


Data partitionings
Task-graph partitionings
Hybrid strategies

Mappings Based on Data Partitioning


Partition computation using a combination of
data partitioning
the owner-computes rule

Example: 1-D block distribution for dense matrices

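A minimal sketch of the idea (the helper names and the dense matrix-vector example are assumptions for illustration): rows are divided into contiguous blocks, and under the owner-computes rule each process computes exactly the output rows it owns.

```python
def block_owner(i, n, p):
    """Owner of row i in a 1-D block distribution of n rows over p processes."""
    block = (n + p - 1) // p             # ceiling block size
    return i // block

def my_rows(rank, n, p):
    """Contiguous row range [lo, hi) owned by process 'rank'."""
    block = (n + p - 1) // p
    return rank * block, min(n, (rank + 1) * block)

def local_matvec(rank, A, x, p):
    """Owner-computes: this process produces only its own rows of y = A x."""
    n = len(A)
    lo, hi = my_rows(rank, n, p)
    return {i: sum(A[i][j] * x[j] for j in range(n)) for i in range(lo, hi)}

print([block_owner(i, 8, 3) for i in range(8)])   # [0, 0, 0, 1, 1, 1, 2, 2]
A, x = [[1, 2], [3, 4]], [1, 1]
print(local_matvec(0, A, x, p=2), local_matvec(1, A, x, p=2))
```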

Block Array Distribution Schemes


Multi-dimensional block distributions

Multi-dimensional partitioning enables the use of a larger number of processes

Block Array Distribution Example


Multiplying two dense matrices C = A x B

Partition the output matrix C using a block decomposition
Give each task the same number of elements of C
each element of C corresponds to one dot product, so the load balance is even

Obvious choices: 1D or 2D decomposition
Select the one that minimizes the associated communication overhead
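The sketch below (the process-grid shape and helper names are illustrative assumptions) partitions C into a 2D grid of blocks; each task computes its block of C = A x B as dot products over the corresponding rows of A and columns of B.

```python
def block_range(idx, n, parts):
    """Half-open range of rows/columns belonging to block 'idx' out of 'parts'."""
    size = (n + parts - 1) // parts
    return idx * size, min(n, (idx + 1) * size)

def compute_block(A, B, bi, bj, pr, pc):
    """Task (bi, bj) on a pr-by-pc process grid computes its block of C = A*B."""
    n = len(A)
    r0, r1 = block_range(bi, n, pr)
    c0, c1 = block_range(bj, n, pc)
    return {(i, j): sum(A[i][k] * B[k][j] for k in range(n))
            for i in range(r0, r1) for j in range(c0, c1)}

A = [[1, 0], [0, 2]]
B = [[3, 4], [5, 6]]
# 2x2 grid of tasks, one block of C each
for bi in range(2):
    for bj in range(2):
        print((bi, bj), compute_block(A, B, bi, bj, 2, 2))
```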

Imbalance and Block Array Distributions


Consider a block distribution for LU decomposition
Computing different blocks requires different amounts of work
If we map all tasks associated with a certain block onto a process in a 9-process ensemble => imbalance => significant idle time

Another computation with similar distribution challenges


Gaussian Elimination

Block Cyclic Distribution


Variant of the block distribution scheme that can be used to alleviate load imbalance and idling

Steps
1. partition the array into many more blocks than the number of available processes
2. assign blocks to processes in a round-robin manner
- each process gets several non-adjacent blocks
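A minimal sketch of both distributions (the triangular per-block work model below is an assumption that loosely mimics LU-style computations where later blocks carry more work; it is not a formula from the slides). It shows how the round-robin assignment evens out per-process work.

```python
def block_map(num_blocks, p):
    """Contiguous block distribution: consecutive blocks go to the same process."""
    per = num_blocks // p
    return [min(b // per, p - 1) for b in range(num_blocks)]

def block_cyclic_map(num_blocks, p):
    """Block-cyclic distribution: block b goes to process b % p (round-robin)."""
    return [b % p for b in range(num_blocks)]

def per_process_work(mapping, p, work):
    totals = [0] * p
    for b, proc in enumerate(mapping):
        totals[proc] += work[b]
    return totals

num_blocks, p = 12, 3
work = [b + 1 for b in range(num_blocks)]   # assumed: later blocks cost more
print("block:       ", per_process_work(block_map(num_blocks, p), p, work))
print("block-cyclic:", per_process_work(block_cyclic_map(num_blocks, p), p, work))
```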

Block-Cyclic Distribution

(a) 1D block-cyclic

(b) 2D block-cyclic

In certain cases even a block-cyclic distribution results in imbalance; a remedy is the randomized block distribution

Decomposition by Graph Partitioning


Data partitioning is very effective for problems that use dense matrices and have regular interaction patterns.

However, some problems use sparse matrices and have data-dependent, irregular interaction patterns
- example: sparse matrix-vector multiply

Graph of the matrix is useful for decomposition


work ~ number of edges
communication for a node ~ node degree

Goal: balance work & minimize communication
Partition the graph


assign an equal number of nodes to each process
minimize the edge-cut of the partition
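As a sketch (the adjacency-list representation and function names are illustrative assumptions), the helper below evaluates a candidate partition against the two criteria just listed: how evenly nodes are spread across processes and how many edges are cut.

```python
from collections import Counter

def evaluate_partition(adj, assign, p):
    """adj: node -> set of neighbours; assign: node -> process id."""
    load = Counter(assign[v] for v in adj)               # nodes per process (work)
    cut = sum(1 for v in adj for w in adj[v]
              if v < w and assign[v] != assign[w])       # edges crossing the partition
    return [load[q] for q in range(p)], cut

# A 2x3 grid graph partitioned two ways across 2 processes
adj = {0: {1, 3}, 1: {0, 2, 4}, 2: {1, 5},
       3: {0, 4}, 4: {1, 3, 5}, 5: {2, 4}}
columns = {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 1}   # cut between two column groups
stripes = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}   # interleaved: many cut edges
print(evaluate_partition(adj, columns, 2))        # ([4, 2], 2)
print(evaluate_partition(adj, stripes, 2))        # ([3, 3], 7)
```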

Partitioning a Graph of Lake Superior

Random partitioning (8 processes)

Partitioning for minimum edge-cut (8 processes)


Mappings Based on Task Partitioning


Partitioning a task-dependency graph

Optimal partitioning for general task-dependency graph


NP-complete problem

Excellent heuristics exist for structured graphs


Mapping a Binary Tree Dependency Graph


Dependency graph for quicksort
Task assignment to processes in a hypercube*

*hypercube: node numbers that differ in 1 bit are adjacent


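A small sketch of the hypercube property used here (the mapping rule below, where each tree level hands one child to the neighbour obtained by flipping one more bit, is an assumed illustration of this kind of assignment, not the exact figure from the slides):

```python
def hypercube_neighbors(node, dims):
    """In a d-dimensional hypercube, nodes whose labels differ in one bit are adjacent."""
    return [node ^ (1 << b) for b in range(dims)]

def split_onto(process, level):
    """A tree task at depth 'level' hosted on 'process' keeps one child locally and
    sends the other to the hypercube neighbour obtained by flipping bit 'level'."""
    return process, process ^ (1 << level)

dims = 3
print(hypercube_neighbors(0b000, dims))          # [1, 2, 4]
# Root task on process 0; each level hands half the work to a neighbouring process
for level in range(dims):
    print(level, [split_onto(p, level) for p in range(1 << level)])
```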

Task Partitioning: Mapping a Sparse Graph


17 items to communicate

13 items to communicate

Hierarchical Mappings
Sometimes a single mapping is inadequate
e.g., task mapping of a binary tree cannot readily use a large number of processors (e.g. parallel quicksort).

Hierarchical approach
use a task mapping at the top level
use data partitioning within each level

Schemes for Dynamic Mapping


Dynamic mapping AKA dynamic load balancing
load balancing is the primary motivation for dynamic mapping

Styles
centralized
distributed

Centralized Dynamic Mapping


Processes = master(s) or slaves
General strategy
when a slave runs out of work, it requests more work from the master

Challenge
master may become bottleneck for large # of processes

Approach
chunk scheduling: a process picks up several tasks at once
however:
- large chunk sizes may cause significant load imbalance
- so gradually decrease the chunk size as the computation progresses (see the sketch below)
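A minimal sketch of centralized chunk scheduling with a shrinking chunk size (the threads-plus-shared-counter setup and the `remaining // (2 * workers)` shrink rule are assumptions for illustration, not details from the slides):

```python
import threading

def make_grabber(total_tasks, workers):
    """Central 'master' state: a shared counter that hands out shrinking chunks."""
    lock = threading.Lock()
    next_task = [0]

    def grab():
        with lock:
            remaining = total_tasks - next_task[0]
            if remaining == 0:
                return None
            # guided scheduling: the chunk shrinks as fewer tasks remain
            chunk = max(1, remaining // (2 * workers))
            start = next_task[0]
            next_task[0] += chunk
            return range(start, start + chunk)
    return grab

def slave(grab, done):
    while (chunk := grab()) is not None:
        done.extend(chunk)            # stand-in for actually executing each task

grab, done = make_grabber(100, 4), []
threads = [threading.Thread(target=slave, args=(grab, done)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(done))                      # all 100 tasks handed out exactly once
```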

Distributed Dynamic Mapping


All processes are peers
Each process can send work to or receive work from other processes
avoids centralized bottleneck

Four critical design questions


how are sending and receiving processes paired together?
who initiates a work transfer?
how much work is transferred?
when is a transfer triggered?

Ideal answers can be application-specific
Cilk uses a distributed dynamic mapping: work stealing (a sketch follows below)

Suitability for distributed- vs. shared-memory architectures
- for message-passing computers, the computation size should be >> the data size
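A minimal work-stealing sketch (the deque-per-worker design and all names below are assumptions for illustration; Cilk's actual runtime is far more involved): each worker pops work from one end of its own deque, and when it runs dry it steals from the opposite end of another worker's deque.

```python
import collections
import random
import threading

class Worker:
    def __init__(self, wid, tasks):
        self.wid = wid
        self.deque = collections.deque(tasks)   # local work
        self.lock = threading.Lock()
        self.executed = []

    def run(self, peers):
        while True:
            task = self.pop_local() or self.steal(peers)
            if task is None:
                return                           # no local work, nothing to steal
            self.executed.append(task)           # stand-in for running the task

    def pop_local(self):
        with self.lock:
            return self.deque.pop() if self.deque else None       # own end

    def steal(self, peers):
        for victim in random.sample(peers, len(peers)):
            if victim is self:
                continue
            with victim.lock:
                if victim.deque:
                    return victim.deque.popleft()                  # opposite end
        return None

# Deliberately unbalanced initial mapping: worker 0 gets all the work
workers = [Worker(0, [f"t{i}" for i in range(20)]), Worker(1, []), Worker(2, [])]
threads = [threading.Thread(target=w.run, args=(workers,)) for w in workers]
for t in threads: t.start()
for t in threads: t.join()
print([len(w.executed) for w in workers])   # work typically spreads across workers
```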

Outline

Characteristics of tasks and interactions


task generation, granularity, and context
characteristics of task interactions

Mapping techniques for load balancing


static mappings
dynamic mappings

Methods for minimizing interaction overheads
Parallel algorithm design templates

Minimizing Interaction Overheads (1)


Rules of thumb

Maximize data locality


don't fetch data you already have
restructure computation to reuse data promptly

Minimize volume of data exchange


partition interaction graph to minimize edge crossings

Minimize frequency of communication


try to aggregate messages where possible

Minimize contention and hot-spots


use decentralized techniques (avoidance)


Minimizing Interaction Overheads (2)


Techniques

Overlap communication with computation


use non-blocking communication primitives
- overlap communication with your own computation
- one-sided: prefetch remote data to hide latency

multithread code on a processor


- overlap communication with another thread's computation

Replicate data or computation to reduce communication
Use group communication instead of point-to-point primitives
Issue multiple communications and overlap their latency (reduces exposed latency)
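A minimal sketch of communication/computation overlap using a prefetch thread and double buffering (the `fetch_remote` stand-in and the buffering scheme are illustrative assumptions; in practice this would use non-blocking or one-sided communication primitives):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_remote(block_id):
    """Stand-in for a non-blocking remote get/receive of one data block."""
    time.sleep(0.05)                          # pretend network latency
    return [block_id] * 1000

def compute(block):
    time.sleep(0.05)                          # pretend local computation on a block
    return sum(block)

def process_blocks(num_blocks):
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(fetch_remote, 0)           # start the first transfer
        for b in range(num_blocks):
            block = pending.result()                     # wait only if the transfer lags
            if b + 1 < num_blocks:
                pending = comm.submit(fetch_remote, b + 1)   # prefetch the next block...
            results.append(compute(block))               # ...while computing on this one
    return results

start = time.time()
process_blocks(8)
print(f"elapsed with overlap: {time.time() - start:.2f}s")  # ~half the sequential time
```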

Outline

Characteristics of tasks and interactions


task generation, granularity, and context
characteristics of task interactions

Mapping techniques for load balancing


static mappings
dynamic mappings

Methods for minimizing interaction overheads
Parallel algorithm design templates

Parallel Algorithm Model


Definition: a way of structuring a parallel algorithm
Aspects of a model
- decomposition
- mapping technique
- strategy to minimize interactions

Common Parallel Algorithm Models


Data parallel
each task performs similar operations on different data
tasks are typically mapped statically to processes

Task graph
use task dependency graph relationships to
- promote locality, or reduce interaction costs

Master-slave
one or more master processes generate work and allocate it to worker processes
allocation may be static or dynamic

Pipeline / producer-consumer
pass a stream of data through a sequence of processes
each process performs some operation on the stream

Hybrid
apply multiple models hierarchically, or
apply multiple models in sequence to different phases
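An illustrative sketch of the pipeline/producer-consumer model (the stage functions and queue sizes are assumptions, not from the slides): each stage runs in its own thread and passes items downstream through a bounded queue.

```python
import queue
import threading

STOP = object()                      # sentinel marking the end of the stream

def stage(inbox, outbox, fn):
    """One pipeline stage: repeatedly take an item, transform it, pass it on."""
    while (item := inbox.get()) is not STOP:
        outbox.put(fn(item))
    outbox.put(STOP)                 # propagate shutdown downstream

q1, q2, q3 = (queue.Queue(maxsize=4) for _ in range(3))
threads = [
    threading.Thread(target=stage, args=(q1, q2, lambda x: x * x)),   # square
    threading.Thread(target=stage, args=(q2, q3, lambda x: x + 1)),   # increment
]
for t in threads: t.start()

for x in range(5):                   # producer feeds the stream
    q1.put(x)
q1.put(STOP)

results = []
while (item := q3.get()) is not STOP:     # consumer drains the last stage
    results.append(item)
for t in threads: t.join()
print(results)                       # [1, 2, 5, 10, 17]
```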

References
Slides originally from John Mellor-Crummey (Rice), COMP 422
Adapted from the slides "Principles of Parallel Algorithm Design" by Ananth Grama
Based on Chapter 3 of Introduction to Parallel Computing by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003
