Concurrency Mapping
Characteristics of Tasks
Key characteristics
- generation strategy
- associated work
- associated data size
Task Generation
Static task generation
- identify concurrent tasks a priori
- typically, data or recursive decomposition leads to static task generation
- examples:
  - matrix operations
  - graph algorithms
  - image processing applications
  - other regularly structured problems

Caveats
- recursive decomposition can also lead to dynamic task generation (quicksort; see the sketch below)
- exploratory decomposition can also lead to static task generation (15-puzzle)
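To make the quicksort caveat concrete, here is a minimal sketch of dynamic task generation: each recursive call spawns new tasks at runtime, so the task set (and the sizes of the tasks) depends on the pivots and cannot be known a priori. OpenMP tasks are used purely for illustration; the slides do not prescribe an API.

```c
/* Sketch: recursive decomposition of quicksort generating tasks
 * dynamically at runtime (OpenMP tasks for illustration).
 * Compile with: gcc -fopenmp quicksort.c */
#include <omp.h>
#include <stdio.h>

static void quicksort(int *a, int lo, int hi) {
    if (lo >= hi) return;
    /* Lomuto partition around pivot a[hi] */
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
        }
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;

    /* each recursive call becomes a new task; how many tasks arise,
     * and how big they are, depends on the pivots chosen at runtime */
    #pragma omp task shared(a)
    quicksort(a, lo, i - 1);
    #pragma omp task shared(a)
    quicksort(a, i + 1, hi);
    #pragma omp taskwait
}

int main(void) {
    int a[] = {5, 3, 8, 1, 9, 2, 7, 4, 6, 0};
    #pragma omp parallel
    #pragma omp single
    quicksort(a, 0, 9);
    for (int i = 0; i < 10; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```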
Task Sizes
Task size: the amount of time required for completion
- uniform: all tasks the same size (example?)
- non-uniform

Sometimes sizes are known or can be estimated a priori; sometimes not
- example: tasks in quicksort

Implications for mapping?
Implications
- small data: the task can easily migrate to another process
- large data: ties the task to a process
  - possibly can avoid communicating the task context: reconstruct/recompute the context elsewhere
Characteristics of Task Interactions
- static vs. dynamic
- regular vs. irregular
- read-only vs. read-write
- one-sided vs. two-sided
Dynamic interactions
- the timing or the set of interacting tasks cannot be determined a priori
- harder to code
  - especially using two-sided message-passing APIs
Irregular interactions
- lack a well-defined topology
- modeled by a graph
Example: sparse matrix-vector multiply. A task must scan its associated row(s) of A to know which entries of vector b it requires; this also determines the tasks it needs to interact with (see the sketch below).
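A minimal sketch of that row scan, assuming a CSR (compressed sparse row) layout and a block-row owner() rule for the entries of b; the matrix pattern, dimensions, and task count are illustrative.

```c
/* Sketch: a task scans its row of a CSR sparse matrix to discover
 * which entries of b (and therefore which owning tasks) it needs.
 * The CSR arrays and the block owner() rule are illustrative. */
#include <stdio.h>

#define N 8          /* matrix dimension (assumed) */
#define NTASKS 4     /* one task per block of rows (assumed) */

int owner(int col) { return col / (N / NTASKS); }   /* block-row owner of b[col] */

void scan_row(int row, const int *rowptr, const int *colidx) {
    for (int k = rowptr[row]; k < rowptr[row + 1]; k++) {
        int col = colidx[k];             /* nonzero at (row, col) */
        printf("row %d needs b[%d], owned by task %d\n",
               row, col, owner(col));
    }
}

int main(void) {
    /* a small sparse pattern: row r's nonzeros are colidx[rowptr[r] .. rowptr[r+1]-1] */
    int rowptr[N + 1] = {0, 2, 4, 5, 7, 9, 10, 12, 14};
    int colidx[14]    = {0, 5, 1, 3, 2, 0, 3, 4, 7, 5, 2, 6, 6, 7};
    for (int r = 0; r < N; r++) scan_row(r, rowptr, colidx);
    return 0;
}
```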
Read-write interactions
- read and modify data associated with other tasks
- example: a shared task priority queue (a simplified sketch follows)
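A minimal sketch of such a read-write interaction: several threads both read and modify a shared work queue under a lock. A plain FIFO queue stands in for the priority queue for brevity; all names and the pthread-based structure are illustrative. Compile with -pthread.

```c
/* Sketch: a shared work queue protected by a mutex, illustrating a
 * read-write interaction (names and sizes are illustrative). */
#include <pthread.h>
#include <stdio.h>

#define NTASKS 16
#define NWORKERS 4

static int queue[NTASKS];
static int head = 0;                     /* next task to hand out */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&lock);       /* interaction: read AND modify */
        int task = (head < NTASKS) ? queue[head++] : -1;
        pthread_mutex_unlock(&lock);
        if (task < 0) break;             /* queue drained */
        printf("worker %ld runs task %d\n", id, task);
    }
    return NULL;
}

int main(void) {
    for (int i = 0; i < NTASKS; i++) queue[i] = i;
    pthread_t t[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```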
Two-sided
- both tasks coordinate in an interaction
  - SEND + RECV (sketched below)
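A minimal two-sided exchange in MPI: the data moves only because both ranks make matching calls. Run with two processes (e.g., mpirun -np 2).

```c
/* Sketch: a two-sided interaction in MPI; both sides must
 * participate (SEND on one rank, RECV on the other). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* SEND */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,    /* RECV */
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```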
Mapping Techniques
Map concurrent tasks to processes for execution
Goal: all tasks complete in the shortest possible time
Overheads of mappings
- serialization (idling): due to uneven load balance or dependencies
- communication

A good mapping tries to minimize both sources of overhead
Conflicting objectives: minimizing one often increases the other
- extreme example: assigning all work to one processor
  - minimizes communication
  - causes significant idling
[Figure: task execution timelines (axes labeled "Time")]
Static mapping
- a priori mapping of tasks to processes
- requirements:
  - a good estimate of task size
  - even so, finding an optimal mapping may be NP-complete
Dynamic mapping
- map tasks to processes at runtime
- why?
  - tasks are generated at runtime, or
  - task sizes are not known a priori
- need to make sure the cost of moving data doesn't outweigh the benefit of dynamic mapping

Factors that influence the choice of mapping
- size of data associated with a task
- nature of the underlying domain
Example: matrix multiplication
Partition the output matrix C using a block decomposition
Give each task the same number of elements of C
- each element of C corresponds to a dot product
- even load balance
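A sketch of the block decomposition, assuming an N x N matrix over a P x P grid of tasks with P dividing N (sizes are illustrative); a serial loop over the task grid stands in for P*P concurrent tasks.

```c
/* Sketch: block decomposition of the output matrix C; task (tr, tc)
 * owns an (N/P) x (N/P) block and computes one dot product per
 * owned element, so every task does the same amount of work. */
#include <stdio.h>

#define N 8   /* matrix dimension (assumed) */
#define P 2   /* tasks per dimension (assumed; P divides N) */

void compute_block(int tr, int tc,
                   double A[N][N], double B[N][N], double C[N][N]) {
    int bs = N / P;                          /* block size */
    for (int i = tr * bs; i < (tr + 1) * bs; i++)
        for (int j = tc * bs; j < (tc + 1) * bs; j++) {
            C[i][j] = 0.0;                   /* one dot product per element */
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

int main(void) {
    static double A[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; i++) { A[i][i] = 1.0; B[i][i] = 2.0; }
    for (int tr = 0; tr < P; tr++)           /* serial loop stands in   */
        for (int tc = 0; tc < P; tc++)       /* for P*P concurrent tasks */
            compute_block(tr, tc, A, B, C);
    printf("C[3][3] = %g\n", C[3][3]);       /* identity * 2I: expect 2 */
    return 0;
}
```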
Steps
1. partition an array into many more blocks than the number of available processes
2. assign blocks to processes in a round-robin manner
   - each process gets several non-adjacent blocks
Block-Cyclic Distribution
[Figure: (a) 1D block-cyclic distribution; (b) 2D block-cyclic distribution]
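A sketch of the 1D block-cyclic owner computation (round-robin over blocks); the 2D variant applies the same rule independently in each dimension. Array, block, and process counts are illustrative.

```c
/* Sketch: which process owns each element under a 1D block-cyclic
 * distribution; blocks are dealt out round-robin over processes. */
#include <stdio.h>

int main(void) {
    int n = 16, block = 2, p = 4;        /* sizes are illustrative */
    for (int i = 0; i < n; i++) {
        int blk = i / block;             /* which block element i is in */
        int owner = blk % p;             /* round-robin over processes  */
        printf("element %2d -> block %d -> process %d\n", i, blk, owner);
    }
    return 0;
}
```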
However, some problems use sparse matrices and have data-dependent, irregular interaction patterns
- example: sparse matrix-vector multiply
Hierarchical Mappings
Sometimes a single mapping is inadequate
- e.g., the task mapping of a binary tree cannot readily use a large number of processors (e.g., parallel quicksort)
Hierarchical approach
- use a task mapping at the top level
- data partitioning within each level
Styles
- centralized
- distributed
Challenge
the master may become a bottleneck for a large # of processes
Approach
chunk scheduling: a process picks up several tasks at once
- however, large chunk sizes may cause significant load imbalance
- mitigation: gradually decrease the chunk size as the computation progresses (see the sketch below)
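One concrete realization of decreasing chunk sizes is OpenMP's guided schedule, which hands out large chunks first and progressively smaller ones; a minimal sketch, where the loop body is a stand-in for non-uniform work. Compile with -fopenmp.

```c
/* Sketch: chunk scheduling with decreasing chunk sizes via OpenMP's
 * guided schedule; the runtime trades scheduling overhead (few, large
 * chunks early) against load imbalance (small chunks late). */
#include <omp.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    #pragma omp parallel for schedule(guided) reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1);            /* stand-in for non-uniform work */
    printf("sum = %f\n", sum);
    return 0;
}
```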
Ideal answers can be application-specific
- Cilk uses a distributed dynamic mapping: work stealing

Suitability for distributed- vs. shared-memory architectures
- for message-passing computers, the computation size should be >> the data size
Minimizing Interaction Overheads
- replicate data or computation to reduce communication
- use group communication instead of point-to-point primitives
- issue multiple communications and overlap their latency (reduces exposed latency; sketched below)
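A minimal sketch of latency overlap using nonblocking MPI calls: start the transfers, compute on independent data while the messages are in flight, then wait. The ring pattern and buffer contents are illustrative.

```c
/* Sketch: overlapping communication latency with computation using
 * nonblocking MPI calls in a ring exchange. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    double sendbuf = 1.0, recvbuf = 0.0, local[4] = {1, 2, 3, 4};
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;           /* ring neighbors */
    int left  = (rank - 1 + size) % size;

    /* start both transfers, then compute while they are in flight */
    MPI_Irecv(&recvbuf, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendbuf, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    double sum = 0.0;
    for (int i = 0; i < 4; i++) sum += local[i];   /* independent work */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);     /* latency now hidden */
    printf("rank %d: sum=%g received=%g\n", rank, sum, recvbuf);
    MPI_Finalize();
    return 0;
}
```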
Parallel Algorithm Models

Task graph
- use task dependency graph relationships to
  - promote locality, or
  - reduce interaction costs
Master-slave
- one or more master processes generate work and allocate it to worker processes
- allocation may be static or dynamic
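A minimal master-worker skeleton in MPI with dynamic allocation: workers request task indices on demand, and the master sends -1 to signal termination. The message layout, tags, and task count are illustrative assumptions. Run with at least two processes.

```c
/* Sketch: master-slave model; rank 0 generates work and allocates
 * it dynamically, all other ranks are workers. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS 20

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* master: generate + allocate */
        int next = 0, done = 0;
        MPI_Status st;
        while (done < size - 1) {
            int dummy;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &st); /* a worker asks for work */
            int task = (next < NTASKS) ? next++ : -1;
            if (task < 0) done++;          /* -1 terminates that worker */
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
    } else {                               /* worker: request, run, repeat */
        for (;;) {
            int ask = 0, task;
            MPI_Send(&ask, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (task < 0) break;
            printf("rank %d runs task %d\n", rank, task);
        }
    }
    MPI_Finalize();
    return 0;
}
```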
Pipeline / producer-consumer
- pass a stream of data through a sequence of processes
- each process performs some operation on the stream
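A minimal MPI pipeline sketch: items flow rank 0 -> 1 -> ... -> p-1, and each stage applies its own operation (here just an increment, as a stand-in for real per-stage work).

```c
/* Sketch: pipeline / producer-consumer; rank 0 produces a stream,
 * each intermediate rank transforms it, the last rank consumes it. */
#include <mpi.h>
#include <stdio.h>

#define NITEMS 5

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < NITEMS; i++) {
        int item;
        if (rank == 0)
            item = 10 * i;                 /* producer creates the item */
        else
            MPI_Recv(&item, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);   /* receive from previous stage */
        item += 1;                         /* this stage's operation */
        if (rank < size - 1)
            MPI_Send(&item, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("item %d leaves the pipeline as %d\n", i, item);
    }
    MPI_Finalize();
    return 0;
}
```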
Hybrid
- apply multiple models hierarchically, or
- apply multiple models in sequence to different phases
References
- Slides originally from John Mellor-Crummey (Rice University), COMP 422
- Adapted from the slides "Principles of Parallel Algorithm Design" by Ananth Grama
- Based on Chapter 3 of Introduction to Parallel Computing by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison-Wesley, 2003