Lecture 4
A pattern language for parallel programs

Finding concurrency in a given problem - deep dive
Finding Concurrency: structure the given problem to expose exploitable concurrency.
Algorithm Structure
Implementation Mechanisms: helps the algorithm to be implemented.
Decomposition Patterns

Task decomposition: An approach
C = A \times B

C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \times B_{k,j}
Data decomposition: Design

Besides identifying the "resource" intensive parts, identify the key data structures required to solve the problem, and how the data is used during the solution.
Q: Is the decomposition suitable to a specific system or to many systems?
Q: Does it scale with the size of the parallel computer?
Are similar operations applied to different parts of the data, independently?
Are there different chunks of data that can be distributed?
Relation between decomposition and ease of programming, debugging, and maintenance.
Examples:
Array-based computations: concurrency defined in terms of updates of different segments of the array/matrix.
Recursive data structures: concurrency by decomposing the parallel updates of a large tree/graph/linked list.
Note: Data decomposition also leads to task decomposition.

Data decomposition: Matrix multiplication example

C = A \times B, \qquad C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \times B_{k,j}

"Resource" intensive parts?
Data chunks in the problem?
Does it scale with the size of parallel computers?
Operations (reads/writes) applied on independent parts of the data?
Data chunks big enough to deem the thread activity beneficial?
How to decompose?
Each row of C is computed in a different task.
Each column of C is computed in a different task.
Performance? Cache effect?
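As an illustration of the row-wise option above, a minimal sketch in Java with one task per row of C (class and method names are mine, not from the slides):

import java.util.ArrayList;
import java.util.List;

public class MatMulRowTasks {
    // Computes C = A x B, spawning one task per row of C.
    static double[][] multiply(double[][] A, double[][] B) throws InterruptedException {
        int n = A.length, m = B[0].length, k = B.length;
        double[][] C = new double[n][m];
        List<Thread> tasks = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            final int row = i;
            Thread t = new Thread(() -> {
                for (int j = 0; j < m; j++) {
                    double sum = 0.0;
                    for (int p = 0; p < k; p++) sum += A[row][p] * B[p][j];
                    C[row][j] = sum;          // each task writes a disjoint row: no races
                }
            });
            tasks.add(t);
            t.start();
        }
        for (Thread t : tasks) t.join();      // wait for all row tasks
        return C;
    }
}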
C = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix}
    \times
    \begin{pmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{pmatrix}
  = \begin{pmatrix}
      A_{1,1} \times B_{1,1} + A_{1,2} \times B_{2,1} & A_{1,1} \times B_{1,2} + A_{1,2} \times B_{2,2} \\
      A_{2,1} \times B_{1,1} + A_{2,2} \times B_{2,1} & A_{2,1} \times B_{1,2} + A_{2,2} \times B_{2,2}
    \end{pmatrix}
Advantages
Can fit the blocks into the cache.
Can scale as per the hardware.
Overlap of communication and computation.
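A sketch of this block decomposition, assuming square matrices split into four blocks and one Java task per block of C (block layout and names are illustrative):

public class BlockMatMul {
    // C = A x B with n x n matrices split into four (n/2 x n/2) blocks;
    // each C-block is one task, so blocks can be sized to fit the cache.
    static void multiplyBlock(double[][] A, double[][] B, double[][] C,
                              int ci, int cj, int n) {
        int h = n / 2;
        for (int i = ci; i < ci + h; i++)
            for (int j = cj; j < cj + h; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }

    static double[][] multiply(double[][] A, double[][] B) throws InterruptedException {
        int n = A.length, h = n / 2;
        double[][] C = new double[n][n];
        Thread[] tasks = {
            new Thread(() -> multiplyBlock(A, B, C, 0, 0, n)),
            new Thread(() -> multiplyBlock(A, B, C, 0, h, n)),
            new Thread(() -> multiplyBlock(A, B, C, h, 0, n)),
            new Thread(() -> multiplyBlock(A, B, C, h, h, n))
        };
        for (Thread t : tasks) t.start();
        for (Thread t : tasks) t.join();   // the four block tasks are independent
        return C;
    }
}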
Dependence analysis for managing parallelism: Grouping

Background: Task and data decomposition has been done.
All the identified tasks may not run in parallel.
Q: How should related tasks be grouped to help manage the dependencies?
Dependent, related tasks should be (uniquely?) grouped together.
Temporal dependency: if task A depends on the result of task B, then A must wait for the results from B. Q: Does A have to wait for B to terminate?
Concurrent dependency: tasks are expected to run in parallel, and one depends on the updates of the other.
Independent tasks: can run in parallel or in sequence. Is it always better to run them in parallel?
Advantage of grouping: grouping enforces partial orders between tasks; the application developer thinks of groups, instead of individual tasks.

Dependence analysis for managing parallelism: Ordering

Background: Task and data decomposition has been done. Dependent tasks have been grouped together.
Ordering of the tasks and groups is not trivial.
Q: How should the groups be ordered to satisfy the constraints among the groups and, in turn, the tasks?
Dependent groups and tasks should be ordered to preserve the original semantics.
Should not be overly restrictive.
Ordering is imposed by: data + control dependencies.
Ordering can also be imposed by external factors: network, I/O, and so on.
Ordering of independent tasks?
Importance of grouping: ensures the program semantics; a key step in program design.
Example: Computing of individual rows.
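To make one such ordering constraint concrete, a small sketch (mine, not from the slides), assuming Java futures: task A consumes the result of task B, and the get() call is where the ordering is enforced:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderingSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Task B produces a result.
        Future<Integer> resultOfB = pool.submit(() -> 40 + 2);

        // Task A depends on B's result: the get() enforces the ordering.
        // A waits only for the value, not for B to do anything further.
        Future<Integer> resultOfA = pool.submit(() -> resultOfB.get() * 10);

        System.out.println(resultOfA.get());   // 420
        pool.shutdown();
    }
}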
Background: Task and data decomposition has been done. Dependent tasks have been grouped together. The ordering between the groups and tasks has been identified.
Groups and tasks have some level of dependency among each other.
Q: How is data shared among the tasks?
Identify the data updated/needed by individual tasks - task-local data.
Some data may be updated by multiple tasks - global data.
Some data may be updated by one task and used by multiple tasks - remote data.

Identify the data being shared - directly follows from the decomposition.
If sharing is done incorrectly - a task may get invalid data due to a race condition.
A naive way to guarantee correct shared data: synchronize every read with barriers.
Synchronization of data across different tasks - may require communication. Options:
Overlap of communication and computation.
Privatization.
Keep local copies of shared data.
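A sketch of the privatization / local-copies option, assuming Java threads: each task accumulates into task-local data, and the shared result is touched only once per task (all names are illustrative):

public class PrivatizedSum {
    static double sum(double[] data, int nTasks) throws InterruptedException {
        double[] partial = new double[nTasks];         // one private copy per task
        Thread[] workers = new Thread[nTasks];
        for (int t = 0; t < nTasks; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                double local = 0.0;                    // task-local accumulator
                for (int i = id; i < data.length; i += nTasks) local += data[i];
                partial[id] = local;                   // each task writes only its own slot
            });
            workers[t].start();
        }
        double total = 0.0;
        for (int t = 0; t < nTasks; t++) {
            workers[t].join();                         // synchronize once, at the end
            total += partial[t];
        }
        return total;
    }
}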
One special case of sharing

Finding concurrency in a given problem - deep dive
Algorithm Structure - deep dive

Algorithm Structure design
Example: Task parallel algorithm

Solution to Branch and Bound ILP
Machine Job1 Job2 Job3 Job4
M1 4 4 3 5
M2 2 3 4 4
Divide and Conquer pattern: features

Divide and conquer - example Mergesort
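The mergesort slides are figures in the original; a minimal divide-and-conquer sketch, assuming Java's ForkJoinPool (threshold, splitting, and merge details are the usual ones, not taken from the slides):

import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ParallelMergeSort extends RecursiveAction {
    private final int[] a;
    private final int lo, hi;                                   // sort a[lo..hi)
    ParallelMergeSort(int[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected void compute() {
        if (hi - lo <= 1024) { Arrays.sort(a, lo, hi); return; } // base case: sort sequentially
        int mid = (lo + hi) >>> 1;
        invokeAll(new ParallelMergeSort(a, lo, mid),             // solve the two halves in parallel
                  new ParallelMergeSort(a, mid, hi));
        merge(mid);                                              // combine the sorted halves
    }

    private void merge(int mid) {
        int[] tmp = Arrays.copyOfRange(a, lo, mid);
        for (int i = 0, j = mid, k = lo; i < tmp.length; k++)
            a[k] = (j == hi || tmp[i] <= a[j]) ? tmp[i++] : a[j++];
    }

    public static void sort(int[] a) {
        ForkJoinPool.commonPool().invoke(new ParallelMergeSort(a, 0, a.length));
    }
}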
Geometric decomposition: Matrix multiplication

Algorithm Structure design
C = A \times B
  = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix}
    \times
    \begin{pmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{pmatrix}
  = \begin{pmatrix}
      A_{1,1} \times B_{1,1} + A_{1,2} \times B_{2,1} & A_{1,1} \times B_{1,2} + A_{1,2} \times B_{2,2} \\
      A_{2,1} \times B_{1,1} + A_{2,2} \times B_{2,1} & A_{2,1} \times B_{1,2} + A_{2,2} \times B_{2,2}
    \end{pmatrix}
Recursive Data structures: Parallel find roots

Parallelizing recursive data structures
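The find-roots slides are figures in the original; a sketch of a standard pointer-jumping approach (my reconstruction, assuming a parent-array forest; not the slides' code): every node repeatedly replaces its parent link with its grandparent's, so after O(log n) rounds every node points directly at its root.

import java.util.Arrays;
import java.util.stream.IntStream;

public class FindRoots {
    // parent[i] is i's parent; roots satisfy parent[r] == r.
    static int[] findRoots(int[] parent) {
        int n = parent.length;
        int[] cur = Arrays.copyOf(parent, n);
        boolean changed = true;
        while (changed) {
            int[] next = new int[n];
            // one pointer-jumping round, applied to all nodes in parallel
            IntStream.range(0, n).parallel().forEach(i -> next[i] = cur[cur[i]]);
            changed = !Arrays.equals(cur, next);
            System.arraycopy(next, 0, cur, 0, n);
        }
        return cur;    // cur[i] is now the root of the tree containing i
    }
}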
Pipeline pattern features

Once the pipeline is full, maximum parallelism is observed.
The number of stages should be small compared to the number of items processed.
Efficiency improves if the time taken in each stage is roughly the same. Else?
The amount of concurrency depends on the number of stages.
Too many stages - disadvantage?
Communication across stages?

Pipeline pattern: Issues

Error handling: create a separate task for error handling, which will run exception routines.
Processor allocation, load balancing.
Throughput and latency.
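A minimal two-stage pipeline sketch, assuming Java BlockingQueues as the channels between stages and a sentinel value to shut the stages down (stage bodies are placeholders):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    private static final int DONE = Integer.MIN_VALUE;     // sentinel to stop the stages

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> q1 = new ArrayBlockingQueue<>(16);
        BlockingQueue<Integer> q2 = new ArrayBlockingQueue<>(16);

        Thread stage1 = new Thread(() -> {                  // stage 1: transform items
            try {
                for (int x = q1.take(); x != DONE; x = q1.take()) q2.put(x * x);
                q2.put(DONE);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread stage2 = new Thread(() -> {                  // stage 2: consume items
            try {
                for (int x = q2.take(); x != DONE; x = q2.take()) System.out.println(x);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        stage1.start(); stage2.start();

        for (int i = 0; i < 10; i++) q1.put(i);             // producer feeds the pipeline
        q1.put(DONE);
        stage1.join(); stage2.join();
    }
}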
Challenges
Identifying the tasks.
Identifying the event flow.
Enforcing the event ordering.
Avoiding deadlock.
Efficient communication of events.
Left for self-reading.
Algorithm Structure design

Supporting structures
Situation
The workload at each task is variable and unpredictable (what if it is predictable?).
Not easy to map to a loop-based computation.
The underlying hardware has different capacities.

Master/Worker pattern
Has a logical master, and one or more instances of workers.
Computation by each worker may vary.
The master starts the computation and creates a set of tasks.
The master waits for the tasks to finish.
Master/Worker layout

Master/Worker Issues
    consumeResults(nTasks);
}
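Only the tail of the master's code survives in the extracted text (the consumeResults call above); a minimal sketch of what such a master might look like, assuming a shared task queue and result collection matching the worker template on the next slide (all names beyond taskQueue, globalResults, and consumeResults are mine):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class Master {
    static final Queue<Runnable> taskQueue = new ConcurrentLinkedQueue<>();
    static final Queue<Integer> globalResults = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws InterruptedException {
        int nTasks = 100, nWorkers = 4;
        for (int t = 0; t < nTasks; t++) {                      // master creates the set of tasks
            final int id = t;
            taskQueue.add(() -> globalResults.add(id * id));    // placeholder computation
        }
        Thread[] workers = new Thread[nWorkers];
        for (int w = 0; w < nWorkers; w++) {                    // start the workers
            workers[w] = new Thread(new Worker());
            workers[w].start();
        }
        for (Thread w : workers) w.join();                      // master waits for the tasks to finish
        consumeResults(nTasks);
    }

    static void consumeResults(int nTasks) {
        System.out.println(globalResults.size() + "/" + nTasks + " results collected");
    }
}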
Master/Worker - template for worker

Supporting structure
class Worker implements Runnable {
    public void run() {
        while (!Master.taskQueue.isEmpty()) {
            Runnable task = Master.taskQueue.poll();   // atomically dequeue a task
            if (task != null) task.run();              // do the computation; the task adds
                                                       // its result to globalResults atomically
        }
    }
}
Known uses
SETI@HOME
MapReduce
"Map" step: the master node takes the input, partitions it into smaller sub-problems, and distributes those to worker nodes. A worker may again partition the problem - multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
"Reduce" step: the master node takes all the answers and combines them to produce the output - the answer to the original problem.
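A toy sketch of the two steps, assuming the "input" is a list of words split into chunks: each worker counts its chunk (map) and the master merges the partial maps (reduce). Purely illustrative; this is not the MapReduce API.

import java.util.*;
import java.util.concurrent.*;

public class WordCountSketch {
    public static void main(String[] args) throws Exception {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");
        int nWorkers = 2, chunk = (words.size() + nWorkers - 1) / nWorkers;

        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        List<Future<Map<String, Integer>>> partials = new ArrayList<>();
        for (int w = 0; w < nWorkers; w++) {                       // "Map" step: one sub-problem per worker
            int lo = Math.min(words.size(), w * chunk), hi = Math.min(words.size(), (w + 1) * chunk);
            List<String> sub = words.subList(lo, hi);
            partials.add(pool.submit(() -> {
                Map<String, Integer> m = new HashMap<>();
                for (String s : sub) m.merge(s, 1, Integer::sum);
                return m;
            }));
        }
        Map<String, Integer> result = new HashMap<>();
        for (Future<Map<String, Integer>> f : partials)            // "Reduce" step: master combines the answers
            f.get().forEach((k, v) -> result.merge(k, v, Integer::sum));
        System.out.println(result);                                // e.g. {a=3, b=2, c=1}
        pool.shutdown();
    }
}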
Loop parallelization issues

Distributed memory architectures.
False sharing: variables that do not need to be shared, but fall in the same cache line. Can incur high overheads.
Seen in systems with distributed, coherent caches.
The caching protocol may force the reload of a cache line despite a lack of logical necessity.

foreach(j : [0..N]) {
  for(i=0; i<M; i++){
    A[j] += compute(j,i);
  }
}

versus a version that accumulates into a task-local temporary:

foreach(j : [0..N]) {
  double tmp;
  for(i=0; i<M; i++){
    tmp += compute(j,i);
  }
  atomic A[j] += tmp;
}

Reading material: Automatic loop parallelization.

Loop parallelization example

\pi = \int_0^1 \frac{4}{1 + x^2} \, dx

Sequential version:

int main () {
  int i, numSteps = 1000000;
  double x, pi, step, sum = 0.0;
  step = 1.0/(double)numSteps;
  for(i : [0..numSteps]) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
  printf("pi %lf\n", pi);
  return 0;
}

Parallel version:

int main () {
  int i, numSteps = 1000000;
  double pi, step, sum = 0.0;
  step = 1.0/(double)numSteps;
  forall(i : [0..numSteps]) {
    double x = (i+0.5)*step;
    double tmp = 4.0/(1.0+x*x);
    atomic sum = sum + tmp;
  }
  pi = step * sum;
  printf("pi %lf\n", pi);
  return 0;
}
Fork/Join - example Mergesort

Supporting structures and algorithm structure
Issues with shared data

Data race and interference: two parallel activities access shared data, and at least one of the accesses is a write. The activities are said to interfere.

forall (i:[1..n]) {
  sum += A[i];
}

for (i:[1..n]) {
  forall (j=1; j<m; ++j) {
    A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1]) / 3;
  }
}

Locality: trivial if data is not shared.
Dependencies: use synchronization (locks, barriers, atomics, ...) to enforce the dependencies.
Task scheduling: tasks might be suspended for access to shared data. Minimize the wait.

Issues with shared data

Deadlocks: two or more competing actions are each waiting for the other to finish. (Example via nested locks)
  lockA → lockB
  lockB → lockA
One way to avoid: partial order among locks. Locks are acquired in an order respecting the partial order.
Livelocks: the states of the processes involved in the livelock constantly change with regard to one another, with none progressing. Example: recovery from deadlock - if more than one process takes action, the deadlock detection algorithm can be repeatedly triggered, leading to a livelock.
Memory synchronization: when memory / cache is distributed.
How to implement all-to-all synchronization?
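A sketch of the lock-ordering rule above: both tasks acquire lockA before lockB, so the cyclic wait (lockA→lockB in one task, lockB→lockA in the other) cannot form. Lock and field names are illustrative.

public class LockOrdering {
    private final Object lockA = new Object();   // global order: lockA before lockB
    private final Object lockB = new Object();
    private int x, y;

    void task1() {
        synchronized (lockA) {
            synchronized (lockB) { x++; y++; }
        }
    }

    void task2() {
        // also lockA first, even though this task mainly needs lockB:
        // acquiring in the reverse order here could deadlock with task1
        synchronized (lockA) {
            synchronized (lockB) { y--; x--; }
        }
    }
}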
Supporting structure

Implementation Mechanisms
UE - unit of execution (a process / thread / activity).
Difference between process / thread / activity.
Management = creation, execution, termination.
Varies with different underlying languages.
Go back to the first few lectures for a recap.

done = true;
...
done = false;    // in one activity
while (done) ;   // in another activity

The value may be present in a cache - cache coherence may take care of it.
The value may be present in a register - culprit: the compiler.
The value may not be read at all. How?

x = y = 0
Thread 1        Thread 2
1: r1 = x       4: x = 1
2: y = 1           r3 = y
3: r2 = x
r1 == r2 == r3 == 0. Possible?
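A sketch of the flag example in Java: without volatile (or other synchronization) the spinning activity may keep re-reading a stale register or cache copy of done and never terminate; marking the flag volatile (or accessing it under synchronization) restores visibility.

public class VisibilitySketch {
    // without 'volatile', the spinning thread may never observe the update below
    static volatile boolean done = true;

    public static void main(String[] args) throws InterruptedException {
        Thread spinner = new Thread(() -> {
            while (done) { /* spin until another activity clears the flag */ }
            System.out.println("observed done == false");
        });
        spinner.start();
        Thread.sleep(100);        // some other work
        done = false;             // the write the spinner is waiting to observe
        spinner.join();
    }
}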
Synchronization: Memory synchronization and fences

Synchronization: Barriers
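The barrier slides themselves are figures in the original; a minimal sketch of barrier synchronization, assuming Java's CyclicBarrier: no thread starts phase 2 until all threads have completed phase 1.

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierSketch {
    public static void main(String[] args) {
        int n = 4;
        CyclicBarrier barrier =
            new CyclicBarrier(n, () -> System.out.println("--- all reached the barrier ---"));
        for (int t = 0; t < n; t++) {
            final int id = t;
            new Thread(() -> {
                try {
                    System.out.println("thread " + id + ": phase 1");
                    barrier.await();                        // wait until all n threads arrive
                    System.out.println("thread " + id + ": phase 2");
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}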
Thanks - Jun Shirako
Synchronization

Memory fence
Barriers
Mutual exclusion: Java synchronized, omp_set_lock, omp_unset_lock.
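A sketch of the mutual exclusion mechanism using Java synchronized (the OpenMP omp_set_lock/omp_unset_lock pair plays the same role in C):

public class Counter {
    private int value = 0;

    // synchronized gives mutual exclusion: at most one thread
    // executes either of these methods on the same Counter at a time.
    public synchronized void increment() { value++; }
    public synchronized int get() { return value; }
}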
Implementation Mechanisms
Serial reduction

Tree based reduction
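The reduction slides are figures in the original; a sketch of the two shapes for combining per-task values, assuming summation: the serial version combines them one after another in O(n) steps, while the tree version pairs them up in O(log n) parallel rounds.

import java.util.Arrays;
import java.util.stream.IntStream;

public class ReductionSketch {
    // Serial reduction: one combine after another, O(n) steps.
    static double serialReduce(double[] v) {
        double acc = 0.0;
        for (double x : v) acc += x;
        return acc;
    }

    // Tree-based reduction: each round halves the number of live values, O(log n) rounds.
    static double treeReduce(double[] v) {
        double[] a = Arrays.copyOf(v, v.length);
        for (int active = a.length; active > 1; active = (active + 1) / 2) {
            final int half = (active + 1) / 2;
            final int cur = active;
            // all pairs in a round can be combined in parallel
            IntStream.range(0, cur / 2).parallel()
                     .forEach(i -> a[i] = a[i] + a[i + half]);
        }
        return a[0];
    }
}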
Overall big picture

Sources