Lecture 4

This document discusses patterns for parallel programming. It describes a pattern language as a structured method for describing good design practices within a field of expertise, outlines an approach for finding concurrency in a problem by decomposing it into tasks and data, and gives examples of task and data decomposition for matrix multiplication, identifying independent parts that can run concurrently.

CS6868 - Concurrent Programming
Design patterns for parallel programs
V. Krishna Nandivada, IIT Madras

A pattern language

- Pattern: "a careful description of a perennial solution to a recurring problem within a ... context."
- Origin: Christopher Alexander, 1977, in the context of the design and construction of buildings and towns.
- Patterns in software engineering: Beck and Cunningham (1987); Gamma, Helm, Johnson, Vlissides (1995).
- Pattern language: a structured method of describing good design practices within a field of expertise.

A pattern language for parallel programs

- Finding Concurrency: structure the given problem to expose exploitable concurrency.
- Algorithm Structure: structure the algorithm to take advantage of the potential concurrency.
- Supporting Structures: help the algorithm to be implemented.
- Implementation Mechanisms: how the high-level specifications are mapped onto the underlying system.
- Goal: identify patterns in each stage.

Finding concurrency in a given problem - deep dive
Decomposition Patterns

- Task decomposition: break the program into a sequence of "tasks"; some of the tasks can run in parallel. The more independent the tasks, the better.
- Data decomposition: focus on the data used by the program; decompose the program into tasks based on distinct chunks of data. Efficiency depends on the independence of the chunks.
- Task decomposition may lead to data decomposition, and vice versa.
- Note the relation between the tasks and the ease of programming, debugging and maintenance.

Task decomposition: An approach

- Identify the "resource" intensive parts of the problem.
- Identify the different tasks that make up the problem. Challenge: write the algorithms so that the tasks can run concurrently.
- Sometimes the problem naturally breaks into a collection of (nearly) independent tasks; sometimes not!
- Q: Are there enough tasks to keep all the H/W cores busy?
- Q: Does each task have enough work to keep an individual core busy?
- Q: Is the number of tasks dependent on, or independent of, the number of H/W cores?
- Q: Are these tasks relatively independent? Are they really independent?
- Instances of tasks: independent modules, loop iterations.

Task decomposition: Matrix multiplication example

C = A × B, with
$C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \times B_{k,j}$

- "Resource" intensive parts?
- Tasks in the problem?
- Are the tasks independent? Enough tasks for all the cores? Enough work for each task? Size of tasks versus number of cores?
- Each element C_{i,j} is computed in a different task - row major (see the sketch below).
- Each element C_{i,j} is computed in a different task - column major.
- Each element C_{i,j} is computed in a different task - diagonals.
- How to reason about performance? Cache effects?
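To make the row-major task decomposition concrete, here is a minimal Java sketch (mine, not from the slides) in which each row of C is one task submitted to a thread pool; the matrix size, the fixed thread pool, and the timeout are illustrative assumptions.

import java.util.concurrent.*;

public class RowTaskMatMul {
    public static void main(String[] args) throws InterruptedException {
        final int N = 512;                        // assumed matrix size
        final double[][] A = new double[N][N], B = new double[N][N], C = new double[N][N];
        // ... fill A and B ...
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < N; i++) {
            final int row = i;                    // one task per row of C (row-major decomposition)
            pool.submit(() -> {
                for (int j = 0; j < N; j++) {
                    double s = 0.0;
                    for (int k = 0; k < N; k++) s += A[row][k] * B[k][j];
                    C[row][j] = s;                // rows of C are disjoint: no synchronization needed
                }
            });
        }
        pool.shutdown();                          // no new tasks; wait for all rows to finish
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}

Because the rows of C are disjoint, the tasks never write to the same location; reasoning about cache behaviour for row-major versus column-major versus diagonal task shapes is exactly the question the slide raises.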
Data decomposition: Design

- Besides identifying the "resource" intensive parts, identify the key data structures required to solve the problem and how the data is used during the solution.
- Q: Is the decomposition suitable for a specific system, or for many systems?
- Q: Does it scale with the size of the parallel computer?
- Are similar operations applied to different parts of the data, independently?
- Are there different chunks of data that can be distributed?
- Are the operations (reads/writes) applied to independent parts of the data?
- Are the data chunks big enough for the thread activity to be beneficial?
- How to decompose? Examples:
  - Array-based computations: concurrency is defined in terms of updates to different segments of the array/matrix.
  - Recursive data structures: concurrency is obtained by decomposing the parallel updates of a large tree/graph/linked list.
- Note the relation between the decomposition and the ease of programming, debugging and maintenance.
- Note: data decomposition also leads to task decomposition.

Data decomposition: Matrix multiplication example

C = A × B, with
$C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \times B_{k,j}$

- "Resource" intensive parts?
- Data chunks in the problem?
- Does it scale with the size of the parallel computer?
- Each row of C is computed in a different task.
- Each column of C is computed in a different task.
- Performance? Cache effects?

Matrix multiplication: Data decomposition

$$
C = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix}
    \times
    \begin{pmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{pmatrix}
  = \begin{pmatrix}
      A_{1,1} \times B_{1,1} + A_{1,2} \times B_{2,1} & A_{1,1} \times B_{1,2} + A_{1,2} \times B_{2,2} \\
      A_{2,1} \times B_{1,1} + A_{2,2} \times B_{2,1} & A_{2,1} \times B_{1,2} + A_{2,2} \times B_{2,2}
    \end{pmatrix}
$$

Advantages
- The blocks can fit into the cache.
- Can scale as per the hardware.
- Allows overlap of communication and computation. (A block-per-task sketch is given below.)
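A corresponding Java sketch (again mine, not the lecture's code) of the 2x2 block data decomposition shown above: each block of C is one task, and each task reads only the A and B blocks it needs. The sizes and the executor are assumptions for illustration.

import java.util.concurrent.*;

public class BlockMatMul {
    // Multiply-accumulate one block: C[bi,bj] += A[bi,bk] * B[bk,bj], with blocks of size S.
    static void mulAdd(double[][] A, double[][] B, double[][] C, int bi, int bj, int bk, int S) {
        for (int i = bi * S; i < (bi + 1) * S; i++)
            for (int k = bk * S; k < (bk + 1) * S; k++)
                for (int j = bj * S; j < (bj + 1) * S; j++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    public static void main(String[] args) throws InterruptedException {
        final int N = 512, S = N / 2;             // 2 x 2 blocks, as on the slide
        double[][] A = new double[N][N], B = new double[N][N], C = new double[N][N];
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int bi = 0; bi < 2; bi++)
            for (int bj = 0; bj < 2; bj++) {
                final int fi = bi, fj = bj;
                pool.submit(() -> {               // one task per C block; C blocks are disjoint
                    mulAdd(A, B, C, fi, fj, 0, S);
                    mulAdd(A, B, C, fi, fj, 1, S);
                });
            }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}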
Dependence analysis for managing parallelism: Grouping

Background: task and data decomposition has been done.

- All the identified tasks may not run in parallel.
- Q: How should related tasks be grouped to help manage the dependencies?
- Dependent, related tasks should be (uniquely?) grouped together.
- Temporal dependency: if task A depends on the result of task B, then A must wait for the results from B. Q: Does A have to wait for B to terminate?
- Concurrent dependency: tasks are expected to run in parallel, and one depends on the updates of the other.
- Independent tasks: can run in parallel or in sequence. Is it always better to run them in parallel?
- Advantages of grouping: grouping enforces partial orders between tasks, and the application developer thinks in terms of groups instead of individual tasks.
- Example: computing individual rows.

Dependence analysis for managing parallelism: Ordering

Background: task and data decomposition has been done; dependent tasks have been grouped together.

- Ordering the tasks and groups is not trivial.
- Q: How should the groups be ordered so as to satisfy the constraints among the groups and, in turn, among the tasks?
- Dependent groups and tasks should be ordered so as to preserve the original semantics.
- The ordering should not be overly restrictive.
- Ordering is imposed by data and control dependencies.
- Ordering can also be imposed by external factors: network, I/O, and so on.
- Ordering of independent tasks?
- Importance of ordering: it ensures the program semantics, and it is a key step in program design.

Dependence analysis for managing parallelism: Data sharing

Background: task and data decomposition has been done; dependent tasks have been grouped together; the ordering between the groups and tasks has been identified.

- Groups and tasks have some level of dependency among each other.
- Q: How is data shared among the tasks?
- Identify the data updated/needed by individual tasks - task-local data.
- Some data may be updated by multiple tasks - global data.
- Some data may be updated by one task and used by multiple tasks - remote data.

Issues in data sharing

- Identify the data being shared - it follows directly from the decomposition.
- If sharing is done incorrectly, a task may get invalid data due to a race condition.
- A naive way to guarantee correct shared data: synchronize every read with barriers.
- Synchronizing data across different tasks may require communication. Options:
  - Overlap communication and computation.
  - Privatization.
  - Keep local copies of shared data.
One special case of sharing

- Accumulation/Reduction: data is used to accumulate a result - sum, minimum, maximum, variance, etc.
- Each core has a separate copy of the data, and accumulation happens in these local copies.
- The sub-results are then used to compute the final result.
- Example: sum the elements of an array A[1024] (see the sketch below).
  - Decompose the array into 32 chunks.
  - Accumulate each chunk separately.
  - Accumulate the sub-results into the global "sum".
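The array-sum example might look as follows in Java (my sketch; the 1024-element array and 32 chunks follow the slide, the rest is assumed): each thread accumulates into a private slot, and only the final, serial combination produces the global sum.

public class ChunkedSum {
    public static void main(String[] args) throws InterruptedException {
        final double[] A = new double[1024];
        java.util.Arrays.fill(A, 1.0);
        final int CHUNKS = 32, LEN = A.length / CHUNKS;
        final double[] partial = new double[CHUNKS];      // one local accumulator per chunk
        Thread[] t = new Thread[CHUNKS];
        for (int c = 0; c < CHUNKS; c++) {
            final int chunk = c;
            t[c] = new Thread(() -> {
                double s = 0.0;
                for (int i = chunk * LEN; i < (chunk + 1) * LEN; i++) s += A[i];
                partial[chunk] = s;                       // no sharing: each thread writes its own slot
            });
            t[c].start();
        }
        for (Thread th : t) th.join();
        double sum = 0.0;                                 // combine the sub-results serially
        for (double p : partial) sum += p;
        System.out.println("sum = " + sum);
    }
}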

Managing parallelism - design evaluation

Background: task and data decomposition has been done; dependent tasks have been grouped together; the ordering between the groups and tasks has been identified; a scheme for data sharing has also been identified.

- Of the multiple choices present at different points, we have chosen one.
- Q: Is the chosen path a "good" one?

Design evaluation factors

Suitability to the target platform (at a high level)
- Number of cores / HW threads - too few/too many tasks?
- Homogeneous or heterogeneous multi-cores? And the work distribution.
- Data distribution among the cores - equal/unequal?
- Cost of communication - fine- or coarse-grained data sharing.
- Amount of sharing - shared memory or distributed memory.

Metrics: simplicity (qualitative), efficiency, flexibility.

Flexibility
- Flexible/parametric over the number of cores/threads?
- Flexible/parametric over the number and size of data chunks?
- Does it handle boundary cases?

Efficiency
- Even load balancing?
- Minimum overhead? - task creation, synchronization, communication.
Algorithm Structure - deep dive

Task Parallelism

Q: A problem is best decomposed into a collection of tasks that can execute concurrently. How to exploit the concurrency efficiently?

- The problem can be decomposed into a collection of concurrent tasks.
- Tasks can be completely independent or can have dependencies.
- Tasks can be known from the beginning, or created dynamically (e.g., producer/consumer).
- The solution may or may not require all the tasks to finish.
- Challenges: assign tasks to cores so as to obtain a simple, flexible and efficient execution, and address the dependencies correctly.

Factors in efficient task-parallel algorithm design

Tasks
1. Enough tasks to keep the cores busy.
2. The advantage of creating the tasks should offset the overhead of creating and managing them.

Dependencies
1. Ordering constraints.
2. Dependencies from shared data: synchronization, private data.

Schedule: creation and scheduling
1. How are the tasks assigned to cores?
2. How are the tasks scheduled?
Example: Task parallel algorithm

Schedule jobs on machines, given the job costs:

Machine    Job1   Job2   Job3   Job4
M1          4      4      3      5
M2          2      3      4      4

Say Job1 to M1, Job2 to M2, Job3 to M1, Job4 to M2 = 7.

Solution to Branch and Bound ILP

- Maintain a list of tasks (partial solutions).
- Remove a partial solution from the list.
- Examine it: either discard it, declare it a solution, or add sub-problems to the task list.
- The tasks depend on each other through the task list (see the sketch below).
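A rough Java sketch (mine, not the lecture's code) of the shared task-list idea for this job-assignment problem: workers repeatedly remove a partial solution, prune it against the best makespan found so far, record it if it is complete, or push its two expansions back onto the list. The Node record, the blocking queue and the pending counter are all illustrative choices; for the slide's cost table it reports the makespan-7 assignment.

import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

public class BranchAndBound {
    // Job costs from the slide: COST[m][j] = time of job j on machine m.
    static final int[][] COST = { {4, 4, 3, 5}, {2, 3, 4, 4} };
    static final int JOBS = 4;

    // A partial solution: jobs 0..depth-1 already assigned; loadM is the current time on machine M.
    record Node(int depth, int load0, int load1) {}

    static final AtomicInteger best = new AtomicInteger(Integer.MAX_VALUE); // best makespan so far

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Node> tasks = new LinkedBlockingQueue<>();   // the shared task list
        tasks.add(new Node(0, 0, 0));
        AtomicInteger pending = new AtomicInteger(1);              // nodes not yet examined
        int nWorkers = 4;
        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        for (int w = 0; w < nWorkers; w++) {
            pool.submit(() -> {
                try {
                    while (pending.get() > 0) {
                        Node n = tasks.poll(10, TimeUnit.MILLISECONDS);
                        if (n == null) continue;
                        int bound = Math.max(n.load0(), n.load1());
                        if (n.depth() == JOBS) {
                            best.accumulateAndGet(bound, Math::min); // complete assignment: maybe a new best
                        } else if (bound < best.get()) {             // still promising: branch on the next job
                            pending.addAndGet(2);
                            tasks.add(new Node(n.depth() + 1, n.load0() + COST[0][n.depth()], n.load1()));
                            tasks.add(new Node(n.depth() + 1, n.load0(), n.load1() + COST[1][n.depth()]));
                        }                                            // otherwise: prune (bound >= best)
                        pending.decrementAndGet();
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("best makespan = " + best.get());
    }
}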

Divide and conquer

Q: Tasks are created recursively to solve a problem with a divide-and-conquer strategy. How to exploit the concurrency?

- Divide and conquer: the problem is solved by splitting it into a number of smaller subproblems. Examples?
- Each subproblem can be solved "fairly" independently - either directly, or by further divide and conquer.
- The solutions of the smaller problems are merged to compute the final solution.
- Each divide doubles the concurrency.
- Each merge halves the concurrency.
Divide and Conquer pattern: features

- The amount of exploitable concurrency varies: at the beginning and at the end there is very little of it.
- Note: "split" and "merge" are serial parts.
- Amdahl's law - speedup is constrained by the serial part. Impact?
- Too many parallel threads?
- What if the cores are distributed? - data movement?
- Tasks are created dynamically - load balancing?
- What if the sub-problems are not equal-sized?

Divide and conquer - example: Mergesort

int[] mergesort(int[] A, int L, int H) {
  if (H - L <= 1) return A;                         // base case: at most one element
  if (H - L <= T) { quickSort(A, L, H); return A; } // small range: sort serially
  int m = (L + H) / 2;
  int[] A1 = mergesort(A, L, m);
  int[] A2 = mergesort(A, m + 1, H);
  return merge(A1, A2);                             // returns a merged sorted array
}

- split cost? merge cost?
- Value of the threshold T?

Geometric decomposition

Q: How can an algorithm be organized around a data structure that has been decomposed into concurrently updatable "chunks"?

- Similar to decomposing a geometric region into subregions.
- Linear data structures (such as arrays) can often be decomposed into contiguous sub-structures.
- These individual chunks are processed in different concurrent tasks.
- Note: sometimes all the required data for a task is present "locally" (embarrassingly parallel - the Task Parallelism pattern), and sometimes tasks share data with "neighboring" chunks (see the sketch below).

Challenges
- Ensure that each task has access to all the data it needs.
- Mapping of chunks to cores giving good performance. Q: Why is it a challenge?
- Granularity of the decomposition (coarse- or fine-grain) - effect on efficiency? Parametric? Tweaked at compile time or at runtime?
- Shape of the chunk: regular/irregular?
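A minimal Java sketch (mine) of a geometric decomposition on a 1-D array: the array is split into contiguous chunks, one task per chunk, and each task also reads the boundary elements of its neighbouring chunks; double-buffering (read from one array, write to the other) avoids synchronizing on those shared boundary reads. The sizes and the three-point averaging body are assumptions.

public class Stencil1D {
    public static void main(String[] args) throws InterruptedException {
        final int N = 1 << 20, CHUNKS = 8, LEN = N / CHUNKS;
        double[] cur = new double[N], next = new double[N];
        cur[N / 2] = 1.0;                                   // some initial data
        for (int step = 0; step < 100; step++) {
            final double[] in = cur, out = next;
            Thread[] t = new Thread[CHUNKS];
            for (int c = 0; c < CHUNKS; c++) {
                final int lo = c * LEN, hi = lo + LEN;      // this task's contiguous chunk
                t[c] = new Thread(() -> {
                    for (int i = Math.max(lo, 1); i < Math.min(hi, in.length - 1); i++)
                        // in[i-1] and in[i+1] may belong to neighbouring chunks, but 'in'
                        // is read-only during this step, so no locking is needed.
                        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
                });
                t[c].start();
            }
            for (Thread th : t) th.join();                  // barrier between steps
            double[] tmp = cur; cur = next; next = tmp;     // swap the buffers
        }
        System.out.println(cur[N / 2]);
    }
}

Re-creating threads every step is wasteful; a real implementation would reuse a pool or a barrier, but the chunk-plus-boundary structure is the point here.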
Geometric decomposition: Matrix multiplication

$$
C = A \times B
  = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix}
    \times
    \begin{pmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{pmatrix}
  = \begin{pmatrix}
      A_{1,1} \times B_{1,1} + A_{1,2} \times B_{2,1} & A_{1,1} \times B_{1,2} + A_{1,2} \times B_{2,2} \\
      A_{2,1} \times B_{1,1} + A_{2,2} \times B_{2,1} & A_{2,1} \times B_{1,2} + A_{2,2} \times B_{2,2}
    \end{pmatrix}
$$

Recursive Data Pattern

Q: How can recursive data structures be partitioned so that operations on them can be performed in parallel?

- Linked lists, trees, graphs, ...
- Operations on recursive data structures are inherently serial - one has to move through the data structure sequentially; for example, traversing a linked list or a binary tree.
- Sometimes it is possible to reshape the operations to derive and exploit concurrency.

Recursive Data Pattern - example: Find roots

- Given a forest of rooted trees: compute the root of each node.
- Serial version: do a depth-first or breadth-first traversal from each root to the leaf nodes; for each visited node, set its root. Total running time?
- Q: Is there concurrency?
Recursive data structures: Parallel find roots

Parallelizing recursive data structures

- Recasting the problem increases the total cost; find a way to win it back.
- Effective exploitation of the derived concurrency depends on factors such as the amount of work available for each task and the amount of serial code.
- Restructuring may make the solution complex.
- Requirement of synchronization - why?
- Another example: find partial sums in a linked list.
- The original serial computation is transformed into one where we compute partial results and repeatedly combine them.
- Total cost? O(N log N).
- However, if we exploit the parallelism, the running time comes down to O(log N) (see the sketch below).
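The slide's figure is not reproduced here; the usual way to parallelize find-roots is pointer jumping, where every node repeatedly replaces its parent by its grandparent. A rough Java sketch of that idea (my reconstruction, with a parallel stream standing in for one task per node):

import java.util.Arrays;
import java.util.stream.IntStream;

public class PointerJumping {
    public static void main(String[] args) {
        // parent[i] is i's parent; roots point to themselves. Here: a chain 0 <- 1 <- 2 <- ... <- 9.
        int[] parent = new int[10];
        for (int i = 1; i < parent.length; i++) parent[i] = i - 1;

        // Repeatedly replace each node's parent by its grandparent.
        int[] cur = parent;
        for (int round = 0; round < 32 && !allRoots(cur); round++) {
            final int[] in = cur;
            cur = IntStream.range(0, in.length)
                           .parallel()                       // one logical task per node
                           .map(i -> in[in[i]])              // jump: parent := parent of parent
                           .toArray();
        }
        System.out.println(Arrays.toString(cur));            // every entry is now 0, the root
    }

    static boolean allRoots(int[] p) {
        return IntStream.range(0, p.length).allMatch(i -> p[p[i]] == p[i]);
    }
}

Each round does O(N) work and there are O(log N) rounds, which matches the O(N log N) total cost versus O(log N) parallel running time discussed above.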

Pipeline pattern

Q: The computation involves performing similar sets of operations on many sets of data. Is there concurrency? How to exploit it?

- Examples: a factory assembly line, network packet processing, instruction processing in CPUs, etc.
- There are ordering constraints on the operations on any one set of data: operation C2 can be undertaken only after C1.
- Key requirement: the number of operations > 1.
Pipeline pattern: features

- Once the pipeline is full, maximum parallelism is observed.
- The number of stages should be small compared to the number of items processed.
- Efficiency improves if the time taken in each stage is roughly the same. Else?
- The amount of concurrency depends on the number of stages.
- Too many stages - disadvantage?
- Communication across stages?

Pipeline pattern: issues

- Error handling: create a separate task for error handling, which runs the exception routines.
- Processor allocation, load balancing.
- Throughput and latency. (A two-stage sketch follows below.)
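A small Java sketch (mine) of a two-stage pipeline: stage 1 produces items into a bounded queue, stage 2 consumes them, and a sentinel value marks the end of the stream. The stage bodies and the queue capacity are illustrative assumptions.

import java.util.concurrent.*;

public class TwoStagePipeline {
    static final int POISON = -1;                       // sentinel marking the end of the stream

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> q = new ArrayBlockingQueue<>(64);  // buffer between the stages

        Thread stage1 = new Thread(() -> {              // stage 1: produce items
            try {
                for (int i = 0; i < 1000; i++) q.put(i * i);
                q.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread stage2 = new Thread(() -> {              // stage 2: consume items
            try {
                long sum = 0;
                for (int v = q.take(); v != POISON; v = q.take()) sum += v;
                System.out.println("sum = " + sum);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        stage1.start(); stage2.start();                 // both stages run concurrently once the pipe fills
        stage1.join(); stage2.join();
    }
}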

Event-based coordination

Challenges
- Identifying the tasks.
- Identifying the event flow.
- Enforcing the event ordering.
- Avoiding deadlock.
- Efficient communication of events.

Left for self-reading.
Supporting structures

We have identified the concurrency and established an algorithm structure. Now, how to implement the algorithm?

Issues
- Clarity of abstraction - from algorithm to source code.
- Scalability - how many processors can it use?
- Efficiency - utilizing the resources of the computer efficiently. Example?
- Maintainability - is it easy to debug, verify and modify?
- Environment - hardware and programming environment.

SPMD pattern

- Each UE executes the same program, but has different data.
- UEs can follow different paths through the program. How?
- Code at different UEs can differentiate itself using a unique ID.
- Assumes that the underlying hardware of each UE is similar.

Challenges
- Interactions among the seemingly independent activities of the UEs.
- Clarity, scalability, efficiency, maintainability (1M cores), environment.
- How to handle code like initialization, finalization, etc.?

SPMD example

$\pi = \int_0^1 \frac{4}{1+x^2}\, dx$

int main () {
  // Initialization start
  int i;
  int numSteps = 1000000;
  double x, pi, step, sum = 0.0;
  step = 1.0/(double) numSteps;
  // Initialization end
  for (i = 0; i < numSteps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  // Finalization start
  pi = step * sum;
  printf("pi %lf\n", pi);
  return 0;
  // Finalization end
}
SPMD translation. Inefficient?

int main () {
  int i;
  int numSteps = 1000000;
  double x, pi, step, sum = 0.0;
  step = 1.0/(double) numSteps;
  int numProcs = numSteps;        // one UE per step
  int myID = getMyId();

  i = myID;                       // each UE computes a single term
  x = (i+0.5)*step;
  sum = sum + 4.0/(1.0+x*x);

  sum = step * sum;
  DoReductionOverAllProcs(&sum, &pi); // blocking.
  if (myID == 0) printf("pi %lf\n", pi);
  return 0;
}

SPMD translation. Better?

int main () {
  int i;
  int numSteps = 1000000;
  double x, pi, step, sum = 0.0;
  step = 1.0/(double) numSteps;
  int numProcs = getNumProcs();
  int myID = getMyId();

  int iStart = myID * (numSteps / numProcs);
  int iEnd = iStart + (numSteps / numProcs);
  if (myID == numProcs-1) iEnd = numSteps;   // last UE picks up the leftover iterations

  for (i = iStart; i < iEnd; ++i) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  sum = step * sum;
  DoReductionOverAllProcs(&sum, &pi); // blocking.
  if (myID == 0) printf("pi %lf\n", pi);
  return 0;
}

Master/Worker

Situation
- The workload of each task is variable and unpredictable (what if it is predictable?).
- Not easy to map to a loop-based computation.
- The underlying hardware units have different capacities.

Master/Worker pattern
- Has a logical master, and one or more instances of workers.
- The computation done by each worker may vary.
- The master starts the computation and creates a set of tasks.
- The master waits for the tasks to get over.
Master/Worker: issues

- Has good scalability if the number of tasks greatly exceeds the number of workers, and each worker gets roughly the same amount of work (why?).
- The size of the tasks should not be too small. Why?
- Can work with any hardware platform.
- How to detect completion? When can the workers shut down instead of waiting?
  - Easy if all tasks are ready before the workers start.
  - Use of a poison pill in the work queue.
  - What if the workers can also add tasks? Issues?
  - Issues with asynchronous message-passing systems?
- How to handle fault tolerance? - did the task finish?
- Variations: the master can also become a worker; a distributed task queue instead of a centralized task queue - (dis)advantages?
- Q: How to implement the set of tasks? What are the characteristics of this data structure?

Master/Worker: template for the master

int nTasks;    // Number of tasks
int nWorkers;  // Number of workers
public static SharedQueue taskQueue;    // global task queue
public static SharedQueue resultsQueue; // queue to hold results

void master() {
  // Create and initialize shared data structures
  taskQueue = new SharedQueue();
  resultsQueue = new SharedQueue();
  for (int i = 0; i < nTasks; i++)
    enqueue(taskQueue, i);
  // Create nWorkers threads
  ForkJoin(nWorkers);
  consumeResults(nTasks);
}

Master/Worker: ForkJoin

void ForkJoin(int nWorkers) throws InterruptedException {
  Thread[] t = new Thread[nWorkers];
  for (int i = 0; i < nWorkers; ++i) {
    t[i] = new Thread(new Worker());
    t[i].start();                       // fork
  }
  for (int i = 0; i < nWorkers; ++i) {
    t[i].join();                        // join
  }
}
Master/Worker: template for the worker

class Worker implements Runnable {
  public void run() {
    while (!Master.taskQueue.empty()) {
      // atomically dequeue a task.
      // do the computation.
      // add the result to Master.resultsQueue atomically.
    }
  }
}

(A complete, runnable version combining the master, ForkJoin and worker templates is sketched after the known uses below.)

Known uses
- SETI@HOME
- MapReduce
  - "Map" step: the master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker may again partition the problem - a multi-level tree structure. The worker node processes its smaller problem and passes the answer back to its master node.
  - "Reduce" step: the master node takes all the answers and combines them to produce the answer to the original problem.
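Filling in the templates, a self-contained Java version might look as follows (my sketch, not the lecture's code): a BlockingQueue plays the role of SharedQueue, and one poison pill per worker provides the completion detection mentioned on the issues slide; the task itself (squaring integers) is arbitrary.

import java.util.concurrent.*;

public class MasterWorker {
    static final int N_TASKS = 100, N_WORKERS = 4;
    static final int POISON = Integer.MIN_VALUE;                      // shut-down marker
    static final BlockingQueue<Integer> taskQueue = new LinkedBlockingQueue<>();
    static final BlockingQueue<Integer> resultsQueue = new LinkedBlockingQueue<>();

    static class Worker implements Runnable {
        public void run() {
            try {
                for (int task = taskQueue.take(); task != POISON; task = taskQueue.take())
                    resultsQueue.put(task * task);                    // "the computation"
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < N_TASKS; i++) taskQueue.put(i);           // master creates the tasks
        for (int i = 0; i < N_WORKERS; i++) taskQueue.put(POISON);    // one pill per worker

        Thread[] t = new Thread[N_WORKERS];                           // ForkJoin(nWorkers)
        for (int i = 0; i < N_WORKERS; i++) { t[i] = new Thread(new Worker()); t[i].start(); }
        for (Thread th : t) th.join();

        long sum = 0;                                                 // consumeResults(nTasks)
        for (int i = 0; i < N_TASKS; i++) sum += resultsQueue.take();
        System.out.println("sum of squares = " + sum);
    }
}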

Loop Parallelization

- A program has many computationally intensive loops with "independent" iterations.
- Goal: parallelize the loops and get most of the benefits.
- Very narrow focus.
- Typical application: scientific and high-performance computation.
- Impact of Amdahl's law?
- Quite amenable to refactoring-style incremental parallelization. Advantage?
- Impact on distributed memory systems?
- Good if the computation done in the iterations compensates for the cost of thread creation - how to improve the trade-off? Coalescing, merging.

Loop coalescing and merging for parallelization

Coalescing:

for (i : 0..m) {
  for (j : 0..n) {
    S
  }
}
-->
for (ij : 0..m*n) {
  j = ij % n;
  i = ij / n;
  S
}

Merging/Fusion:

for (i : 1..n) {
  S1
}
for (j : 1..n) {
  S2
}
-->
for (i : 1..n) {
  S1
  j = i;
  S2
}

A Java sketch of the coalesced form follows below.
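The coalescing transformation maps naturally onto a single parallel loop over the flattened index space. A Java sketch (mine) using a parallel stream; m, n and the loop body S are placeholders.

import java.util.stream.IntStream;

public class CoalescedLoop {
    public static void main(String[] args) {
        final int m = 1000, n = 1000;
        final double[][] a = new double[m][n];
        // Coalesced: one flat iteration space of m*n "independent" iterations,
        // giving the runtime many chunks to balance across the cores.
        IntStream.range(0, m * n).parallel().forEach(ij -> {
            int i = ij / n;                      // recover the original indices
            int j = ij % n;
            a[i][j] = i * 0.5 + j;               // the loop body S
        });
        System.out.println(a[m - 1][n - 1]);
    }
}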
Loop parallelization: issues

- Distributed memory architectures.
- False sharing: variables that do not need to be shared, but lie in the same cache line, can incur high overheads.
- Seen in systems with distributed, coherent caches.
- The caching protocol may force the reload of a cache line despite a lack of logical necessity.

Before (repeated writes to the shared array A):

foreach (j : [0..N]) {
  for (i = 0; i < M; i++) {
    A[j] += compute(j, i);
  }
}

After (accumulate in a task-local temporary):

foreach (j : [0..N]) {
  double tmp = 0;
  for (i = 0; i < M; i++) {
    tmp += compute(j, i);
  }
  atomic A[j] += tmp;
}

Loop parallelization: example

$\pi = \int_0^1 \frac{4}{1+x^2}\, dx$

Serial version:

int main () {
  int i, numSteps = 1000000;
  double x, pi, step, sum = 0.0;
  step = 1.0/(double)numSteps;
  for (i : [0..numSteps]) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
  printf("pi %lf\n", pi);
  return 0;
}

Parallel version:

int main () {
  int i, numSteps = 1000000;
  double pi, step, sum = 0.0;
  step = 1.0/(double)numSteps;
  forall (i : [0..numSteps]) {
    double x = (i+0.5)*step;
    double tmp = 4.0/(1.0+x*x);
    atomic sum = sum + tmp;
  }
  pi = step * sum;
  printf("pi %lf\n", pi);
  return 0;
}

Reading material: Automatic loop parallelization.

Fork/Join

- The number of concurrent tasks varies as the program executes.
- Parallelism beyond just loops.
- Tasks are created dynamically (beyond master/worker).
- One or more tasks wait for the created tasks to terminate.
- Each task may or may not result in an actual UE - a many-to-one mapping. Examples?
Fork/Join - example: Mergesort

int[] mergesort(int[] A, int L, int H) {
  if (H - L <= T) { quickSort(A, L, H); return A; } // small range: sort serially
  int m = (L + H) / 2;
  int[] A1 = mergesort(A, L, m);     // fork
  int[] A2 = mergesort(A, m + 1, H); // fork
  // join.
  return merge(A1, A2);              // returns a merged sorted array
}

Issues
- Cost.
- Alternatives? (A fork/join-framework version of the mergesort is sketched below, after the homework table.)

Supporting structures and algorithm structure

Homework

                   OpenMP   MPI     Java   X10   UPC   Cilk   Hadoop
SPMD               ★★★      ★★★★    ★★
Loop Parallelism   ★★★★     ★       ★★★
Master/Worker      ★★       ★★★     ★★★
Fork/Join          ★★★      ★★★★
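In Java, this pattern is typically written against the java.util.concurrent fork/join framework; a condensed sketch (mine, with an assumed threshold and Arrays.sort standing in for quickSort):

import java.util.Arrays;
import java.util.concurrent.*;

public class ParallelMergesort extends RecursiveTask<int[]> {
    static final int T = 1 << 13;                 // sequential-cutoff threshold (assumed)
    private final int[] a;
    private final int lo, hi;

    ParallelMergesort(int[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected int[] compute() {
        if (hi - lo <= T) {                       // small range: sort serially
            int[] part = Arrays.copyOfRange(a, lo, hi);
            Arrays.sort(part);
            return part;
        }
        int mid = (lo + hi) >>> 1;
        ParallelMergesort left = new ParallelMergesort(a, lo, mid);
        ParallelMergesort right = new ParallelMergesort(a, mid, hi);
        left.fork();                              // fork the left half
        int[] r = right.compute();                // compute the right half in this task
        int[] l = left.join();                    // join
        return merge(l, r);
    }

    static int[] merge(int[] l, int[] r) {        // serial merge of two sorted arrays
        int[] out = new int[l.length + r.length];
        int i = 0, j = 0, k = 0;
        while (i < l.length && j < r.length) out[k++] = (l[i] <= r[j]) ? l[i++] : r[j++];
        while (i < l.length) out[k++] = l[i++];
        while (j < r.length) out[k++] = r[j++];
        return out;
    }

    public static void main(String[] args) {
        int[] a = new java.util.Random(42).ints(1 << 20).toArray();
        int[] sorted = ForkJoinPool.commonPool().invoke(new ParallelMergesort(a, 0, a.length));
        System.out.println(sorted.length + " elements, sorted: "
                + java.util.stream.IntStream.range(1, sorted.length)
                      .allMatch(i -> sorted[i - 1] <= sorted[i]));
    }
}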

Shared Data

- Million dollar question: how to handle shared data?
- Managing shared data incurs overhead.
- Scalability can become an issue.
- Can lead to programmability issues.
- Avoid it if possible - by replication, privatization, reduction.
- Use appropriate concurrency control. Why? It should preserve the semantics, and it should not be too conservative.
- Shared data organization: distributed, or at a central location?
- A shared queue (remember master/worker?) is a type of shared data.
V.Krishna Nandivada (IIT Madras) CS6868 (IIT Madras) 67 / 46 V.Krishna Nandivada (IIT Madras) CS6868 (IIT Madras) 68 / 46
Issues with shared data

- Data race and interference: two activities access a shared datum, and at least one of the accesses is a write. The activities are said to interfere. For example:

forall (i : [1..n]) {
  sum += A[i];
}

- Dependencies: use synchronization (locks, barriers, atomics, ...) to enforce the dependencies. For example:

for (i : [1..n]) {
  forall (j = 1; j < m; ++j) {
    A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1])/3;
  }
}

- Task scheduling - tasks might be suspended while waiting for access to shared data. Minimize the wait.
- Deadlocks: two or more competing actions are each waiting for the other to finish. Example via nested locks: one task acquires lockA → lockB while another acquires lockB → lockA. One way to avoid this: define a partial order among the locks, and acquire locks only in an order that respects it (see the sketch below).
- Livelocks: the states of the processes involved constantly change with regard to one another, with none progressing. Example: recovery from deadlock - if more than one process takes action, the deadlock detection algorithm can be triggered repeatedly, leading to a livelock.
- Locality: trivial if data is not shared.
- Memory synchronization: needed when memory / caches are distributed. How to implement all-to-all synchronization?
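A Java sketch (mine) of the lock-ordering rule for avoiding the lockA → lockB / lockB → lockA cycle: both transfer directions acquire the two locks in one globally agreed order (here, by account id), so a cycle in the waits-for graph cannot arise.

public class LockOrdering {
    static class Account {
        final int id; long balance;
        Account(int id, long balance) { this.id = id; this.balance = balance; }
    }

    // Always lock the account with the smaller id first: a fixed order on the locks.
    static void transfer(Account from, Account to, long amount) {
        Account first  = from.id < to.id ? from : to;
        Account second = from.id < to.id ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Account a = new Account(1, 1000), b = new Account(2, 1000);
        Thread t1 = new Thread(() -> { for (int i = 0; i < 10000; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 10000; i++) transfer(b, a, 1); });
        t1.start(); t2.start(); t1.join(); t2.join();
        System.out.println(a.balance + " " + b.balance);   // 1000 1000, and no deadlock
    }
}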

Distributed Array

- Arrays are often partitioned between multiple tasks.
- Goal: efficient code, programmability.
- Distribute the arrays such that the elements needed by a task are "available" and "nearby".
- Array element redistribution?
- An abstraction is needed: a map from elements to places.
- Some standard distributions: blocked, cyclic, block-cyclic, unique (see the sketch below).
- Choosing a distribution.
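The "map from elements to places" can be written down directly. A tiny Java sketch (mine) of the blocked and cyclic distributions of N elements over P places:

public class Distributions {
    // Blocked: contiguous ranges of ceil(N/P) elements per place.
    static int blockedPlace(int i, int N, int P) {
        int blockSize = (N + P - 1) / P;
        return i / blockSize;
    }

    // Cyclic: element i goes to place i mod P (round-robin).
    static int cyclicPlace(int i, int P) {
        return i % P;
    }

    public static void main(String[] args) {
        int N = 16, P = 4;
        for (int i = 0; i < N; i++)
            System.out.printf("element %2d -> blocked place %d, cyclic place %d%n",
                              i, blockedPlace(i, N, P), cyclicPlace(i, P));
    }
}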
Implementation Mechanisms

UE management

- UE - unit of execution (a process / thread / activity).
- Difference between a process, a thread and an activity.
- Management = creation, execution, termination.
- Varies with different underlying languages.
- Go back to the first few lectures for a recap.

Synchronization: Memory synchronization and fences

Synchronization enforces constraints among parallel events.

// initially: done = false
Thread 1:  while (!done) ;     // spin until done becomes true
Thread 2:  done = true;

- The value may be present in a cache - cache coherence may take care of that.
- The value may be present in a register - the culprit here is the compiler.
- The value may not be (re-)read at all. How?

Initially x = y = 0.

Thread 1          Thread 2
1: r1 = x         4: x = 1
2: y = 1             r3 = y
3: r2 = x

Is r1 == r2 == r3 == 0 possible?
V.Krishna Nandivada (IIT Madras) CS6868 (IIT Madras) 75 / 46 V.Krishna Nandivada (IIT Madras) CS6868 (IIT Madras) 76 / 46
Synchronization: Memory synchronization and fences

- A memory fence guarantees that the UEs will see a consistent view of memory.
- Writes performed before the fence will be visible to reads performed after the fence.
- Reads performed after the fence will obtain a value written no earlier than the latest write before the fence.
- Only meaningful for shared memory.
- Explicit management can be error-prone. Higher-level alternatives: OpenMP flush, shared; Java volatile (see the sketch below). Read up yourselves.

Synchronization: Barriers

- A barrier is a synchronization point at which every member of a collection of UEs must arrive before any member can proceed.
- MPI_Barrier, join, finish, clocks, phasers.
- Implemented underneath via message passing.
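In Java terms, the spinning-flag example above is repaired by declaring the flag volatile, which inserts the necessary fences and forbids caching the value in a register; a minimal sketch (mine):

public class VolatileFlag {
    // Without 'volatile', the JIT may keep 'done' in a register and the reader can spin forever.
    static volatile boolean done = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!done) { /* spin */ }          // the volatile read sees the writer's update
            System.out.println("saw done == true");
        });
        reader.start();
        Thread.sleep(100);                         // let the reader start spinning
        done = true;                               // volatile write: visible to the reader
        reader.join();
    }
}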

Phasers

Power of phasers - pipeline parallelism

(Figure slides; thanks - Jun Shirako.)
Synchronization

- Memory fences.
- Barriers.
- Mutual exclusion: Java synchronized, omp_set_lock, omp_unset_lock.

Communication

- UEs need to exchange information.
- Shared memory - easy. The challenge is to synchronize the memory accesses so that the results are correct irrespective of the scheduling.
- Distributed memory - not much need for synchronization to protect the resources → communication plays a big role.
- One-to-one communication.
- Communication between all UEs in one event: collective communication.

Collective communication

When multiple UEs participate in a single communication event, the event is called a collective communication operation. Examples:

- Broadcast: a mechanism to send a single message to all UEs.
- Barrier: a synchronization point.
- Reduction: take a collection of objects, one from each UE, and "combine" them into a single value. Is the combined value present only on one UE, or on all UEs?
Serial reduction

- Reduction over n items takes n steps.
- Useful especially if the reduction operator is not associative.
- Only one UE knows the result.

Tree-based reduction

- Reduction over 2^n items takes n steps.
- What if the number of UEs < the number of data items?
- Only one UE knows the result.
- Requires the operator to be associative and commutative - or we don't care (example?).

Recursive doubling

- Reduction over 2^n items takes n steps.
- What if the number of UEs < the number of data items?
- All UEs know the result.
Sources

- Patterns for Parallel Programming: Mattson, Sanders, Massingill.
- multicoreinfo.com
- Wikipedia
- fixstars.com
- Jernej Barbic's slides.
- Loop Chunking in the presence of synchronization.
- Java Memory Model, JSR-133: "Java Memory Model and Thread Specification Revision".
