Lecture 4

This document discusses patterns for parallel programming. It describes a pattern language as a structured method for describing good design practices within a field of expertise, outlines an approach for finding concurrency in a problem by decomposing it into tasks and data, and gives examples of task and data decomposition for matrix multiplication, identifying independent parts that can run concurrently.

CS6868 - Concurrent Programming
Design patterns for parallel programs
V. Krishna Nandivada, IIT Madras

A pattern language

- Pattern: "a careful description of a perennial solution to a recurring problem within a ... context."
- Origin: Christopher Alexander, 1977, in the context of the design and construction of buildings and towns.
- Patterns in software engineering: Beck and Cunningham (1987); Gamma, Helm, Johnson, Vlissides (1995).
- Pattern language: a structured method of describing good design practices within a field of expertise.

A pattern language for parallel programs

- Finding Concurrency: structure the given problem to expose exploitable concurrency.
- Algorithm Structure: structure the algorithm to take advantage of the potential concurrency.
- Supporting Structures: help the algorithm to be implemented.
- Implementation Mechanisms: how the high-level specifications are mapped onto the underlying system.
- Goal: identify patterns in each stage.

Finding concurrency in a given problem - deep dive
Decomposition Patterns

- Task decomposition: break the program into a sequence of "tasks"; some of the tasks can run in parallel. The more independent the tasks, the better.
- Data decomposition: focus on the data used by the program; decompose the program into tasks based on distinct chunks of data. Efficiency depends on the independence of the chunks.
- Task decomposition may lead to data decomposition, and vice versa.
- Note the relation between the tasks and the ease of programming, debugging and maintenance.

Task decomposition: An approach

- Identify the "resource" intensive parts of the problem.
- Identify the different tasks that make up the problem. Challenge: write the algorithms so that the tasks can run concurrently.
- Sometimes the problem naturally breaks into a collection of (nearly) independent tasks; sometimes not!
- Q: Are there enough tasks to keep all the H/W cores busy?
- Q: Does each task have enough work to keep an individual core busy?
- Q: Is the number of tasks dependent on, or independent of, the number of H/W cores?
- Q: Are these tasks relatively independent? Are they really independent?
- Instances of tasks: independent modules, loop iterations.

Task decomposition: Matrix multiplication example

C = A × B, with
$C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \times B_{k,j}$

- "Resource" intensive parts?
- Tasks in the problem?
- Are the tasks independent? Enough tasks for all the cores? Enough work for each task? Size of tasks versus number of cores?
- Each element C_{i,j} is computed in a different task - row major (see the sketch below).
- Each element C_{i,j} is computed in a different task - column major.
- Each element C_{i,j} is computed in a different task - diagonals.
- How to reason about performance? Cache effects?
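To make the row-major task decomposition concrete, here is a minimal Java sketch (mine, not from the slides) in which each row of C is one task submitted to a thread pool; the matrix size, the fixed thread pool, and the timeout are illustrative assumptions.

import java.util.concurrent.*;

public class RowTaskMatMul {
    public static void main(String[] args) throws InterruptedException {
        final int N = 512;                        // assumed matrix size
        final double[][] A = new double[N][N], B = new double[N][N], C = new double[N][N];
        // ... fill A and B ...
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < N; i++) {
            final int row = i;                    // one task per row of C (row-major decomposition)
            pool.submit(() -> {
                for (int j = 0; j < N; j++) {
                    double s = 0.0;
                    for (int k = 0; k < N; k++) s += A[row][k] * B[k][j];
                    C[row][j] = s;                // rows of C are disjoint: no synchronization needed
                }
            });
        }
        pool.shutdown();                          // no new tasks; wait for all rows to finish
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}

Because the rows of C are disjoint, the tasks never write to the same location; reasoning about cache behaviour for row-major versus column-major versus diagonal task shapes is exactly the question the slide raises.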
Data decomposition: Design

- Besides identifying the "resource" intensive parts, identify the key data structures required to solve the problem and how the data is used during the solution.
- Q: Is the decomposition suitable for a specific system, or for many systems?
- Q: Does it scale with the size of the parallel computer?
- Are similar operations applied to different parts of the data, independently?
- Are there different chunks of data that can be distributed?
- Are the operations (reads/writes) applied to independent parts of the data?
- Are the data chunks big enough for the thread activity to be beneficial?
- How to decompose? Examples:
  - Array-based computations: concurrency is defined in terms of updates to different segments of the array/matrix.
  - Recursive data structures: concurrency is obtained by decomposing the parallel updates of a large tree/graph/linked list.
- Note the relation between the decomposition and the ease of programming, debugging and maintenance.
- Note: data decomposition also leads to task decomposition.

Data decomposition: Matrix multiplication example

C = A × B, with
$C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \times B_{k,j}$

- "Resource" intensive parts?
- Data chunks in the problem?
- Does it scale with the size of the parallel computer?
- Each row of C is computed in a different task.
- Each column of C is computed in a different task.
- Performance? Cache effects?

Matrix multiplication: Data decomposition

$$
C = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix}
    \times
    \begin{pmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{pmatrix}
  = \begin{pmatrix}
      A_{1,1} \times B_{1,1} + A_{1,2} \times B_{2,1} & A_{1,1} \times B_{1,2} + A_{1,2} \times B_{2,2} \\
      A_{2,1} \times B_{1,1} + A_{2,2} \times B_{2,1} & A_{2,1} \times B_{1,2} + A_{2,2} \times B_{2,2}
    \end{pmatrix}
$$

Advantages
- The blocks can fit into the cache.
- Can scale as per the hardware.
- Allows overlap of communication and computation. (A block-per-task sketch is given below.)
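A corresponding Java sketch (again mine, not the lecture's code) of the 2x2 block data decomposition shown above: each block of C is one task, and each task reads only the A and B blocks it needs. The sizes and the executor are assumptions for illustration.

import java.util.concurrent.*;

public class BlockMatMul {
    // Multiply-accumulate one block: C[bi,bj] += A[bi,bk] * B[bk,bj], with blocks of size S.
    static void mulAdd(double[][] A, double[][] B, double[][] C, int bi, int bj, int bk, int S) {
        for (int i = bi * S; i < (bi + 1) * S; i++)
            for (int k = bk * S; k < (bk + 1) * S; k++)
                for (int j = bj * S; j < (bj + 1) * S; j++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    public static void main(String[] args) throws InterruptedException {
        final int N = 512, S = N / 2;             // 2 x 2 blocks, as on the slide
        double[][] A = new double[N][N], B = new double[N][N], C = new double[N][N];
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int bi = 0; bi < 2; bi++)
            for (int bj = 0; bj < 2; bj++) {
                final int fi = bi, fj = bj;
                pool.submit(() -> {               // one task per C block; C blocks are disjoint
                    mulAdd(A, B, C, fi, fj, 0, S);
                    mulAdd(A, B, C, fi, fj, 1, S);
                });
            }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}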
Dependence analysis for managing parallelism: Grouping

Background: task and data decomposition has been done.

- All the identified tasks may not run in parallel.
- Q: How should related tasks be grouped to help manage the dependencies?
- Dependent, related tasks should be (uniquely?) grouped together.
- Temporal dependency: if task A depends on the result of task B, then A must wait for the results from B. Q: Does A have to wait for B to terminate?
- Concurrent dependency: tasks are expected to run in parallel, and one depends on the updates of the other.
- Independent tasks: can run in parallel or in sequence. Is it always better to run them in parallel?
- Advantages of grouping: grouping enforces partial orders between tasks, and the application developer thinks in terms of groups instead of individual tasks.
- Example: computing individual rows.

Dependence analysis for managing parallelism: Ordering

Background: task and data decomposition has been done; dependent tasks have been grouped together.

- Ordering the tasks and groups is not trivial.
- Q: How should the groups be ordered so as to satisfy the constraints among the groups and, in turn, among the tasks?
- Dependent groups and tasks should be ordered so as to preserve the original semantics.
- The ordering should not be overly restrictive.
- Ordering is imposed by data and control dependencies.
- Ordering can also be imposed by external factors: network, I/O, and so on.
- Ordering of independent tasks?
- Importance of ordering: it ensures the program semantics, and it is a key step in program design.

Dependence analysis for managing parallelism: Data sharing

Background: task and data decomposition has been done; dependent tasks have been grouped together; the ordering between the groups and tasks has been identified.

- Groups and tasks have some level of dependency among each other.
- Q: How is data shared among the tasks?
- Identify the data updated/needed by individual tasks - task-local data.
- Some data may be updated by multiple tasks - global data.
- Some data may be updated by one task and used by multiple tasks - remote data.

Issues in data sharing

- Identify the data being shared - it follows directly from the decomposition.
- If sharing is done incorrectly, a task may get invalid data due to a race condition.
- A naive way to guarantee correct shared data: synchronize every read with barriers.
- Synchronizing data across different tasks may require communication. Options:
  - Overlap communication and computation.
  - Privatization.
  - Keep local copies of shared data.
One special case of sharing

- Accumulation/Reduction: data is used to accumulate a result - sum, minimum, maximum, variance, etc.
- Each core has a separate copy of the data, and accumulation happens in these local copies.
- The sub-results are then used to compute the final result.
- Example: sum the elements of an array A[1024] (see the sketch below).
  - Decompose the array into 32 chunks.
  - Accumulate each chunk separately.
  - Accumulate the sub-results into the global "sum".
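The array-sum example might look as follows in Java (my sketch; the 1024-element array and 32 chunks follow the slide, the rest is assumed): each thread accumulates into a private slot, and only the final, serial combination produces the global sum.

public class ChunkedSum {
    public static void main(String[] args) throws InterruptedException {
        final double[] A = new double[1024];
        java.util.Arrays.fill(A, 1.0);
        final int CHUNKS = 32, LEN = A.length / CHUNKS;
        final double[] partial = new double[CHUNKS];      // one local accumulator per chunk
        Thread[] t = new Thread[CHUNKS];
        for (int c = 0; c < CHUNKS; c++) {
            final int chunk = c;
            t[c] = new Thread(() -> {
                double s = 0.0;
                for (int i = chunk * LEN; i < (chunk + 1) * LEN; i++) s += A[i];
                partial[chunk] = s;                       // no sharing: each thread writes its own slot
            });
            t[c].start();
        }
        for (Thread th : t) th.join();
        double sum = 0.0;                                 // combine the sub-results serially
        for (double p : partial) sum += p;
        System.out.println("sum = " + sum);
    }
}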

Managing parallelism - design evaluation

Background: task and data decomposition has been done; dependent tasks have been grouped together; the ordering between the groups and tasks has been identified; a scheme for data sharing has also been identified.

- Of the multiple choices present at different points, we have chosen one.
- Q: Is the chosen path a "good" one?

Design evaluation factors

Suitability to the target platform (at a high level)
- Number of cores / HW threads - too few/too many tasks?
- Homogeneous or heterogeneous multi-cores? And the work distribution.
- Data distribution among the cores - equal/unequal?
- Cost of communication - fine- or coarse-grained data sharing.
- Amount of sharing - shared memory or distributed memory.

Metrics: simplicity (qualitative), efficiency, flexibility.

Flexibility
- Flexible/parametric over the number of cores/threads?
- Flexible/parametric over the number and size of data chunks?
- Does it handle boundary cases?

Efficiency
- Even load balancing?
- Minimum overhead? - task creation, synchronization, communication.
Algorithm Structure - deep dive

Task Parallelism

Q: A problem is best decomposed into a collection of tasks that can execute concurrently. How to exploit the concurrency efficiently?

- The problem can be decomposed into a collection of concurrent tasks.
- Tasks can be completely independent or can have dependencies.
- Tasks can be known from the beginning, or created dynamically (e.g., producer/consumer).
- The solution may or may not require all the tasks to finish.
- Challenges: assign tasks to cores so as to obtain a simple, flexible and efficient execution, and address the dependencies correctly.

Factors in efficient task-parallel algorithm design

Tasks
1. Enough tasks to keep the cores busy.
2. The advantage of creating the tasks should offset the overhead of creating and managing them.

Dependencies
1. Ordering constraints.
2. Dependencies from shared data: synchronization, private data.

Schedule: creation and scheduling
1. How are the tasks assigned to cores?
2. How are the tasks scheduled?
Example: Task parallel algorithm

Schedule jobs on machines, given the job costs:

Machine    Job1   Job2   Job3   Job4
M1          4      4      3      5
M2          2      3      4      4

Say Job1 to M1, Job2 to M2, Job3 to M1, Job4 to M2 = 7.

Solution to Branch and Bound ILP

- Maintain a list of tasks (partial solutions).
- Remove a partial solution from the list.
- Examine it: either discard it, declare it a solution, or add sub-problems to the task list.
- The tasks depend on each other through the task list (see the sketch below).
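A rough Java sketch (mine, not the lecture's code) of the shared task-list idea for this job-assignment problem: workers repeatedly remove a partial solution, prune it against the best makespan found so far, record it if it is complete, or push its two expansions back onto the list. The Node record, the blocking queue and the pending counter are all illustrative choices; for the slide's cost table it reports the makespan-7 assignment.

import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

public class BranchAndBound {
    // Job costs from the slide: COST[m][j] = time of job j on machine m.
    static final int[][] COST = { {4, 4, 3, 5}, {2, 3, 4, 4} };
    static final int JOBS = 4;

    // A partial solution: jobs 0..depth-1 already assigned; loadM is the current time on machine M.
    record Node(int depth, int load0, int load1) {}

    static final AtomicInteger best = new AtomicInteger(Integer.MAX_VALUE); // best makespan so far

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Node> tasks = new LinkedBlockingQueue<>();   // the shared task list
        tasks.add(new Node(0, 0, 0));
        AtomicInteger pending = new AtomicInteger(1);              // nodes not yet examined
        int nWorkers = 4;
        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        for (int w = 0; w < nWorkers; w++) {
            pool.submit(() -> {
                try {
                    while (pending.get() > 0) {
                        Node n = tasks.poll(10, TimeUnit.MILLISECONDS);
                        if (n == null) continue;
                        int bound = Math.max(n.load0(), n.load1());
                        if (n.depth() == JOBS) {
                            best.accumulateAndGet(bound, Math::min); // complete assignment: maybe a new best
                        } else if (bound < best.get()) {             // still promising: branch on the next job
                            pending.addAndGet(2);
                            tasks.add(new Node(n.depth() + 1, n.load0() + COST[0][n.depth()], n.load1()));
                            tasks.add(new Node(n.depth() + 1, n.load0(), n.load1() + COST[1][n.depth()]));
                        }                                            // otherwise: prune (bound >= best)
                        pending.decrementAndGet();
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("best makespan = " + best.get());
    }
}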

Divide and conquer

Q: Tasks are created recursively to solve a problem with a divide-and-conquer strategy. How to exploit the concurrency?

- Divide and conquer: the problem is solved by splitting it into a number of smaller subproblems. Examples?
- Each subproblem can be solved "fairly" independently - either directly, or by further divide and conquer.
- The solutions of the smaller problems are merged to compute the final solution.
- Each divide doubles the concurrency.
- Each merge halves the concurrency.
Divide and Conquer pattern: features

- The amount of exploitable concurrency varies: at the beginning and at the end there is very little of it.
- Note: "split" and "merge" are serial parts.
- Amdahl's law - speedup is constrained by the serial part. Impact?
- Too many parallel threads?
- What if the cores are distributed? - data movement?
- Tasks are created dynamically - load balancing?
- What if the sub-problems are not equal-sized?

Divide and conquer - example: Mergesort

int[] mergesort(int[] A, int L, int H) {
  if (H - L <= 1) return A;                         // base case: at most one element
  if (H - L <= T) { quickSort(A, L, H); return A; } // small range: sort serially
  int m = (L + H) / 2;
  int[] A1 = mergesort(A, L, m);
  int[] A2 = mergesort(A, m + 1, H);
  return merge(A1, A2);                             // returns a merged sorted array
}

- split cost? merge cost?
- Value of the threshold T?

Geometric decomposition

Q: How can an algorithm be organized around a data structure that has been decomposed into concurrently updatable "chunks"?

- Similar to decomposing a geometric region into subregions.
- Linear data structures (such as arrays) can often be decomposed into contiguous sub-structures.
- These individual chunks are processed in different concurrent tasks.
- Note: sometimes all the required data for a task is present "locally" (embarrassingly parallel - the Task Parallelism pattern), and sometimes tasks share data with "neighboring" chunks (see the sketch below).

Challenges
- Ensure that each task has access to all the data it needs.
- Mapping of chunks to cores giving good performance. Q: Why is it a challenge?
- Granularity of the decomposition (coarse- or fine-grain) - effect on efficiency? Parametric? Tweaked at compile time or at runtime?
- Shape of the chunk: regular/irregular?
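A minimal Java sketch (mine) of a geometric decomposition on a 1-D array: the array is split into contiguous chunks, one task per chunk, and each task also reads the boundary elements of its neighbouring chunks; double-buffering (read from one array, write to the other) avoids synchronizing on those shared boundary reads. The sizes and the three-point averaging body are assumptions.

public class Stencil1D {
    public static void main(String[] args) throws InterruptedException {
        final int N = 1 << 20, CHUNKS = 8, LEN = N / CHUNKS;
        double[] cur = new double[N], next = new double[N];
        cur[N / 2] = 1.0;                                   // some initial data
        for (int step = 0; step < 100; step++) {
            final double[] in = cur, out = next;
            Thread[] t = new Thread[CHUNKS];
            for (int c = 0; c < CHUNKS; c++) {
                final int lo = c * LEN, hi = lo + LEN;      // this task's contiguous chunk
                t[c] = new Thread(() -> {
                    for (int i = Math.max(lo, 1); i < Math.min(hi, in.length - 1); i++)
                        // in[i-1] and in[i+1] may belong to neighbouring chunks, but 'in'
                        // is read-only during this step, so no locking is needed.
                        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
                });
                t[c].start();
            }
            for (Thread th : t) th.join();                  // barrier between steps
            double[] tmp = cur; cur = next; next = tmp;     // swap the buffers
        }
        System.out.println(cur[N / 2]);
    }
}

Re-creating threads every step is wasteful; a real implementation would reuse a pool or a barrier, but the chunk-plus-boundary structure is the point here.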
Geometric decomposition: Matrix multiplication

$$
C = A \times B
  = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix}
    \times
    \begin{pmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{pmatrix}
  = \begin{pmatrix}
      A_{1,1} \times B_{1,1} + A_{1,2} \times B_{2,1} & A_{1,1} \times B_{1,2} + A_{1,2} \times B_{2,2} \\
      A_{2,1} \times B_{1,1} + A_{2,2} \times B_{2,1} & A_{2,1} \times B_{1,2} + A_{2,2} \times B_{2,2}
    \end{pmatrix}
$$

Recursive Data Pattern

Q: How can recursive data structures be partitioned so that operations on them can be performed in parallel?

- Linked lists, trees, graphs, ...
- Operations on recursive data structures are inherently serial - one has to move through the data structure sequentially; for example, traversing a linked list or a binary tree.
- Sometimes it is possible to reshape the operations to derive and exploit concurrency.

Recursive Data Pattern - example: Find roots

- Given a forest of rooted trees: compute the root of each node.
- Serial version: do a depth-first or breadth-first traversal from each root to the leaf nodes; for each visited node, set its root. Total running time?
- Q: Is there concurrency?
Recursive data structures: Parallel find roots

Parallelizing recursive data structures

- Recasting the problem increases the total cost; find a way to win it back.
- Effective exploitation of the derived concurrency depends on factors such as the amount of work available for each task and the amount of serial code.
- Restructuring may make the solution complex.
- Requirement of synchronization - why?
- Another example: find partial sums in a linked list.
- The original serial computation is transformed into one where we compute partial results and repeatedly combine them.
- Total cost? O(N log N).
- However, if we exploit the parallelism, the running time comes down to O(log N) (see the sketch below).
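The slide's figure is not reproduced here; the usual way to parallelize find-roots is pointer jumping, where every node repeatedly replaces its parent by its grandparent. A rough Java sketch of that idea (my reconstruction, with a parallel stream standing in for one task per node):

import java.util.Arrays;
import java.util.stream.IntStream;

public class PointerJumping {
    public static void main(String[] args) {
        // parent[i] is i's parent; roots point to themselves. Here: a chain 0 <- 1 <- 2 <- ... <- 9.
        int[] parent = new int[10];
        for (int i = 1; i < parent.length; i++) parent[i] = i - 1;

        // Repeatedly replace each node's parent by its grandparent.
        int[] cur = parent;
        for (int round = 0; round < 32 && !allRoots(cur); round++) {
            final int[] in = cur;
            cur = IntStream.range(0, in.length)
                           .parallel()                       // one logical task per node
                           .map(i -> in[in[i]])              // jump: parent := parent of parent
                           .toArray();
        }
        System.out.println(Arrays.toString(cur));            // every entry is now 0, the root
    }

    static boolean allRoots(int[] p) {
        return IntStream.range(0, p.length).allMatch(i -> p[p[i]] == p[i]);
    }
}

Each round does O(N) work and there are O(log N) rounds, which matches the O(N log N) total cost versus O(log N) parallel running time discussed above.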

Pipeline pattern

Q: The computation involves performing similar sets of operations on many sets of data. Is there concurrency? How to exploit it?

- Examples: a factory assembly line, network packet processing, instruction processing in CPUs, etc.
- There are ordering constraints on the operations on any one set of data: operation C2 can be undertaken only after C1.
- Key requirement: the number of operations > 1.
Pipeline pattern: features

- Once the pipeline is full, maximum parallelism is observed.
- The number of stages should be small compared to the number of items processed.
- Efficiency improves if the time taken in each stage is roughly the same. Else?
- The amount of concurrency depends on the number of stages.
- Too many stages - disadvantage?
- Communication across stages?

Pipeline pattern: issues

- Error handling: create a separate task for error handling, which runs the exception routines.
- Processor allocation, load balancing.
- Throughput and latency. (A two-stage sketch follows below.)
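A small Java sketch (mine) of a two-stage pipeline: stage 1 produces items into a bounded queue, stage 2 consumes them, and a sentinel value marks the end of the stream. The stage bodies and the queue capacity are illustrative assumptions.

import java.util.concurrent.*;

public class TwoStagePipeline {
    static final int POISON = -1;                       // sentinel marking the end of the stream

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> q = new ArrayBlockingQueue<>(64);  // buffer between the stages

        Thread stage1 = new Thread(() -> {              // stage 1: produce items
            try {
                for (int i = 0; i < 1000; i++) q.put(i * i);
                q.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread stage2 = new Thread(() -> {              // stage 2: consume items
            try {
                long sum = 0;
                for (int v = q.take(); v != POISON; v = q.take()) sum += v;
                System.out.println("sum = " + sum);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        stage1.start(); stage2.start();                 // both stages run concurrently once the pipe fills
        stage1.join(); stage2.join();
    }
}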

Event-based coordination

Challenges
- Identifying the tasks.
- Identifying the event flow.
- Enforcing the event ordering.
- Avoiding deadlock.
- Efficient communication of events.

Left for self-reading.
Supporting structures

We have identified the concurrency and established an algorithm structure. Now, how to implement the algorithm?

Issues
- Clarity of abstraction - from algorithm to source code.
- Scalability - how many processors can it use?
- Efficiency - utilizing the resources of the computer efficiently. Example?
- Maintainability - is it easy to debug, verify and modify?
- Environment - hardware and programming environment.

SPMD pattern

- Each UE executes the same program, but has different data.
- UEs can follow different paths through the program. How?
- Code at different UEs can differentiate itself using a unique ID.
- Assumes that the underlying hardware of each UE is similar.

Challenges
- Interactions among the seemingly independent activities of the UEs.
- Clarity, scalability, efficiency, maintainability (1M cores), environment.
- How to handle code like initialization, finalization, etc.?

SPMD example

$\pi = \int_0^1 \frac{4}{1+x^2}\, dx$

int main () {
  // Initialization start
  int i;
  int numSteps = 1000000;
  double x, pi, step, sum = 0.0;
  step = 1.0/(double) numSteps;
  // Initialization end
  for (i = 0; i < numSteps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  // Finalization start
  pi = step * sum;
  printf("pi %lf\n", pi);
  return 0;
  // Finalization end
}
SPMD translation. Inefficient?

int main () {
  int i;
  int numSteps = 1000000;
  double x, pi, step, sum = 0.0;
  step = 1.0/(double) numSteps;
  int numProcs = numSteps;        // one UE per step
  int myID = getMyId();

  i = myID;                       // each UE computes a single term
  x = (i+0.5)*step;
  sum = sum + 4.0/(1.0+x*x);

  sum = step * sum;
  DoReductionOverAllProcs(&sum, &pi); // blocking.
  if (myID == 0) printf("pi %lf\n", pi);
  return 0;
}

SPMD translation. Better?

int main () {
  int i;
  int numSteps = 1000000;
  double x, pi, step, sum = 0.0;
  step = 1.0/(double) numSteps;
  int numProcs = getNumProcs();
  int myID = getMyId();

  int iStart = myID * (numSteps / numProcs);
  int iEnd = iStart + (numSteps / numProcs);
  if (myID == numProcs-1) iEnd = numSteps;   // last UE picks up the leftover iterations

  for (i = iStart; i < iEnd; ++i) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  sum = step * sum;
  DoReductionOverAllProcs(&sum, &pi); // blocking.
  if (myID == 0) printf("pi %lf\n", pi);
  return 0;
}

Master/Worker

Situation
- The workload of each task is variable and unpredictable (what if it is predictable?).
- Not easy to map to a loop-based computation.
- The underlying hardware units have different capacities.

Master/Worker pattern
- Has a logical master, and one or more instances of workers.
- The computation done by each worker may vary.
- The master starts the computation and creates a set of tasks.
- The master waits for the tasks to get over.
Master/Worker: issues

- Has good scalability if the number of tasks greatly exceeds the number of workers, and each worker gets roughly the same amount of work (why?).
- The size of the tasks should not be too small. Why?
- Can work with any hardware platform.
- How to detect completion? When can the workers shut down instead of waiting?
  - Easy if all tasks are ready before the workers start.
  - Use of a poison pill in the work queue.
  - What if the workers can also add tasks? Issues?
  - Issues with asynchronous message-passing systems?
- How to handle fault tolerance? - did the task finish?
- Variations: the master can also become a worker; a distributed task queue instead of a centralized task queue - (dis)advantages?
- Q: How to implement the set of tasks? What are the characteristics of this data structure?

Master/Worker: template for the master

int nTasks;    // Number of tasks
int nWorkers;  // Number of workers
public static SharedQueue taskQueue;    // global task queue
public static SharedQueue resultsQueue; // queue to hold results

void master() {
  // Create and initialize shared data structures
  taskQueue = new SharedQueue();
  resultsQueue = new SharedQueue();
  for (int i = 0; i < nTasks; i++)
    enqueue(taskQueue, i);
  // Create nWorkers threads
  ForkJoin(nWorkers);
  consumeResults(nTasks);
}

Master/Worker: ForkJoin

void ForkJoin(int nWorkers) throws InterruptedException {
  Thread[] t = new Thread[nWorkers];
  for (int i = 0; i < nWorkers; ++i) {
    t[i] = new Thread(new Worker());
    t[i].start();                       // fork
  }
  for (int i = 0; i < nWorkers; ++i) {
    t[i].join();                        // join
  }
}
Master/Worker: template for the worker

class Worker implements Runnable {
  public void run() {
    while (!Master.taskQueue.empty()) {
      // atomically dequeue a task.
      // do the computation.
      // add the result to Master.resultsQueue atomically.
    }
  }
}

(A complete, runnable version combining the master, ForkJoin and worker templates is sketched after the known uses below.)

Known uses
- SETI@HOME
- MapReduce
  - "Map" step: the master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker may again partition the problem - a multi-level tree structure. The worker node processes its smaller problem and passes the answer back to its master node.
  - "Reduce" step: the master node takes all the answers and combines them to produce the answer to the original problem.
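Filling in the templates, a self-contained Java version might look as follows (my sketch, not the lecture's code): a BlockingQueue plays the role of SharedQueue, and one poison pill per worker provides the completion detection mentioned on the issues slide; the task itself (squaring integers) is arbitrary.

import java.util.concurrent.*;

public class MasterWorker {
    static final int N_TASKS = 100, N_WORKERS = 4;
    static final int POISON = Integer.MIN_VALUE;                      // shut-down marker
    static final BlockingQueue<Integer> taskQueue = new LinkedBlockingQueue<>();
    static final BlockingQueue<Integer> resultsQueue = new LinkedBlockingQueue<>();

    static class Worker implements Runnable {
        public void run() {
            try {
                for (int task = taskQueue.take(); task != POISON; task = taskQueue.take())
                    resultsQueue.put(task * task);                    // "the computation"
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < N_TASKS; i++) taskQueue.put(i);           // master creates the tasks
        for (int i = 0; i < N_WORKERS; i++) taskQueue.put(POISON);    // one pill per worker

        Thread[] t = new Thread[N_WORKERS];                           // ForkJoin(nWorkers)
        for (int i = 0; i < N_WORKERS; i++) { t[i] = new Thread(new Worker()); t[i].start(); }
        for (Thread th : t) th.join();

        long sum = 0;                                                 // consumeResults(nTasks)
        for (int i = 0; i < N_TASKS; i++) sum += resultsQueue.take();
        System.out.println("sum of squares = " + sum);
    }
}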

Loop Parallelization

- A program has many computationally intensive loops with "independent" iterations.
- Goal: parallelize the loops and get most of the benefits.
- Very narrow focus.
- Typical application: scientific and high-performance computation.
- Impact of Amdahl's law?
- Quite amenable to refactoring-style incremental parallelization. Advantage?
- Impact on distributed memory systems?
- Good if the computation done in the iterations compensates for the cost of thread creation - how to improve the trade-off? Coalescing, merging.

Loop coalescing and merging for parallelization

Coalescing:

for (i : 0..m) {
  for (j : 0..n) {
    S
  }
}
-->
for (ij : 0..m*n) {
  j = ij % n;
  i = ij / n;
  S
}

Merging/Fusion:

for (i : 1..n) {
  S1
}
for (j : 1..n) {
  S2
}
-->
for (i : 1..n) {
  S1
  j = i;
  S2
}

A Java sketch of the coalesced form follows below.
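The coalescing transformation maps naturally onto a single parallel loop over the flattened index space. A Java sketch (mine) using a parallel stream; m, n and the loop body S are placeholders.

import java.util.stream.IntStream;

public class CoalescedLoop {
    public static void main(String[] args) {
        final int m = 1000, n = 1000;
        final double[][] a = new double[m][n];
        // Coalesced: one flat iteration space of m*n "independent" iterations,
        // giving the runtime many chunks to balance across the cores.
        IntStream.range(0, m * n).parallel().forEach(ij -> {
            int i = ij / n;                      // recover the original indices
            int j = ij % n;
            a[i][j] = i * 0.5 + j;               // the loop body S
        });
        System.out.println(a[m - 1][n - 1]);
    }
}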
Loop parallelization: issues

- Distributed memory architectures.
- False sharing: variables that do not need to be shared, but lie in the same cache line, can incur high overheads.
- Seen in systems with distributed, coherent caches.
- The caching protocol may force the reload of a cache line despite a lack of logical necessity.

Before (repeated writes to the shared array A):

foreach (j : [0..N]) {
  for (i = 0; i < M; i++) {
    A[j] += compute(j, i);
  }
}

After (accumulate in a task-local temporary):

foreach (j : [0..N]) {
  double tmp = 0;
  for (i = 0; i < M; i++) {
    tmp += compute(j, i);
  }
  atomic A[j] += tmp;
}

Loop parallelization: example

$\pi = \int_0^1 \frac{4}{1+x^2}\, dx$

Serial version:

int main () {
  int i, numSteps = 1000000;
  double x, pi, step, sum = 0.0;
  step = 1.0/(double)numSteps;
  for (i : [0..numSteps]) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
  printf("pi %lf\n", pi);
  return 0;
}

Parallel version:

int main () {
  int i, numSteps = 1000000;
  double pi, step, sum = 0.0;
  step = 1.0/(double)numSteps;
  forall (i : [0..numSteps]) {
    double x = (i+0.5)*step;
    double tmp = 4.0/(1.0+x*x);
    atomic sum = sum + tmp;
  }
  pi = step * sum;
  printf("pi %lf\n", pi);
  return 0;
}

Reading material: Automatic loop parallelization.

Fork/Join

- The number of concurrent tasks varies as the program executes.
- Parallelism beyond just loops.
- Tasks are created dynamically (beyond master/worker).
- One or more tasks wait for the created tasks to terminate.
- Each task may or may not result in an actual UE - a many-to-one mapping. Examples?
Fork/Join - example: Mergesort

int[] mergesort(int[] A, int L, int H) {
  if (H - L <= T) { quickSort(A, L, H); return A; } // small range: sort serially
  int m = (L + H) / 2;
  int[] A1 = mergesort(A, L, m);     // fork
  int[] A2 = mergesort(A, m + 1, H); // fork
  // join.
  return merge(A1, A2);              // returns a merged sorted array
}

Issues
- Cost.
- Alternatives? (A fork/join-framework version of the mergesort is sketched below, after the homework table.)

Supporting structures and algorithm structure

Homework

                   OpenMP   MPI     Java   X10   UPC   Cilk   Hadoop
SPMD               ★★★      ★★★★    ★★
Loop Parallelism   ★★★★     ★       ★★★
Master/Worker      ★★       ★★★     ★★★
Fork/Join          ★★★      ★★★★
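In Java, this pattern is typically written against the java.util.concurrent fork/join framework; a condensed sketch (mine, with an assumed threshold and Arrays.sort standing in for quickSort):

import java.util.Arrays;
import java.util.concurrent.*;

public class ParallelMergesort extends RecursiveTask<int[]> {
    static final int T = 1 << 13;                 // sequential-cutoff threshold (assumed)
    private final int[] a;
    private final int lo, hi;

    ParallelMergesort(int[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected int[] compute() {
        if (hi - lo <= T) {                       // small range: sort serially
            int[] part = Arrays.copyOfRange(a, lo, hi);
            Arrays.sort(part);
            return part;
        }
        int mid = (lo + hi) >>> 1;
        ParallelMergesort left = new ParallelMergesort(a, lo, mid);
        ParallelMergesort right = new ParallelMergesort(a, mid, hi);
        left.fork();                              // fork the left half
        int[] r = right.compute();                // compute the right half in this task
        int[] l = left.join();                    // join
        return merge(l, r);
    }

    static int[] merge(int[] l, int[] r) {        // serial merge of two sorted arrays
        int[] out = new int[l.length + r.length];
        int i = 0, j = 0, k = 0;
        while (i < l.length && j < r.length) out[k++] = (l[i] <= r[j]) ? l[i++] : r[j++];
        while (i < l.length) out[k++] = l[i++];
        while (j < r.length) out[k++] = r[j++];
        return out;
    }

    public static void main(String[] args) {
        int[] a = new java.util.Random(42).ints(1 << 20).toArray();
        int[] sorted = ForkJoinPool.commonPool().invoke(new ParallelMergesort(a, 0, a.length));
        System.out.println(sorted.length + " elements, sorted: "
                + java.util.stream.IntStream.range(1, sorted.length)
                      .allMatch(i -> sorted[i - 1] <= sorted[i]));
    }
}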

Shared Data

- Million dollar question: how to handle shared data?
- Managing shared data incurs overhead.
- Scalability can become an issue.
- Can lead to programmability issues.
- Avoid it if possible - by replication, privatization, reduction.
- Use appropriate concurrency control. Why? It should preserve the semantics, and it should not be too conservative.
- Shared data organization: distributed, or at a central location?
- A shared queue (remember master/worker?) is a type of shared data.
V.Krishna Nandivada (IIT Madras) CS6868 (IIT Madras) 67 / 46 V.Krishna Nandivada (IIT Madras) CS6868 (IIT Madras) 68 / 46
Issues with shared data

- Data race and interference: two activities access a shared datum, and at least one of the accesses is a write. The activities are said to interfere. For example:

forall (i : [1..n]) {
  sum += A[i];
}

- Dependencies: use synchronization (locks, barriers, atomics, ...) to enforce the dependencies. For example:

for (i : [1..n]) {
  forall (j = 1; j < m; ++j) {
    A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1])/3;
  }
}

- Task scheduling - tasks might be suspended while waiting for access to shared data. Minimize the wait.
- Deadlocks: two or more competing actions are each waiting for the other to finish. Example via nested locks: one task acquires lockA → lockB while another acquires lockB → lockA. One way to avoid this: define a partial order among the locks, and acquire locks only in an order that respects it (see the sketch below).
- Livelocks: the states of the processes involved constantly change with regard to one another, with none progressing. Example: recovery from deadlock - if more than one process takes action, the deadlock detection algorithm can be triggered repeatedly, leading to a livelock.
- Locality: trivial if data is not shared.
- Memory synchronization: needed when memory / caches are distributed. How to implement all-to-all synchronization?
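A Java sketch (mine) of the lock-ordering rule for avoiding the lockA → lockB / lockB → lockA cycle: both transfer directions acquire the two locks in one globally agreed order (here, by account id), so a cycle in the waits-for graph cannot arise.

public class LockOrdering {
    static class Account {
        final int id; long balance;
        Account(int id, long balance) { this.id = id; this.balance = balance; }
    }

    // Always lock the account with the smaller id first: a fixed order on the locks.
    static void transfer(Account from, Account to, long amount) {
        Account first  = from.id < to.id ? from : to;
        Account second = from.id < to.id ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Account a = new Account(1, 1000), b = new Account(2, 1000);
        Thread t1 = new Thread(() -> { for (int i = 0; i < 10000; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 10000; i++) transfer(b, a, 1); });
        t1.start(); t2.start(); t1.join(); t2.join();
        System.out.println(a.balance + " " + b.balance);   // 1000 1000, and no deadlock
    }
}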

Distributed Array

- Arrays are often partitioned between multiple tasks.
- Goal: efficient code, programmability.
- Distribute the arrays such that the elements needed by a task are "available" and "nearby".
- Array element redistribution?
- An abstraction is needed: a map from elements to places.
- Some standard distributions: blocked, cyclic, block-cyclic, unique (see the sketch below).
- Choosing a distribution.
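The "map from elements to places" can be written down directly. A tiny Java sketch (mine) of the blocked and cyclic distributions of N elements over P places:

public class Distributions {
    // Blocked: contiguous ranges of ceil(N/P) elements per place.
    static int blockedPlace(int i, int N, int P) {
        int blockSize = (N + P - 1) / P;
        return i / blockSize;
    }

    // Cyclic: element i goes to place i mod P (round-robin).
    static int cyclicPlace(int i, int P) {
        return i % P;
    }

    public static void main(String[] args) {
        int N = 16, P = 4;
        for (int i = 0; i < N; i++)
            System.out.printf("element %2d -> blocked place %d, cyclic place %d%n",
                              i, blockedPlace(i, N, P), cyclicPlace(i, P));
    }
}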
Implementation Mechanisms

UE management

- UE - unit of execution (a process / thread / activity).
- Difference between a process, a thread and an activity.
- Management = creation, execution, termination.
- Varies with different underlying languages.
- Go back to the first few lectures for a recap.

Synchronization: Memory synchronization and fences

Synchronization enforces constraints among parallel events.

// initially: done = false
Thread 1:  while (!done) ;     // spin until done becomes true
Thread 2:  done = true;

- The value may be present in a cache - cache coherence may take care of that.
- The value may be present in a register - the culprit here is the compiler.
- The value may not be (re-)read at all. How?

Initially x = y = 0.

Thread 1          Thread 2
1: r1 = x         4: x = 1
2: y = 1             r3 = y
3: r2 = x

Is r1 == r2 == r3 == 0 possible?
V.Krishna Nandivada (IIT Madras) CS6868 (IIT Madras) 75 / 46 V.Krishna Nandivada (IIT Madras) CS6868 (IIT Madras) 76 / 46
Synchronization: Memory synchronization and fences

- A memory fence guarantees that the UEs will see a consistent view of memory.
- Writes performed before the fence will be visible to reads performed after the fence.
- Reads performed after the fence will obtain a value written no earlier than the latest write before the fence.
- Only meaningful for shared memory.
- Explicit management can be error-prone. Higher-level alternatives: OpenMP flush, shared; Java volatile (see the sketch below). Read up yourselves.

Synchronization: Barriers

- A barrier is a synchronization point at which every member of a collection of UEs must arrive before any member can proceed.
- MPI_Barrier, join, finish, clocks, phasers.
- Implemented underneath via message passing.
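In Java terms, the spinning-flag example above is repaired by declaring the flag volatile, which inserts the necessary fences and forbids caching the value in a register; a minimal sketch (mine):

public class VolatileFlag {
    // Without 'volatile', the JIT may keep 'done' in a register and the reader can spin forever.
    static volatile boolean done = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!done) { /* spin */ }          // the volatile read sees the writer's update
            System.out.println("saw done == true");
        });
        reader.start();
        Thread.sleep(100);                         // let the reader start spinning
        done = true;                               // volatile write: visible to the reader
        reader.join();
    }
}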

Phasers

Power of phasers - pipeline parallelism

(Figure slides; thanks - Jun Shirako.)
Synchronization

- Memory fences.
- Barriers.
- Mutual exclusion: Java synchronized, omp_set_lock, omp_unset_lock.

Communication

- UEs need to exchange information.
- Shared memory - easy. The challenge is to synchronize the memory accesses so that the results are correct irrespective of the scheduling.
- Distributed memory - not much need for synchronization to protect the resources → communication plays a big role.
- One-to-one communication.
- Communication between all UEs in one event: collective communication.

Collective communication

When multiple UEs participate in a single communication event, the event is called a collective communication operation. Examples:

- Broadcast: a mechanism to send a single message to all UEs.
- Barrier: a synchronization point.
- Reduction: take a collection of objects, one from each UE, and "combine" them into a single value. Is the combined value present only on one UE, or on all UEs?
Serial reduction

- Reduction over n items takes n steps.
- Useful especially if the reduction operator is not associative.
- Only one UE knows the result.

Tree-based reduction

- Reduction over 2^n items takes n steps.
- What if the number of UEs < the number of data items?
- Only one UE knows the result.
- Requires the operator to be associative and commutative - or we don't care (example?).

Recursive doubling

- Reduction over 2^n items takes n steps.
- What if the number of UEs < the number of data items?
- All UEs know the result.
Sources

- Patterns for Parallel Programming: Mattson, Sanders, Massingill.
- multicoreinfo.com
- Wikipedia
- fixstars.com
- Jernej Barbic's slides.
- Loop Chunking in the presence of synchronization.
- Java Memory Model, JSR-133: "Java Memory Model and Thread Specification Revision".
