Concurrent Programming Tutorial-2
Dr A Sahu
Dept of Computer Science & Engineering
IIT Guwahati

• Synchronization primitives: TAS, TTAS, BTAS
• Examples of Parallel Programming
  – Shared memory: C/C++ Pthread, C++11 thread, OpenMP, Cilk
  – Distributed Memory: MPI
• Concurrent Objects
  – Concurrent Queue, List, Stack, Tree, Priority Queue, Hash, SkipList
• Use of Concurrent objects
// Simplified model of java.util.concurrent.atomic.AtomicBoolean
public class AtomicBoolean {
  boolean value;

  public synchronized boolean getAndSet(boolean newValue) {
    boolean prior = value;
    value = newValue;
    return prior;
  }
}

• Locking
  – Lock is free: value is false
  – Lock is taken: value is true
• Acquire lock by calling TAS
  – If result is false, you win
  – If result is true, you lose
• Release lock by writing false
Test-and-set Lock

class TASlock {
  AtomicBoolean state = new AtomicBoolean(false);

  void lock() {
    // keep trying until lock acquired
    while (state.getAndSet(true)) {}
  }

  void unlock() {
    state.set(false);
  }
}

class TTASlock {
  AtomicBoolean state = new AtomicBoolean(false);

  void lock() {
    while (true) {
      while (state.get()) {}        // wait until lock looks free
      if (!state.getAndSet(true))   // then try to acquire it
        return;
    }
  }

  void unlock() {
    state.set(false);
  }
}
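For completeness, a minimal C11 sketch of the same two locks is shown below; the type and function names (taslock_t, tas_lock, ttas_lock) and the use of <stdatomic.h> are illustrative assumptions, not part of the slides.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool state; } taslock_t;   /* initialize state to false */

/* TAS lock: keep trying until the exchange returns false (lock was free) */
static void tas_lock(taslock_t *l)   { while (atomic_exchange(&l->state, true)) { } }
static void tas_unlock(taslock_t *l) { atomic_store(&l->state, false); }

/* TTAS lock: spin on an ordinary read first, then try the atomic exchange */
static void ttas_lock(taslock_t *l) {
    for (;;) {
        while (atomic_load(&l->state)) { }        /* wait until lock looks free */
        if (!atomic_exchange(&l->state, true))    /* then try to acquire it */
            return;
    }
}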
• Shared memory
  – Pthread, C++11 thread
  – Java
  – OpenMP
  – Cilk
• Distributed Memory
  – MPI

• We need to identify parallelism
  – How to extract parallelism manually
  – Parallel Decomposition
• Code in threaded model (see the Pthread sketch below)
  – OS is responsible for running it efficiently
  – Less control over runtime
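As a rough illustration of the threaded model, here is a minimal Pthreads sketch; the worker function and the thread count are assumptions. The program only creates and joins threads, and the OS decides when and on which core each thread runs.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* each thread runs this function; the OS schedules the threads */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);   /* wait for all threads to finish */
    return 0;
}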
• Data Parallelism
• Function Parallelism
• Pipeline Parallelism
• Mixed Parallelism (D+F+P)

[Figure: data parallelism applies the same operation (Cap) to elements a–f in parallel, producing A–F; companion diagrams illustrate function and pipeline parallelism on a stream of values]
// version 1: private partial sums combined in a critical section
sum = 0;                                // sum is a shared variable
#pragma omp parallel private(lsum)
{
  lsum = 0;
  #pragma omp for
  for (i = 0; i < N; i++) {
    lsum = lsum + A[i];
  }
  #pragma omp critical
  { sum += lsum; }   // threads wait their turn; only one thread at a time
                     // executes the critical section
}

// version 2: the same sum with the reduction clause
sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++) {
  sum = sum + A[i];
}
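A complete, compilable version of the reduction variant might look like the sketch below; the array size N, the fill value, and the printf are illustrative assumptions. Compile with something like gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double A[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) A[i] = 1.0;   /* fill the array */

    /* each thread keeps a private copy of sum; OpenMP combines them at the end */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum = sum + A[i];

    printf("sum = %f (max threads = %d)\n", sum, omp_get_max_threads());
    return 0;
}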
OpenMP Schedule

• Can help OpenMP decide how to handle parallelism
  schedule(type [,chunk])
• Schedule Types
  ◦ Static – Iterations divided into chunks of size chunk, if specified, and statically assigned to threads
  ◦ Dynamic – Iterations divided into chunks of size chunk, if specified, and dynamically scheduled among threads

• Although the OpenMP standard does not specify how a loop should be partitioned, most compilers split the loop into N/p chunks by default (N = #iterations, p = #threads).
• This is called a static schedule (with chunk size N/p)
  – For example, suppose we have a loop with 1000 iterations and 4 OpenMP threads. The loop is partitioned as follows:
    Thread 0: 0–250, Thread 1: 250–500, Thread 2: 500–750, Thread 3: 750–1000
• A loop with 1000 iterations and 4 OpenMP threads, static schedule with chunk 10:

#pragma omp parallel for schedule(static, 10)
for (i = 0; i < 1000; i++)
  A[i] = B[i] + C[i];

[Figure: iterations 0, 10, 20, 30, 40, … 1000 handed out to the threads round-robin in chunks of 10]

• With static scheduling the number of iterations is evenly distributed among all OpenMP threads (i.e. every thread is assigned a similar number of iterations).
• This is not always the best way to partition. Why is this?

[Figure: per-thread execution time for A[i] = sqrt(B[i]+C[i])]
  Example: sqrt timing is data dependent… This is called load imbalance. In this case threads 2, 3, and 4 will be waiting very long for thread 1 to finish.
• With a dynamic schedule, new chunks are assigned to threads as they become available.
• SCHEDULE(DYNAMIC,n)
  – Loop iterations are divided into pieces of size chunk. When a thread finishes one chunk, it is dynamically assigned another.
• SCHEDULE(GUIDED,n)
  – Similar to DYNAMIC, but the chunk size is relative to the number of iterations left.
• Although dynamic scheduling might be the preferred choice to prevent load imbalance (see the sketch below),
  – in some situations there is a significant overhead involved compared to static scheduling.

• More examples: http://users.abo.fi/mats/PP2012/examples/OpenMP/
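For the sqrt loop from the earlier slide, a dynamic schedule could be requested as in the sketch below; the chunk size of 10 and the function name compute are assumptions for illustration.

#include <math.h>

#define N 1000
double A[N], B[N], C[N];

/* data-dependent work per iteration: chunks of 10 iterations are handed to
   whichever thread becomes free, which reduces load imbalance */
void compute(void) {
    int i;
    #pragma omp parallel for schedule(dynamic, 10)
    for (i = 0; i < N; i++)
        A[i] = sqrt(B[i] + C[i]);
}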
cilk int fib (int n) {     // "cilk" identifies a function as a Cilk procedure,
  if (n<2) return (n);     // capable of being spawned in parallel
  else {
    int x, y;
    x = spawn fib(n-1);    // the named child Cilk procedure can execute
    y = spawn fib(n-2);    // in parallel with the parent caller
    sync;                  // control cannot pass this point until all
    return (x+y);          // spawned children have returned
  }
}

Parallelizing Vector Addition

void vadd (real *A, real *B, int n) {   // plain C: serial vector addition
  int i; for (i=0; i<n; i++) A[i] += B[i];
}
Parallelizing Vector Addition

void vadd (real *A, real *B, int n) {        // C: divide and conquer
  if (n<=BASE) {
    int i; for (i=0; i<n; i++) A[i] += B[i];
  } else {
    vadd (A, B, n/2);
    vadd (A+n/2, B+n/2, n-n/2);
  }
}

cilk void vadd (real *A, real *B, int n) {   // Cilk: spawn the two halves in parallel
  if (n<=BASE) {
    int i; for (i=0; i<n; i++) A[i] += B[i];
  } else {
    spawn vadd (A, B, n/2);
    spawn vadd (A+n/2, B+n/2, n-n/2);
    sync;
  }
}
The cilkc source-to-source compiler

• cilk2c translates straight C code into identical C postsource; it encapsulates the process.
• Flow: Cilk source → cilk2c → C postsource → gcc → object code → ld (linking loader, with the Cilk RTS) → binary

$ cilkc fib.cilk -o fib
$ ./fib -proc 4 5        // run using 4 threads

• Message Passing Interface
• Distributed memory multiprocessor: cluster programming
• Scalable to a large number of processors
• Send(), Recv() constructs
• It uses processes, not threads
• Not part of this course
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <math.h>

int main(int argc, char *argv[]) {
  int myid, numprocs, tag, source, destination, count, buffer;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);   // obtain process
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);       // information at runtime

  tag = 1234;
  source = 0; destination = 1; count = 1;

  if (myid == source) {
    buffer = 5678;
    MPI_Send(&buffer, count, MPI_INT, destination, tag, MPI_COMM_WORLD);
    printf("processor %d sent %d\n", myid, buffer);
  }
  if (myid == destination) {
    MPI_Recv(&buffer, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
    printf("processor %d got %d\n", myid, buffer);
  }

  MPI_Finalize();
}
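Assuming an MPICH- or Open MPI-style toolchain and that the program above is saved as send_recv.c (a hypothetical file name), it could be built and launched on two processes roughly as follows:

$ mpicc send_recv.c -o send_recv
$ mpirun -np 2 ./send_recv      // 2 processes: rank 0 sends, rank 1 receives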
• Exploits sections that can be parallelized easily
  – CUDA GPU (very simplistic, completely SPMD)
  – OpenMP (a bit more complex SPMD, with control of reduction, critical and scheduling)
• How to handle the critical path efficiently
• Testing for a good driver
  – Driving 120 km/h on an express highway, or driving for 2 hours in Guwahati city in the evening, 4PM to 6PM, near Fancy Bazar
• How to speed up the overall application by trying to parallelize not just the easiest part
  – The data storing, adding, removing, retrieving, organizing..
  – This leads to the design of Concurrent Data Structures

• Classical Data Structure (simply DS)
  – Complexity of Add, Del, Search, Modify, Rearrange
• Concurrent Data Structure (CDS)
  – Many threads access the CDS
  – Consistency and serialization
  – Performance of locking
  – Properties should hold
  – No livelock, deadlock, starvation
• Wait free and lock free CDS
• Java, C++11 and C# (Microsoft VC++)
  – Java thread safe/concurrent collections
    • /usr/share/javadoc/java-1.6.0-openjdk/api/java/util/concurrent/
  – C++11: through the Boost library or libcds
  – C# / Microsoft VC++: the Parallel Patterns Library (PPL)
    • concurrent_vector_class, concurrent_queue_class
• All have competitive memory models
• Use Java
  – Still I think Java is better for CDS..
  – Shavit, author of the AMP book, used Java

• List, Queue, Stack, Hash, Priority queue, skiplist
• Discussion on the internal implementation of some of these: list, queue, stack, priority queue, hash..
• Use these CDS to make parallel applications faster
• Ensure no thread can see a state where the invariants of the DS have been broken by the actions of another thread.
• Take care to avoid race conditions inherent in the interface to the DS by providing functions for complete operations rather than for operation steps. (Race condition: unexpected output.)
• Pay attention to how the DS behaves in the presence of exceptions, to ensure that the invariants are not broken.
• Minimize the opportunities for deadlock when using the DS by restricting the scope of locks and avoiding nested locks where possible.

• Can the scope of locks be restricted to allow some parts of an operation to be performed outside the lock?
• Can different parts of the DS be protected by different locks? (See the per-bucket locking sketch after this list.)
• Do all operations require the same level of protection?
• Can a simple change to the DS improve the opportunities for concurrency without affecting the operational semantics?
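To make the "different locks for different parts of the DS" question concrete, here is a minimal Pthreads sketch of a hash table that keeps one mutex per bucket instead of one global lock; all names (NBUCKETS, hash_init, hash_insert) and the fixed table size are illustrative assumptions.

#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 64

struct node { int key; struct node *next; };

/* one mutex per bucket: inserts into different buckets do not block each other */
static struct node    *bucket[NBUCKETS];
static pthread_mutex_t bucket_lock[NBUCKETS];

static void hash_init(void) {
    for (int i = 0; i < NBUCKETS; i++)
        pthread_mutex_init(&bucket_lock[i], NULL);
}

static void hash_insert(int key) {
    int b = (unsigned)key % NBUCKETS;
    struct node *n = malloc(sizeof *n);
    n->key = key;
    pthread_mutex_lock(&bucket_lock[b]);   /* lock only this bucket */
    n->next   = bucket[b];
    bucket[b] = n;
    pthread_mutex_unlock(&bucket_lock[b]);
}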
• More than one thread must be able to access the DS concurrently
  – Example: a queue might allow one thread to push and another to pop, but not two threads doing the same push/pop.
• If one thread accessing the DS gets suspended by the OS midway, other threads should be able to access it without waiting for the suspended thread
• Lock free may not be wait free: it may still have starvation (see the CAS sketch after this list)
  – A 1TS 2T, 1TS 2T, ….. sequence leads to starvation under a RR scheduler

• Wait free: a lock free DS with the property that every thread accessing the DS can complete its operation within a bounded number of steps, regardless of the behavior of other threads
• Coding a wait free DS correctly is extremely hard
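As a small illustration of the lock free vs. wait free distinction, below is a C11 sketch of a lock-free increment built on a compare-and-swap loop; the counter and the function name are assumptions, not from the slides.

#include <stdatomic.h>

static atomic_int counter;

/* Lock-free but not wait-free: the system as a whole always makes progress
   (some thread's CAS succeeds), yet an unlucky thread can keep losing the
   race and retry without bound; this is the starvation mentioned above. */
void lockfree_increment(void) {
    int old = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
        /* on failure, old is reloaded with the current value; retry */
    }
}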
[Figure: threads spin, then enter the critical section (CS) one at a time]

• Each method locks the object
  – Easy to reason about
  – In simple cases
• So, are we done?
  – NOPE
• Sequential bottleneck
• Adding more threads
  – Does not improve throughput
  – Struggle to keep it from getting worse
• So why even use a multiprocessor?
  – Well, some apps are inherently parallel …