Concurrent Programming Tutorial-2

The document discusses various topics related to parallel programming including synchronization primitives, examples of parallel programming using shared and distributed memory, and the use of concurrent objects. It also covers different types of parallelism such as data, function, and pipeline parallelism. Finally, it discusses parallelizing code using compiler directives like OpenMP and handling shared variables in parallel code.


9/23/2014

Dr A Sahu
Dept of Computer Science & Engineering
IIT Guwahati

• Synchronization primitives: TAS, TTAS, BTAS
• Examples of Parallel Programming
  – Shared memory: C/C++ Pthread, C++11 thread, OpenMP, Cilk
  – Distributed memory: MPI
• Concurrent Objects
  – Concurrent Queue, List, Stack, Tree, Priority Queue, Hash, SkipList
• Use of Concurrent Objects

import java.util.concurrent.atomic.*;

public class AtomicBoolean {
  boolean value;

  public synchronized boolean getAndSet(boolean newValue) {
    boolean prior = value;   // swap old and new values
    value = newValue;
    return prior;
  }
}

• Locking
  – Lock is free: value is false
  – Lock is taken: value is true
• Acquire the lock by calling TAS
  – If the result is false, you win
  – If the result is true, you lose
• Release the lock by writing false

Test-and-Set Lock

class TASlock {
  AtomicBoolean state =
    new AtomicBoolean(false);

  void lock() {
    // keep trying until lock acquired
    while (state.getAndSet(true)) {}
  }

  void unlock() {
    state.set(false);
  }
}

Test-and-Test-and-Set Lock

class TTASlock {
  AtomicBoolean state =
    new AtomicBoolean(false);

  void lock() {
    while (true) {
      // wait until lock looks free
      while (state.get()) {}
      // then try to acquire it
      if (!state.getAndSet(true))
        return;
    }
  }
}
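Since the outline also lists C++11 threads, here is a minimal sketch of the same TTAS idea using std::atomic<bool>. This is my own addition, not from the slides; the class name is made up.

// Hypothetical C++11 equivalent of the TTAS lock above (illustration only).
#include <atomic>

class TTASLockCpp {
  std::atomic<bool> state{false};           // false = free, true = taken
public:
  void lock() {
    while (true) {
      while (state.load()) {}               // spin: wait until the lock looks free
      if (!state.exchange(true)) return;    // then try to acquire it (test-and-set)
    }
  }
  void unlock() { state.store(false); }     // release by writing false
};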


• Shared memory
  – Pthread, C++11 thread
  – Java
  – OpenMP
  – Cilk
• Distributed memory
  – MPI

• We need to identify parallelism
  – How to extract parallelism manually
  – Parallel decomposition
• Code in a threaded model (see the sketch below)
• The OS is responsible for running it efficiently
  – Less control over the runtime
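As an illustration of manually extracting parallelism with C++11 threads, here is a small sketch of my own (not taken from the slides; the function name, array names, and slicing scheme are assumptions): each thread is handed an independent slice of the loop.

// Hypothetical example: split a[i] = b[i] + c[i] across nthreads threads by hand.
#include <thread>
#include <vector>

void vector_add(float* a, const float* b, const float* c, int n, int nthreads) {
  std::vector<std::thread> workers;
  for (int t = 0; t < nthreads; ++t) {
    int start = t * n / nthreads;            // this thread's slice of the loop
    int end   = (t + 1) * n / nthreads;
    workers.emplace_back([=] {
      for (int i = start; i < end; ++i) a[i] = b[i] + c[i];
    });
  }
  for (auto& w : workers) w.join();          // wait for all slices to finish
}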

• Data Parallelism
• Function Parallelism
• Pipeline Parallelism
• Mixed Parallelism (D+F+P)

[Figure: data parallelism – the same "Cap" (capitalize) task is applied to each element of the input a b c d e f independently, producing A B C D E F. Function parallelism – different functions (AVG, MINIMUM, BINARY OR, GEO-MEAN) run in parallel on the same input 1 4 9 12 6 14 3, producing 7, 1, 15 and 5.243. Pipeline parallelism – each input element flows through a pipeline of stages (SPIN, CAP, Shade).]
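One common way to express function parallelism like the AVG/MINIMUM/BINARY OR example above is OpenMP sections. This is a small sketch of my own, not from the slides; the function name and output parameters are assumptions (the geometric mean is omitted for brevity).

// Hypothetical sketch: run independent reductions on the same data in parallel.
#include <omp.h>

void analyse(const int* x, int n, double* avg, int* min, int* bin_or) {
  #pragma omp parallel sections
  {
    #pragma omp section
    { double s = 0; for (int i = 0; i < n; ++i) s += x[i]; *avg = s / n; }

    #pragma omp section
    { int m = x[0]; for (int i = 1; i < n; ++i) if (x[i] < m) m = x[i]; *min = m; }

    #pragma omp section
    { int o = 0; for (int i = 0; i < n; ++i) o |= x[i]; *bin_or = o; }
  }
}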


OpenMP
• Compiler-directive based, (semi-)automatic parallelization
• Automatically generates threads and synchronizes them

#include <omp.h>
int main() {
  #pragma omp parallel
  {
    #pragma omp for schedule(static)
    for (int i=0; i<N; i++) {
      a[i]=b[i]+c[i];
    }
  }
}

$ gcc -fopenmp test.c
$ export OMP_NUM_THREADS=4
$ ./a.out

Three versions of the same loop:

Sequential code:
  for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }

(Semi-)manual parallelization:
  #pragma omp parallel
  {
    int id   = omp_get_thread_num();
    int Nthr = omp_get_num_threads();
    int istart = id*N/Nthr, iend = (id+1)*N/Nthr;
    for (int i=istart; i<iend; i++) { a[i]=b[i]+c[i]; }
  }

Automatic parallelization of the for loop:
  #pragma omp parallel
  #pragma omp for schedule(static)
  for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }

#pragma omp parallel
#pragma omp for
for (i=1; i<13; i++)
  c[i] = a[i] + b[i];

• Threads are assigned an independent set of iterations
  – e.g. with 3 threads: one thread gets i=1..4, another i=5..8, another i=9..12
• Threads must wait at the end of the work-sharing construct (implicit barrier)

Serial and parallel regions alternate:

printf("program begin\n");            // serial
N = 1000;
#pragma omp parallel for
for (i=0; i<N; i++)                   // parallel
  A[i] = B[i] + C[i];
M = 500;                              // serial
#pragma omp parallel for
for (j=0; j<M; j++)                   // parallel
  p[j] = q[j] - r[j];
printf("program done\n");             // serial

Manual reduction with a private copy and a critical section:

sum = 0;
#pragma omp parallel private(lsum)
{
  lsum = 0;
  #pragma omp for
  for (i=0; i<N; i++) {
    lsum = lsum + A[i];
  }
  #pragma omp critical
  { sum += lsum; }     // threads wait their turn; only one thread at a time
}                      // executes the critical section

The same computation with a reduction clause (sum is the shared variable):

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i=0; i<N; i++) {
  sum = sum + A[i];
}


OpenMP Schedule
• The schedule clause can help OpenMP decide how to handle parallelism:
  schedule(type [,chunk])
• Schedule types
  – Static: iterations are divided into chunks of size chunk (if specified) and statically assigned to threads
  – Dynamic: iterations are divided into chunks of size chunk (if specified) and dynamically scheduled among threads

• Although the OpenMP standard does not specify how a loop should be partitioned, most compilers split the loop into N/p chunks by default (N = #iterations, p = #threads).
• This is called a static schedule (with chunk size N/p).
  – For example, suppose we have a loop with 1000 iterations and 4 OpenMP threads. The loop is partitioned as: thread 0 gets iterations 0–249, thread 1 gets 250–499, thread 2 gets 500–749, thread 3 gets 750–999.

• A loop with 1000 iterations and 4 OpenMP threads, static schedule with chunk size 10:

#pragma omp parallel for schedule(static, 10)
for (i=0; i<1000; i++)
  A[i] = B[i] + C[i];

[Figure: iterations 0, 10, 20, 30, 40, ... 1000 are dealt out to the threads in round-robin chunks of 10; the per-thread timeline shows unequal execution times.]

• With static scheduling the number of iterations is evenly distributed among all OpenMP threads (i.e. every thread is assigned a similar number of iterations).
• This is not always the best way to partition. Why is this?
  – Example: A[i] = sqrt(B[i] + C[i]); — the sqrt timing is data dependent.
  – This is called load imbalance: the other threads may wait a very long time for the slowest thread to finish.

• With a dynamic schedule, new chunks are assigned to threads as they become available.
• SCHEDULE(DYNAMIC, n)
  – Loop iterations are divided into pieces of size chunk. When a thread finishes one chunk, it is dynamically assigned another.
• SCHEDULE(GUIDED, n)
  – Similar to DYNAMIC, but the chunk size is relative to the number of iterations left.
• Although dynamic scheduling might be the preferred choice to prevent load imbalance, in some situations it involves a significant overhead compared to static scheduling.
• More examples: http://users.abo.fi/mats/PP2012/examples/OpenMP/
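To make the trade-off concrete, here is a small sketch of my own (not from the slides) of the data-dependent sqrt loop from above using a dynamic schedule; the chunk size of 10 and the function name are assumptions.

// Hypothetical sketch: idle threads grab the next chunk of 10 iterations, so
// expensive (data-dependent) iterations do not leave other threads waiting as long.
#include <cmath>
#include <omp.h>

void compute(double* A, const double* B, const double* C, int n) {
  #pragma omp parallel for schedule(dynamic, 10)
  for (int i = 0; i < n; ++i)
    A[i] = std::sqrt(B[i] + C[i]);
}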


Cilk
• Developed by Leiserson at CSAIL, MIT
• Adds 6 keywords to standard C
  – Easy to install on a Linux system: needs only gcc and pthreads
• Biggest principle
  – The programmer is responsible for exposing the parallelism, identifying elements that can safely be executed in parallel
  – It is the work of the run-time environment (scheduler) to decide during execution how to actually divide the work between processors
• Work-stealing scheduler
  – Proven to be a good scheduler
  – Now also in GCC and Intel CC (Intel acquired Cilk++)

C elision:

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = fib(n-1);
    y = fib(n-2);
    return (x+y);
  }
}

Cilk code:

cilk int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x+y);
  }
}

Cilk is a faithful extension of C. A Cilk program's serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.

Cilk keywords in fib:
• cilk – identifies a function as a Cilk procedure, capable of being spawned in parallel.
• spawn – the named child Cilk procedure can execute in parallel with the parent caller.
• sync – control cannot pass this point until all spawned children have returned.

Parallelizing Vector Addition

C code:

void vadd (real *A, real *B, int n){
  int i; for (i=0; i<n; i++) A[i]+=B[i];
}

Step 1 – convert the loop to recursion (still plain C):

void vadd (real *A, real *B, int n){
  if (n<=BASE) {
    int i; for (i=0; i<n; i++) A[i]+=B[i];
  } else {
    vadd (A, B, n/2);
    vadd (A+n/2, B+n/2, n-n/2);
  }
}

Step 2 – insert Cilk keywords:

cilk void vadd (real *A, real *B, int n){
  if (n<=BASE) {
    int i; for (i=0; i<n; i++) A[i]+=B[i];
  } else {
    spawn vadd (A, B, n/2);
    spawn vadd (A+n/2, B+n/2, n-n/2);
    sync;
  }
}

Parallelization strategy:
1. Convert loops to recursion.
2. Insert Cilk keywords.
Side benefit: divide and conquer is generally good for caches!


The cilkc source-to-source translator
• cilk2c encapsulates the compilation process: it translates Cilk source into C post-source (straight C code comes out as identical C post-source), gcc compiles that to object code, and the linking loader (ld) links it against the Cilk runtime system (RTS) to produce the binary.

$ cilkc fib.cilk -o fib
$ ./fib -proc 4 5       // run using 4 threads; the number of processors is given at runtime

MPI
• Message Passing Interface
• Distributed-memory multiprocessor (cluster) programming
• Scalable to a large number of processors
• Send()/Recv() constructs
• It uses processes, not threads
• Not part of this course

MPI example: send and receive

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <math.h>

int main(int argc, char *argv[]) {
  int myid, numprocs, tag, source, destination, count, buffer;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  tag = 1234;
  source = 0;  destination = 1;  count = 1;

  if (myid == source) {
    buffer = 5678;
    MPI_Send(&buffer, count, MPI_INT,
             destination, tag, MPI_COMM_WORLD);
    printf("processor %d sent %d\n", myid, buffer);
  }
  if (myid == destination) {
    MPI_Recv(&buffer, count, MPI_INT,
             source, tag, MPI_COMM_WORLD, &status);
    printf("processor %d got %d\n", myid, buffer);
  }

  MPI_Finalize();
}

• The easy approaches exploit sections that can be parallelized easily
  – CUDA GPU (very simplistic, completely SPMD)
  – OpenMP (a slightly more complex SPMD, with control of reduction, critical sections and scheduling)
• The harder question is how to handle the critical path efficiently
  – Testing for a good driver: not driving at 120 km/h on an express highway, but driving for 2 hours in Guwahati city between 4 PM and 6 PM near Fancy Bazar.
• To speed up the whole application, we must try to parallelize the parts that are not the easiest
  – The data storing, adding, removing, retrieving, organizing..
  – This leads to the design of Concurrent Data Structures

• Classical data structure (simply DS)
  – Complexity of Add, Delete, Search, Modify, Rearrange
• Concurrent Data Structure (CDS)
  – Many threads access the CDS
  – Consistency and serialization
  – Performance of locking
  – Properties should hold
  – No livelock, deadlock, or starvation
• Wait-free and lock-free CDS


• Java, C++11 and C# (Microsoft VC++)
  – Java thread-safe/concurrent collections:
    /usr/share/javadoc/java-1.6.0-openjdk/api/java/util/concurrent/
  – C++11: through the Boost library or Libcds
  – C#: The Parallel Patterns Library (PPL)
    • concurrent_vector_class, Concurrent_queue_class
  – All have competitive memory models
• Use Java
  – Still, I think Java is better for CDS..
  – Shavit, author of the AMP book, used Java
• List, Queue, Stack, Hash, Priority queue, SkipList
  – Discussion on the internal implementation of some of these: list, queue, stack, priority queue, hash..
  – Use these CDS to make parallel applications faster

• Ensure no thread can see a state where the invariants of the DS have been broken by the actions of another thread.
• Take care to avoid race conditions inherent in the interface to the DS by providing functions for complete operations rather than for operation steps. A race condition here means unexpected output. (A minimal sketch follows this list.)
• Pay attention to how the DS behaves in the presence of exceptions to ensure that the invariants are not broken.
• Minimize the opportunities for deadlock when using the DS by restricting the scope of locks and avoiding nested locks where possible.

• Can the scope of locks be restricted to allow some parts of an operation to be performed outside the lock?
• Can different parts of the DS be protected by different locks?
• Do all operations require the same level of protection?
• Can a simple change to the DS improve the opportunities for concurrency without affecting the operational semantics?
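Here is a minimal sketch of these guidelines (my own illustration, not from the slides): a coarse-grained locked queue whose interface exposes a single try_pop operation instead of separate empty()/front()/pop() steps, so the check and the removal happen under one lock. The class and method names are assumptions.

// Hypothetical coarse-grained concurrent queue: one mutex protects the whole object.
#include <mutex>
#include <queue>

template <typename T>
class LockedQueue {
  std::queue<T> q;
  std::mutex m;
public:
  void push(T v) {
    std::lock_guard<std::mutex> g(m);
    q.push(std::move(v));
  }
  // Complete operation: the emptiness check and the removal happen under one lock,
  // so no other thread can interleave between them (no racy empty()/front()/pop()).
  bool try_pop(T& out) {
    std::lock_guard<std::mutex> g(m);
    if (q.empty()) return false;
    out = std::move(q.front());
    q.pop();
    return true;
  }
};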

Lock-free
• More than one thread must be able to access the DS concurrently
  – Example: a queue might allow one thread to push while another pops, but not two threads doing the same push/pop.
• If one thread accessing the DS gets suspended midway by the OS, other threads should be able to access it without waiting for the suspended thread.
• Lock-free may not be wait-free: it may still allow starvation
  – A 1TS, 2T, 1TS, 2T, ... sequence can lead to starvation under a round-robin scheduler.

Wait-free
• A lock-free DS with the property that every thread accessing the DS can complete its operation within a bounded number of steps, regardless of the behavior of other threads. (A sketch of a lock-free, but not wait-free, stack follows.)
• Coding a wait-free DS correctly is extremely hard.
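To make the lock-free idea concrete, here is a sketch of my own (not from the slides) of a Treiber-style lock-free stack using compare-and-swap. It is lock-free but not wait-free, since a push can in principle retry forever; for simplicity it deliberately ignores memory reclamation and the ABA problem.

// Hypothetical Treiber-style lock-free stack (illustration only: popped nodes are
// not freed here, and the ABA problem is ignored for simplicity).
#include <atomic>

template <typename T>
class LockFreeStack {
  struct Node { T value; Node* next; };
  std::atomic<Node*> head{nullptr};
public:
  void push(T v) {
    Node* n = new Node{std::move(v), head.load()};
    // Retry the CAS until our node is spliced in; some thread always makes
    // progress, but this thread has no bound on its own retries.
    while (!head.compare_exchange_weak(n->next, n)) {}
  }
  bool pop(T& out) {
    Node* n = head.load();
    while (n && !head.compare_exchange_weak(n, n->next)) {}
    if (!n) return false;
    out = std::move(n->value);     // node is intentionally leaked, see note above
    return true;
  }
  ~LockFreeStack() {               // single-threaded cleanup of remaining nodes
    Node* n = head.load();
    while (n) { Node* next = n->next; delete n; n = next; }
  }
};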


Coarse-grained locking
• Each method locks the object
  – Easy to reason about
  – Fine in simple cases
  [Figure: each thread spins on the lock, enters the critical section (CS), and resets the lock upon exit.]
• So, are we done? NOPE
• Sequential bottleneck
  – Threads "stand in line"
• Adding more threads
  – Does not improve throughput
  – We struggle to keep it from getting worse
• So why even use a multiprocessor?
  – Well, some apps are inherently parallel ...
• Avoid contention using queue locks (a sketch follows)
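As a hedged illustration of the "stand in line" idea (my own sketch, not from the slides): a ticket lock makes threads acquire the lock in FIFO order, with each thread spinning until its ticket is served. True queue locks such as Anderson, CLH, or MCS go further and let each thread spin on its own location to reduce contention; this simpler relative only shows the ordering idea.

// Hypothetical ticket lock: threads take a ticket and wait until it is served,
// so they acquire the lock in FIFO order ("standing in line").
#include <atomic>

class TicketLock {
  std::atomic<unsigned> next_ticket{0};
  std::atomic<unsigned> now_serving{0};
public:
  void lock() {
    unsigned my = next_ticket.fetch_add(1);   // take a ticket
    while (now_serving.load() != my) {}       // spin until it is our turn
  }
  void unlock() {
    now_serving.fetch_add(1);                 // serve the next ticket
  }
};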
