Slides
David Head
University of Leeds
McCool et al., Structured Parallel Programming (Morgan Kaufmann, 2012).
Vector addition
a = ( a_1, a_2, a_3, ..., a_n )
b = ( b_1, b_2, b_3, ..., b_n )
c = a + b = ( a_1 + b_1, a_2 + b_2, a_3 + b_3, ..., a_n + b_n )

Or: c_i = a_i + b_i,  i = 1 ... n.
#define n 100

int main()
{
    float a[n], b[n], c[n];

    ...                       // Initialise a[n] and b[n]

    int i;
    for( i=0; i<n; i++ )
        c[i] = a[i] + b[i];

    return 0;
}
Note that indices start at 0 in most languages, but at 1 in the usual mathematical notation (and in Fortran and MATLAB).

The loop is parallelised by placing the OpenMP directive #pragma omp parallel for immediately before it. This only parallelises this one loop, not any later ones!
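A minimal sketch of the vector addition with the directive applied (the initial values and the printf are only there to make the example self-contained; compile with OpenMP enabled, e.g. gcc -fopenmp):

#include <stdio.h>

#define n 100

int main()
{
    float a[n], b[n], c[n];

    // Initialise a and b with some arbitrary values.
    for( int i=0; i<n; i++ ) { a[i] = i; b[i] = 2*i; }

    // Split the iterations of the next loop across all available threads;
    // the loop counter is private to each thread.
    #pragma omp parallel for
    for( int i=0; i<n; i++ )
        c[i] = a[i] + b[i];

    printf( "c[%d] = %g\n", n-1, c[n-1] );

    return 0;
}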
Fork-and-join
When execution reaches #pragma omp parallel for, multiple threads are spawned.
Each thread computes part of the loop.
The extra threads are destroyed at the end of the loop.
This is known as a fork-join construct:
[Diagram: fork-join. The main thread runs serially until it reaches #pragma omp parallel for, where it forks, spawning worker threads; the main and worker threads then perform the loop together, and the workers are joined at the end of the loop.]
Worker thread 1:

// CREATED BY MAIN ('fork')
// Perform second 1/4 of loop.
for( i=n/4; i<n/2; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')

Worker thread 2:

// CREATED BY MAIN ('fork')
// Perform third 1/4 of loop.
for( i=n/2; i<3*n/4; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')

Worker thread 3:

// CREATED BY MAIN ('fork')
// Perform final 1/4 of loop.
for( i=3*n/4; i<n; i++ ) c[i] = a[i] + b[i];
// FINISH ('join')
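The main thread's own share is not listed above; by analogy with the quarters assigned to the three workers (an assumption based on the ranges shown), it performs the first 1/4 of the loop and then waits for the workers at the join:

Main thread:

// Issues the 'fork', then takes part in the loop itself.
// Perform first 1/4 of loop.
for( i=0; i<n/4; i++ ) c[i] = a[i] + b[i];
// Wait for all workers to finish ('join').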
Notes
The four threads are not being executed one after the other:
Each thread runs concurrently, hopefully on separate cores,
i.e., in parallel.
Cannot be understood in terms of serial programming
concepts.
The total loop range was evenly divided between all threads.
This happens as soon as #pragma omp parallel for is reached.
The trip count (i.e. loop range) must be known at the start
of the loop.
The start, end and stride must be constant.
Cannot break from the loop.
Cannot be applied to 'while' or 'do ... while' loops.
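To illustrate these restrictions, here is a sketch (reusing the vector-addition arrays from earlier; not taken from the slides): the first loop has the required form, whereas the second cannot be parallelised because it may break out early, so its trip count is not known at the start.

// Allowed: start, end and stride are fixed when the loop begins.
#pragma omp parallel for
for( int i=0; i<n; i++ )
    c[i] = a[i] + b[i];

// Not valid for #pragma omp parallel for: the loop can exit early,
// so the number of iterations is not known in advance.
for( int i=0; i<n; i++ )
{
    if( c[i] < 0.0f ) break;   // early exit prevents parallelisation
    c[i] = a[i] + b[i];
}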
Code snippet
Represented more concisely as complex numbers c and z [with e.g. z_x = ℜ(z)], the iteration is just z → z^2 + c.
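The per-pixel code itself is not reproduced here; the following is only a sketch of what such a routine might look like using C99 complex arithmetic. The image size, the iteration cut-off maxIters, and the mapping from pixel (i,j) to the constant c are all assumptions, not values from the slides.

#include <complex.h>

#define numPixels_x 1024   // assumed image size
#define numPixels_y 1024
#define maxIters    1000   // assumed iteration cut-off

// Sketch of a per-pixel routine implementing the iteration z -> z^2 + c.
void setPixelColour( int i, int j )
{
    float complex c = ( -2.0f + 3.0f*i/numPixels_x )
                    + ( -1.5f + 3.0f*j/numPixels_y )*I;   // assumed mapping
    float complex z = 0.0f;
    int iter = 0;

    while( cabsf(z) < 2.0f && iter < maxIters )
    {
        z = z*z + c;       // the iteration z -> z^2 + c
        iter++;
    }

    // Colour pixel (i,j) according to iter (details omitted).
}

Because the number of iterations varies from pixel to pixel, different pixels require different amounts of work, which is what leads to the load-balancing issues discussed below.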
This (parallelising the inner i-loop, with the directive inside the j-loop) works, but is not much faster than serial – and may even be slower (check on your system).
There are several possible reasons for this:
The fork-join is inside the j-loop, so threads are created and destroyed numPixels_y times, which incurs an overhead.
This problem suffers from poor load balancing; see later.
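For concreteness, a sketch of the arrangement being described (the directive on the inner loop; this is inferred from the points above, not copied from the slides):

int i, j;
for( j=0; j<numPixels_y; j++ )
{
    // A fork and a join occur here, once for every row of pixels.
    #pragma omp parallel for
    for( i=0; i<numPixels_x; i++ )
        setPixelColour( i, j );
}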
Parallelise only the outer loop, so there is only a single fork event
and a single join event.
int i, j;
#pragma omp parallel for
for( j=0; j<numPixels_y; j++ )
    for( i=0; i<numPixels_x; i++ )
    {
        setPixelColour( i, j );
    }
The same variable i is used as the inner loop counter by every thread, i.e. it is shared:
When one thread completes a calculation, it increments i.
Therefore other threads will skip at least one pixel.
Threads do not calculate the full line of pixels.
One possible fix is sketched below.
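A sketch of such a fix (the slides' actual correction is not shown here): declare the loop counters inside the loops, so that the inner counter is local to each thread.

#pragma omp parallel for
for( int j=0; j<numPixels_y; j++ )
{
    for( int i=0; i<numPixels_x; i++ )   // i is now private to each thread
    {
        setPixelColour( i, j );
    }
}

An equivalent alternative is to keep the original declarations and add a private(i) clause to the directive.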
The code now works, but is still not much faster than serial.
The primary overhead is poor load balancing. We will look at this briefly next lecture, and in detail in Lecture 13.
Notice that the incorrect images were slightly different each time:
[Two example incorrect output images.]
Our serial code was deterministic, i.e. produced the same results
each time it was run.
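The same nondeterminism can be reproduced with a much smaller example (a sketch, not from the slides): a shared counter incremented by all threads without any synchronisation usually ends with a different, incorrect value on each run.

#include <stdio.h>

#define N 1000000

int main()
{
    int count = 0;

    // Every thread increments the shared variable count with no
    // synchronisation: a data race, so updates are lost.
    #pragma omp parallel for
    for( int i=0; i<N; i++ )
        count++;

    printf( "count = %d (expected %d)\n", count, N );

    return 0;
}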