The document discusses multi-threaded vector addition and parallel computation concepts, focusing on data parallel problems and the use of OpenMP for parallelizing loops. It provides examples of vector addition in both serial and parallel forms, explaining the fork-join model and the implications of thread management. Additionally, it covers the parallelization of nested loops using the Mandelbrot set as a case study, highlighting common pitfalls and solutions in achieving correct and efficient parallel execution.

COMP3221 Parallel Computation
Lecture 3: Data parallel problems
David Head, University of Leeds

Overview
  Multi-threaded vector addition
  Nested loops in parallel
  Summary and next lecture

David Head COMP3221 Parallel Computation



Previous lectures

In the last lecture we started looking at shared memory parallelism (SMP):
  Relevant to multi-core CPUs.
  Separate processing units (cores) share some levels of memory cache.
  Various frameworks exist for programming SMP systems.
  Widely-implemented standard: OpenMP.


Today’s lecture

Today we are going to look at some actual problems.

  Examples of data parallel problems, where the same operation is applied to multiple data elements.
  Also known as a map¹.
  The multi-threading solution employs a fork-join pattern.
  How to parallelise nested loops.
  Parallel code can be non-deterministic, even when the serial code is deterministic.

¹ McCool et al., Structured Parallel Programming (Morgan Kaufmann, 2012).


Vector addition

An n-vector a can be thought of as an array of n numbers:

    a = (a₁, a₂, …, aₙ).

If two vectors a and b are the same size, they can be added to generate a new n-vector c:

    a = ( a₁ , a₂ , a₃ , … , aₙ )
        +    +    +        +
    b = ( b₁ , b₂ , b₃ , … , bₙ )
        ↓    ↓    ↓        ↓
    c = ( c₁ , c₂ , c₃ , … , cₙ )

Or:
    cᵢ = aᵢ + bᵢ ,   i = 1 … n.


Serial vector addition

Code on Minerva: vectorAddition_serial.c

    #define n 100

    int main()
    {
        float a[n], b[n], c[n];

        ...   // Initialise a[n] and b[n]

        int i;
        for( i=0; i<n; i++ )
            c[i] = a[i] + b[i];

        return 0;
    }

Note that indices start at 0 in most languages, but at 1 in the usual mathematical notation (and in Fortran and MATLAB).


Vector addition in parallel

Code on Minerva: vectorAddition_parallel.c

Add #pragma omp parallel for just before the loop:

    #define n 100

    int main()
    {
        float a[n], b[n], c[n];

        ...   // Initialise a[n] and b[n]

        int i;
        #pragma omp parallel for
        for( i=0; i<n; i++ )
            c[i] = a[i] + b[i];

        return 0;
    }

This only parallelises this one loop, not any later ones!

Fork-and-join

When the executable reaches #pragma omp parallel for, it spawns multiple threads.
  Each thread computes part of the loop.
  The extra threads are destroyed at the end of the loop.

This is known as a fork-join construct:

    Main thread (serial)
          |
    Fork: the main thread spawns worker threads
          |
    #pragma omp parallel for
    for()       Main and worker
    {           threads perform
      ...       the loop
    }
          |
    Join: the main thread waits until the workers finish
          |
    Main thread (serial)

Example: Four threads in total

Pseudocode for the main thread:

    // Main thread starts in serial.
    // Initialise arrays a, b; allocate c.
    ...
    // REACHES #pragma omp parallel for
    // FORK: Create three new threads.
    worker1 = fork(...);
    worker2 = fork(...);
    worker3 = fork(...);

    // Perform the first 1/4 of the total loop.
    for( i=0; i<n/4; i++ )
        c[i] = a[i] + b[i];

    // JOIN: Wait for the other threads to finish.
    worker1.join();
    worker2.join();
    worker3.join();

    // Continue in serial after the loop.

Worker thread 1:

    // CREATED BY MAIN ('fork')
    // Perform the second 1/4 of the loop.
    for( i=n/4; i<n/2; i++ ) c[i] = a[i] + b[i];
    // FINISH ('join')

Worker thread 2:

    // CREATED BY MAIN ('fork')
    // Perform the third 1/4 of the loop.
    for( i=n/2; i<3*n/4; i++ ) c[i] = a[i] + b[i];
    // FINISH ('join')

Worker thread 3:

    // CREATED BY MAIN ('fork')
    // Perform the final 1/4 of the loop.
    for( i=3*n/4; i<n; i++ ) c[i] = a[i] + b[i];
    // FINISH ('join')


Notes

The four threads are not executed one after the other:
  Each thread runs concurrently, hopefully on separate cores, i.e. in parallel.
  This cannot be understood in terms of serial programming concepts.

Each thread performs the same operations on different data.
  This would be SIMD in Flynn's taxonomy, except that here it is implemented in software on a MIMD device.

We have assumed n is divisible by the number of threads for clarity.
  Generalising to arbitrary n is not difficult, but obscures the parallel aspects.


#pragma omp parallel for

The total loop range is divided evenly between all threads.
  This happens as soon as #pragma omp parallel for is reached.
  The trip count (i.e. the loop range) must be known at the start of the loop:
    The start, end and stride must be constant.
    You cannot break out of the loop.
    It cannot be applied to ‘while. . . do’ or ‘do. . . while’ loops.


Data parallel and embarrassingly parallel

This is an example of a data parallel problem or a map:
  Array elements are distributed evenly over the threads.
  The same operation is performed on all elements.
  Suitable for the SIMD model.

In fact, this example is so straightforward to parallelise that it is also sometimes referred to as an embarrassingly parallel problem:
  Easy to get working correctly in parallel.
  May still be a challenge to achieve good parallel performance.


Mandelbrot set generator

Code on Minerva: Mandelbrot.c, makefile

A classic computationally intensive problem in two dimensions that was once used as a benchmark for processor speeds:
  Loops over pixels, i.e. a two-dimensional, nested double loop.
  The colour of each pixel is calculated independently of all other pixels.
  Each colour calculation requires many floating point operations.


Code snippet

The part of the code that interests us here is shown below:

    // Change the colour arrays for the whole image.
    int i, j;
    for( j=0; j<numPixels_y; j++ )
        for( i=0; i<numPixels_x; i++ )
        {
            // Set the colour of pixel (i,j), i.e. modify the values
            // of red[i][j], green[i][j], and/or blue[i][j].
            setPixelColour( i, j );
        }

Note the i-loop is nested inside the j-loop.

The graphical output is performed in OpenGL/GLFW. A simple makefile has been included that should work on school machines.
  You may need to modify this for your system's OpenGL/GLFW.


What setPixelColour does

Purely for background interest, here is how the colours are calculated:
  Each pixel (i,j) is converted to floating point numbers cx, cy, both in the range -2 to 2.
  Two other floats zx and zy are initialised to zero.
  The following iteration¹ is performed until zx² + zy² ≥ 4, or a maximum number of iterations maxIters is reached:

      (zx, zy) → (zx² − zy² + cx, 2 zx zy + cy)

  The colour is selected based on the number of iterations.

¹ More concisely represented as complex numbers c and z [with e.g. zx = ℜ(z)], then the iteration is just z → z² + c.


Parallel Mandelbrot: First attempt

Parallelise only the inner loop:

    int i, j;
    for( j=0; j<numPixels_y; j++ )
        #pragma omp parallel for
        for( i=0; i<numPixels_x; i++ )
        {
            setPixelColour( i, j );
        }

This works, but is not much faster than serial – and may even be slower (check on your system).

There are multiple possible reasons for this:
  The fork-join is inside the j-loop, so threads are created and destroyed numPixels_y times, which incurs an overhead.
  This problem suffers from poor load balancing; see later.

Parallel Mandelbrot: Second attempt

Parallelise only the outer loop, so there is only a single fork event and a single join event:

    int i, j;
    #pragma omp parallel for
    for( j=0; j<numPixels_y; j++ )
        for( i=0; i<numPixels_x; i++ )
        {
            setPixelColour( i, j );
        }

This is faster . . . but wrong!
  A distorted image results.
  The distortion is different each time the program is executed.


The same variable i for the inner loop counter is being updated by all threads:
  When one thread completes a calculation, it increments i.
  Therefore other threads will skip at least one pixel.
  Threads do not calculate the full line of pixels.


Parallel Mandelbrot: Third attempt

Make the inner loop variable i private to each thread:

    int j;
    #pragma omp parallel for
    for( j=0; j<numPixels_y; j++ )
    {
        int i;
        for( i=0; i<numPixels_x; i++ )
        {
            setPixelColour( i, j );
        }
    }

. . . or (for compilers following the C99 standard):

    #pragma omp parallel for
    for( int j=0; j<numPixels_y; j++ )
        for( int i=0; i<numPixels_x; i++ )
        {
            setPixelColour( i, j );
        }


The private clause

A third way to solve this is to use OpenMP's private clause:

    int i, j;
    #pragma omp parallel for private(i)
    for( j=0; j<numPixels_y; j++ )
        for( i=0; i<numPixels_x; i++ )
        {
            setPixelColour( i, j );
        }

This creates a copy of i for each thread.
  Multiple variables may be listed, e.g. private(i,a,b,c).

The code now works, but is still not much faster than serial.
  The primary overhead is poor load balancing. We will look at this briefly next lecture, and in detail in Lecture 13.


The collapse clause

The collapse clause replaces 2 or more nested loops with a single loop, at the expense of additional internal calculations:

    #pragma omp parallel for collapse(2)
    for( int j=0; j<numPixels_y; j++ )
        for( int i=0; i<numPixels_x; i++ )
            setPixelColour( i, j );

is equivalent to (but more readable than)

    #pragma omp parallel for
    for( int k=0; k<numPixels_x*numPixels_y; k++ )
    {
        int
            i = k % numPixels_x,
            j = k / numPixels_x;
        setPixelColour( i, j );
    }

This is principally intended for short loops that cannot be equally distributed across all threads.

Determinism and non-determinism

Notice that the incorrect images were slightly different each time. [Two example distorted images omitted.]

The pixels plotted depend on the order in which threads update the shared variable i, which depends on the thread scheduler:
  This will be influenced by factors outside our control,
  e.g. the various background tasks that every OS must run.


Our serial code was deterministic, i.e. it produced the same results each time it was run.

By contrast, our (incorrect) parallel code was non-deterministic.

Often this is the result of an error, but it can sometimes be useful:
  Some algorithms, often in science and engineering, do not care about non-deterministic errors as long as they are small.
  Strictly imposing determinism may result in additional overheads and performance loss.

However, for this module we will try to develop parallel algorithms whose results match those of the serial equivalent.


Summary and next lecture

Today we have looked at data parallel problems, or maps, where the same operation is applied to multiple data elements:
  Data are distributed evenly across threads.
  Sometimes referred to as embarrassingly parallel.

In two lectures' time we will start looking at more complex problems for which the calculations on different threads are not independent.

Before then, we need to learn the vocabulary of parallel theory, which is the topic of the next lecture.
