
An Introduction to Parallel Programming

Paul Burton
April 2015

Introduction

- Syntax is easy
  - And can always be found in books/web pages if you can't remember!
- How to think about parallel programming is more difficult
  - But it's essential!
  - A good mental model enables you to use the OpenMP and MPI we will teach you
  - It can be a struggle to start with
  - Persevere!
- What this module will cover
  - Revision: what does a parallel computer look like?
  - Different programming models and how to think about them
  - What is needed for best performance

What do we see? - How do we see!?

What does a computer do?

[Diagram: a program runs on a processor, which reads and writes data held in memory]

How do we make it go faster? [1]

- Make the processor go faster
  - Give it a faster clock (more operations per second)
- Give the processor more ability
  - For example, allow it to calculate a square root
- But...
  - It gets very expensive to keep doing this
  - Need to keep packing more onto a single silicon chip
    - Need to make everything smaller
  - Chips get increasingly complex
    - Take longer to design and debug
  - Difficult and very expensive for memory speed to keep up
  - Chips produce more and more heat

How do we make it go faster? [2]

- Introduce multiple processors
- Advantages:
  - "Many hands make light work"
  - Each individual processor can be less powerful
    - Which means it's cheaper to buy and run (less power)
- Disadvantages:
  - "Too many cooks spoil the broth"
  - One task, many processors
    - We need to think about how to share the task amongst them
    - We need to co-ordinate carefully
  - We need a new way of writing our programs

Limits to parallel performance?

- Parallelisation is not a limitless way to infinite performance!
- Algorithms and computer hardware put limits on performance
- Amdahl's Law
  - Consider an algorithm (program!)
  - Some parts of it (fraction "p") can be run in parallel
  - Some parts of it (fraction "s") cannot be run in parallel
    - Nature of the algorithm
    - Hardware constraints (writing to a disk, for example)
  - It takes time "t" to run on a single processor
  - On "n" processors it takes: T = s x t + (p x t)/n (see the formula below)
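
Written out in LaTeX (the speedup expression and its large-n limit are added here for reference; they follow directly from the slide's formula, assuming s + p = 1 as implied above):

  T(n) = s\,t + \frac{p\,t}{n},
  \qquad
  S(n) = \frac{t}{T(n)} = \frac{1}{\,s + p/n\,}
  \;\longrightarrow\; \frac{1}{s} \quad \text{as } n \to \infty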

Consequences of Amdahl's Law [1]

- T = s x t + (p x t)/n
  - Looks simple, but "s" has devastating consequences!
- Consider the case where the number of processors "n" grows large; then we get:
  - T = s x t + [something small]
- So our performance is limited by the non-parallel part of our algorithm

Consequences of Amdahl's Law [2]

- For example, assume we can parallelise 99% of our algorithm, which takes 100 seconds on 1 processor.
- On 10 processors we get: T[10] = 0.01*100 + (0.99*100)/10
  - T[10] = 1 + 9.9 = 10.9 seconds
  - 9.2 times speedup: not too bad - we're "wasting" 8%
- But on 100 processors we get:
  - T[100] = 1 + 0.99 = 1.99 seconds
  - 50 times speedup: not so good - we're "wasting" 50%
- And on 1000 processors we get:
  - T[1000] = 1 + 0.099 = 1.099 seconds, about 91 times speedup: terrible!
  - We're "wasting" 91%! (a small program reproducing these numbers follows below)
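
As a quick check of the arithmetic above, a minimal Fortran sketch (not part of the original slides) that prints T, speedup and efficiency for these processor counts:

PROGRAM AMDAHL
  IMPLICIT NONE
  REAL, PARAMETER :: S  = 0.01       ! serial fraction
  REAL, PARAMETER :: P  = 0.99       ! parallel fraction
  REAL, PARAMETER :: T1 = 100.0      ! single-processor time (seconds)
  INTEGER, DIMENSION(3), PARAMETER :: NPROCS = (/ 10, 100, 1000 /)
  REAL :: tn, speedup, efficiency
  INTEGER :: k

  DO k = 1, 3
     tn         = S*T1 + P*T1/REAL(NPROCS(k))   ! Amdahl's Law
     speedup    = T1/tn
     efficiency = speedup/REAL(NPROCS(k))
     PRINT '(A,I5,A,F7.3,A,F6.1,A,F5.1,A)', 'n=', NPROCS(k), &
          '  T=', tn, 's  speedup=', speedup, '  efficiency=', 100.0*efficiency, '%'
  ENDDO
END PROGRAM AMDAHL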

How do we program a parallel computer?

- Decompose (split) into parts
  - Algorithm (the program) [eg. a car production line]
    or
  - Data [eg. a telephone call centre]
- Distribute the parts
  - Multiple processors work simultaneously
- Algorithmic considerations (algorithm/data dependencies)
  - Need to ensure the work is properly synchronised
  - Possibly need to communicate between processors
- Hardware considerations
  - What parallel architecture (hardware) are we using?

Parallel architectures (revision)

- The parallel programming technique will reflect the architecture

[Diagram: shared memory - several processors (P) connected to a single memory (M); distributed memory - each processor (P) has its own memory (M), and the processors are connected by a network]

Shared memory programming

[Diagram: four processors (P) sharing a single memory (M); each processor runs a single "thread"]

- Split (decompose) the computation
  - "Functional parallelism"
- Each thread works on a subset of the computation
- No communication
  - It is implicit through the common memory
- Advantages
  - Easier to program
    - no communications
    - no need to decompose data
- Disadvantages
  - Memory contention?
  - How do we split an algorithm?

A simple program

INTEGER, PARAMETER :: SIZE=100

REAL, DIMENSION (SIZE) :: A,B,C,D,E,F
INTEGER :: i

! Read arrays A,B,C,D from a disk
! (we'll ignore how the I/O works for now...)
CALL READ_DATA ( A , B , C , D , 100 )

! Calculate E=A+B
DO i = 1 , SIZE
  E(i) = A(i) + B(i)
ENDDO

! Calculate F=C*D
DO i = 1 , SIZE
  F(i) = C(i) * D(i)
ENDDO

! Write results
CALL WRITE_DATA( E , F , 100 )

A shared memory approach

- Split the function across the threads
  - In the example we have two functions: E=A+B and F=C*D
  - But we have 4 processors (threads) - two would be idle!
- So what we do is split the computation of each loop between the threads
- We need some new syntax to tell the compiler/computer what we want it to do
  - OpenMP - compiler directives
  - For now we'll just use some descriptive text
- We don't really care which processor/thread does which computations
  - The shared memory means that each processor/thread can read/write to any array element

Shared memory program

This is easy on a shared memory machine, as all threads can read/write to the whole of each array.

INTEGER, PARAMETER :: SIZE=100

REAL, DIMENSION (SIZE) :: A,B,C,D,E,F
INTEGER :: i

! Read arrays A,B,C,D from a disk
CALL READ_DATA ( A , B , C , D , 100 )

! Calculate E=A+B and F=C*D
! (Merged loops to fit onto slide!)
! OpenMP : Distribute loop over NPROC processors
! OpenMP : Private variables : i
DO i = 1 , SIZE
  E(i) = A(i) + B(i)
  F(i) = C(i) * D(i)
ENDDO

! Write results
CALL WRITE_DATA( E , F , 100 )
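
For readers who want a preview, a sketch of how the descriptive "OpenMP :" comments above would look as real OpenMP directives (standard OpenMP syntax; READ_DATA and WRITE_DATA remain the slide's placeholder routines, and the number of threads comes from the usual OMP_NUM_THREADS setting):

INTEGER, PARAMETER :: SIZE=100
REAL, DIMENSION (SIZE) :: A,B,C,D,E,F
INTEGER :: i

CALL READ_DATA ( A , B , C , D , 100 )

! Distribute the loop iterations over the available threads;
! the loop index i is private to each thread, the arrays are shared
!$OMP PARALLEL DO PRIVATE(i) SHARED(A,B,C,D,E,F)
DO i = 1 , SIZE
  E(i) = A(i) + B(i)
  F(i) = C(i) * D(i)
ENDDO
!$OMP END PARALLEL DO

CALL WRITE_DATA( E , F , 100 )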

Directives

- Usually placed before a loop
- They tell the computer:
  - How many threads to split the iterations of the loop between
  - Any variables which are "private" (the default is that variables are "shared")
    - "private" - each thread has an independent version of the variable
    - "shared" - all threads can read/write the same variable
    - The loop index must be private - each thread must have its own independent loop index so that it can keep track of what it's doing
  - Optionally, some tips on how to split the iterations of the loop between threads

How to think about it

- The program runs on a single processor P1 - as a single thread.
- Until...
  - It meets an OpenMP directive (typically before a loop)
  - This starts up the other processors (P2,P3,P4) - each running a single "thread"
    - Each thread takes a "chunk" of computations
    - This is repeated until all the computations are done
  - When the loop is finished (ENDDO) all the other processors (P2,P3,P4) go back to sleep, and execution continues as a single thread running on processor P1

How to do it

- Identify parts of the algorithm (typically loops) which can be split (parallelised) between processors
- Possibly rewrite the algorithm to allow it to be (more efficiently) parallelised
  - In our example we merged two loops - this can be more efficient than starting up all the parallel threads multiple times
- For a given loop, identify any "private" variables
  - eg. loop index, partial sum, etc. (see the sketch below)
- Insert a directive telling the computer how to split the loop between processors
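
As an illustration of the "partial sum" case, a minimal sketch (not from the slides, and assuming the same SIZE and array A as in the earlier example): each thread accumulates its own private partial sum, and OpenMP's standard REDUCTION clause combines them at the end of the loop.

REAL, DIMENSION (SIZE) :: A
REAL :: total
INTEGER :: i

total = 0.0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:total)
DO i = 1 , SIZE
  total = total + A(i)      ! each thread accumulates a private partial sum
ENDDO
!$OMP END PARALLEL DO
! after the loop, "total" holds the combined sum of the per-thread partial sums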

Distributed memory programming

[Diagram: four processors (P), each with its own memory (M), connected by a network; each processor runs a single "task"]

- Split (decompose) the data
  - "Data parallelism"
- Each processor/task works on a subset of the data
- Processors communicate over the network
- Advantages
  - Easily scalable (assuming a good network)
- Disadvantages
  - Need to think about how to split our data
  - Need to think about dependencies and communications

A distributed memory approach [1]

- Split (decompose) the data between the tasks
- We'll need to do something clever for input/output of the data
  - We'll ignore this for now
- Each task will compute its share of the full data set
  - There shouldn't be any problem with load balance (if we decompose the data well)
- Computation is easy in this example
  - No dependencies between different elements of the arrays
  - If we had expressions like A(i)=B(i-1)+B(i+1) we would need to be a bit more clever...

A distributed memory approach [2]

- Split the data between processors
  - Each processor will now have 25 (100 / 4) elements per array
  - REAL, DIMENSION (SIZE/4) :: A,B,C,D,E,F
- Processor 1
  - Local A(1) .. A(25) corresponds to A(1) .. A(25) in the original (single processor) code
- Processor 2
  - Local A(1) .. A(25) corresponds to A(26) .. A(50) in the original (single processor) code
- Processor 3
  - Local A(1) .. A(25) corresponds to A(51) .. A(75) in the original (single processor) code
- Processor 4
  - Local A(1) .. A(25) corresponds to A(76) .. A(100) in the original (single processor) code (the general index formula is sketched below)
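
The mapping above can be written as a one-line formula. A small sketch (the names mytask and LOCAL_SIZE are illustrative, not from the slides), assuming processors are numbered 1..4 as on this slide and each holds LOCAL_SIZE = 25 elements:

! global index of local element i on processor "mytask" (mytask = 1..4)
iglobal = (mytask - 1)*LOCAL_SIZE + i     ! eg. processor 3, i=1  ->  iglobal = 51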

Distributed memory data mapping (array "A")

[Diagram: the global array A(1:100) laid out as four consecutive blocks of 25 elements, one block per processor]

Global A(1:100):    1-25  |  26-50  |  51-75  |  76-100
P1  local A(1:25)  <-  global   1-25
P2  local A(1:25)  <-  global  26-50
P3  local A(1:25)  <-  global  51-75
P4  local A(1:25)  <-  global  76-100

Distributed memory program

The reading and writing of the data (READ_DATA / WRITE_DATA) is ignored here for now... but it is very important and will need attention!

INTEGER, PARAMETER :: NPROC=4

INTEGER, PARAMETER :: SIZE=100/NPROC
REAL, DIMENSION (SIZE) :: A,B,C,D,E,F
INTEGER :: i

! Read arrays A,B,C,D from a disk
CALL READ_DATA ( A , B , C , D , 100 )

! Calculate E=A+B
DO i = 1 , SIZE
  E(i) = A(i) + B(i)
ENDDO

! Calculate F=C*D
DO i = 1 , SIZE
  F(i) = C(i) * D(i)
ENDDO

! Write results
CALL WRITE_DATA( E , F , 100 )
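
A hedged MPI sketch of the same idea (not the slide's code: the I/O placeholders are replaced by random data, and MPI ranks are numbered from 0 rather than the slide's 1..4). Each task discovers how many tasks exist and computes only its own share; the MPI calls themselves are standard:

PROGRAM DIST_SKETCH
  USE mpi
  IMPLICIT NONE
  INTEGER, PARAMETER :: GLOBAL_SIZE = 100
  INTEGER :: ierr, mytask, ntasks, local_size, i
  REAL, ALLOCATABLE :: A(:), B(:), E(:)

  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, mytask, ierr)   ! 0 .. ntasks-1
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)

  local_size = GLOBAL_SIZE / ntasks                  ! assumes ntasks divides 100
  ALLOCATE(A(local_size), B(local_size), E(local_size))

  ! Stand-in for the decomposed read: each task fills its own share
  CALL RANDOM_NUMBER(A)
  CALL RANDOM_NUMBER(B)

  DO i = 1, local_size
     E(i) = A(i) + B(i)
  ENDDO

  PRINT *, 'Task', mytask, 'computed', local_size, 'elements of E'

  CALL MPI_FINALIZE(ierr)
END PROGRAM DIST_SKETCH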

How to think about it

- Each task runs its own copy of the program
- Each task's data is private to it
- Each task operates on a subset of the data
- Sometimes there are dependencies between data on different tasks
  - Tasks must explicitly communicate with one another
  - Message passing key concepts:
    - One task sends a message to one or more other tasks
    - Those tasks receive the message
    - Synchronisation: all (or a subset of) tasks wait until they have all reached a certain point

How to do it

- Think about how to split (decompose) the data
  - Minimize dependencies (which array dimension should we decompose?)
  - Equal load balance (size of data and/or computation required)
  - May need different decompositions in different parts of the code
- Add code to distribute input data across tasks
  - And to collect it when writing out
- Watch out for end cases / edge conditions
  - For example, code which implements a wrap-around at the boundaries
  - The first/last item in a loop isn't necessarily the real "edge" of the data on every task
  - Maybe some extra logic is required to check
- Identify data dependencies
  - Communicate data accordingly
  - Add code to transpose data if changing decomposition

Decomposing Data [1]

[Diagram: a 12 x 4 grid of points, i = 1..12 along the horizontal axis, j = 1..4 along the vertical axis]

REAL, DIMENSION (12,4) :: OLD,NEW
DO j=1,4
  DO i=2,11
    NEW(i,j)=0.5*(OLD(i-1,j)+OLD(i+1,j))
  ENDDO
ENDDO

Decomposing Data [2]

- Let's think about decomposing the "i" dimension

[Diagram: the 12 x 4 grid split along i into four blocks of three columns: P1 gets i=1-3, P2 gets i=4-6, P3 gets i=7-9, P4 gets i=10-12]

NEW(i,j)=0.5*(OLD(i-1,j)+OLD(i+1,j))

- How do we calculate element (3,1) - on P1?
  - We need element (2,1), which is on P1 - OK
  - And element (4,1), which is on P2 - Oh!
- So we need to do some message passing (a halo-exchange sketch follows below)
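
A hedged sketch of that message passing (illustrative only: it assumes the i-decomposition above, one "halo" column on each side of the three owned columns, the mytask/ntasks variables and MPI environment from the earlier sketch, and standard MPI_SENDRECV calls; the variable names are invented for the example):

! Local array holds 3 owned columns plus one halo column on each side
REAL, DIMENSION (0:4,4) :: OLD           ! columns 1..3 owned, 0 and 4 are halos
INTEGER :: left, right, ierr
INTEGER, DIMENSION (MPI_STATUS_SIZE) :: status

left  = mytask - 1                        ! task owning the column to my left
right = mytask + 1                        ! task owning the column to my right
IF (left  < 0)        left  = MPI_PROC_NULL
IF (right > ntasks-1) right = MPI_PROC_NULL

! Send my last owned column to the right neighbour, receive my left halo
CALL MPI_SENDRECV(OLD(3,:), 4, MPI_REAL, right, 1, &
                  OLD(0,:), 4, MPI_REAL, left,  1, &
                  MPI_COMM_WORLD, status, ierr)
! Send my first owned column to the left neighbour, receive my right halo
CALL MPI_SENDRECV(OLD(1,:), 4, MPI_REAL, left,  2, &
                  OLD(4,:), 4, MPI_REAL, right, 2, &
                  MPI_COMM_WORLD, status, ierr)
! Now the i=2..11 update can be computed locally using the halo columns
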

Decomposing Data [3]

- Let's think about decomposing the "j" dimension

[Diagram: the 12 x 4 grid split along j into four rows of twelve points, one j row per processor (P1-P4)]

NEW(i,j)=0.5*(OLD(i-1,j)+OLD(i+1,j))

- Now no communication is needed
  - This is a much better decomposition for this problem
- Not so easy in real life!
  - Real codes often have dependencies in all dimensions
  - Minimize communication, or transpose

Shared & Distributed Memory programs

- Many (most!) HPC systems combine architectures
  - A node is often a shared memory computer with a number of processors and a single shared memory
  - Memory is distributed between nodes
- Shared memory programming on a node
- Distributed memory programming between nodes (a hybrid sketch follows below)

[Diagram: four nodes connected by a network; each node contains four processors (P) sharing a single memory (M)]
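
A minimal hybrid sketch of that combination (illustrative only, not from the slides): one MPI task per node, with OpenMP threads sharing that task's arrays inside the node. MPI_INIT_THREAD and the OpenMP directive are standard; the loop body and LOCAL_SIZE are stand-ins.

PROGRAM HYBRID_SKETCH
  USE mpi
  IMPLICIT NONE
  INTEGER, PARAMETER :: LOCAL_SIZE = 25
  INTEGER :: ierr, mytask, provided, i
  REAL, DIMENSION (LOCAL_SIZE) :: A, B, E

  ! Distributed memory between nodes: one MPI task per node
  CALL MPI_INIT_THREAD(MPI_THREAD_FUNNELED, provided, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, mytask, ierr)

  CALL RANDOM_NUMBER(A)
  CALL RANDOM_NUMBER(B)

  ! Shared memory within the node: OpenMP threads share this task's arrays
!$OMP PARALLEL DO PRIVATE(i)
  DO i = 1, LOCAL_SIZE
     E(i) = A(i) + B(i)
  ENDDO
!$OMP END PARALLEL DO

  PRINT *, 'Task', mytask, 'finished its share'
  CALL MPI_FINALIZE(ierr)
END PROGRAM HYBRID_SKETCH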

Load Balancing

- Aim to have an equal computational load on each processor
  - Otherwise some processors sit idle, waiting for others to complete their work
  - Maximum efficiency is gained when all processors are working

[Diagram: timelines for P1-P4 running up to a synchronisation point; each processor works for a different length of time, and the time the faster ones spend waiting for the slowest is idle/wasted]

Causes of Load Imbalance

- Different sized data on different processors
  - Array dimensions and NPROC may mean it's impossible to decompose the data equally between processors
    - Change dimensions, or collapse the loop: A(13,7) -> A(13*7)
  - A regular geographical decomposition may not have an equal number of work points (eg. land/sea is not uniformly distributed around the globe)
    - Different decompositions required
- Different load for different data points
  - Physical parameterisations such as convection and short wave radiation

Improving Load Balance : Distributed Memory

- Transpose data
  - Change the decomposition so as to minimize load imbalance
  - A good solution if we can predict the load per point (eg. land/sea)
- Implement a master/slave solution
  - If we don't know the load per point (pseudocode; an MPI sketch follows below):

IF (L_MASTER) THEN
  DO chunk=1,nchunks
    Wait for message from a slave
    Send DATA(offset(chunk)) to that slave
  ENDDO
  Send "Finished" message to all slaves
ELSEIF (L_SLAVE) THEN
  Send message to MASTER to say I'm ready to start
  WHILE ("Finished" message not received) DO
    Receive DATA(chunk_size) from MASTER processor
    Compute DATA
    Send DATA back to MASTER
  ENDWHILE
ENDIF
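
A hedged MPI sketch of that master/slave pattern (illustrative only: the tag values, chunk handling and the "work" itself are invented for the example, and it assumes the MPI environment, mytask and ntasks from the earlier sketches; the MPI calls are standard):

INTEGER, PARAMETER :: TAG_WORK = 1, TAG_DONE = 2, CHUNK_SIZE = 100
INTEGER :: chunk, nchunks, slave, ierr
INTEGER, DIMENSION (MPI_STATUS_SIZE) :: status
REAL, DIMENSION (CHUNK_SIZE) :: buf

IF (mytask == 0) THEN                                  ! master
  nchunks = 50                                         ! illustrative value
  DO chunk = 1, nchunks
    ! Wait for any slave to report in, then send it the next chunk
    ! (a real code would store the returned result and send DATA(offset(chunk)))
    CALL MPI_RECV(buf, CHUNK_SIZE, MPI_REAL, MPI_ANY_SOURCE, &
                  MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
    slave = status(MPI_SOURCE)
    CALL MPI_SEND(buf, CHUNK_SIZE, MPI_REAL, slave, TAG_WORK, MPI_COMM_WORLD, ierr)
  ENDDO
  DO slave = 1, ntasks-1                               ! collect final results, then stop everyone
    CALL MPI_RECV(buf, CHUNK_SIZE, MPI_REAL, MPI_ANY_SOURCE, &
                  MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
    CALL MPI_SEND(buf, CHUNK_SIZE, MPI_REAL, status(MPI_SOURCE), TAG_DONE, &
                  MPI_COMM_WORLD, ierr)
  ENDDO
ELSE                                                   ! slave
  buf = 0.0                                            ! "I'm ready to start" message
  CALL MPI_SEND(buf, CHUNK_SIZE, MPI_REAL, 0, TAG_WORK, MPI_COMM_WORLD, ierr)
  DO
    CALL MPI_RECV(buf, CHUNK_SIZE, MPI_REAL, 0, MPI_ANY_TAG, &
                  MPI_COMM_WORLD, status, ierr)
    IF (status(MPI_TAG) == TAG_DONE) EXIT
    buf = buf * 2.0                                    ! stand-in for the real work
    CALL MPI_SEND(buf, CHUNK_SIZE, MPI_REAL, 0, TAG_WORK, MPI_COMM_WORLD, ierr)
  ENDDO
ENDIF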

Improving Load Balance : Shared memory

- Generally much easier
- In IFS we add an extra "artificial" dimension to arrays
  - Allows arrays to be easily handled using OpenMP
- So we write loops like this:

REAL, DIMENSION (SIZE/NCHUNKS,NCHUNKS) :: A,B
! OpenMP : Distribute loop over NPROC (NPROC<=NCHUNKS) processors
! OpenMP : Private variables : chunk,i
DO chunk=1,NCHUNKS
  DO i=1,SIZE/NCHUNKS
    B(i,chunk)=Some_Complicated_Function(A(i,chunk))
  ENDDO
ENDDO

- Make NCHUNKS >> NPROC
  - Load balancing will happen automatically (see the OpenMP sketch below)
- Other performance benefits can be had by tuning the inner loop size
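
A hedged sketch of how that descriptive directive could be written in real OpenMP. The DYNAMIC schedule is my addition (it is not named on the slide): it hands chunks to threads as they become free, which is what makes the load balancing automatic when NCHUNKS is much larger than the number of threads.

!$OMP PARALLEL DO PRIVATE(chunk,i) SCHEDULE(DYNAMIC)
DO chunk=1,NCHUNKS
  DO i=1,SIZE/NCHUNKS
    B(i,chunk)=Some_Complicated_Function(A(i,chunk))
  ENDDO
ENDDO
!$OMP END PARALLEL DO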

Granularity

- The ratio between computation and communication
- "Fine-grain" parallelism
  - Small number of compute instructions between synchronisations
  - Reduces the changes needed to your algorithm
  - Can amplify load balance problems
  - Gives a high communications overhead
    - Eventually the communications time will swamp the computation time
    - Gets worse as you increase NPROC or decrease the problem size
- "Coarse-grain" parallelism
  - Long computations between communications
  - Probably requires changes to your algorithm
  - May get "natural" load balancing, as more work with different inherent load balance is combined
- The best granularity is dependent on your algorithm and hardware
- Generally, "coarse-grain" improves scalability

Steps to parallelisation (1)

- Identify parts of the program that can be executed in parallel
  - Requires a thorough understanding of the algorithm
- Exploit any inherent parallelism which may exist
- Expose parallelism by
  - Re-ordering the algorithm
  - Tweaking to remove dependencies
  - Complete reformulation to a new, more parallel algorithm
  - Google is your friend!
    - You're unlikely to be the first person to try and parallelise a given algorithm!

Steps to parallelisation (2)

- Decompose the program
  - Probably a combination of
    - Data parallelism (hard!) for distributed memory
    - Functional parallelism (easier, hopefully!) for shared memory
  - If you're likely to need more than a few tens of processors to run your problem, then a distributed memory solution will be required
    - Shared memory parallelism can be added as a second step, and can be added to individual parts of the algorithm in stages
  - Identify the key data structures and data dependencies and how best to decompose them

Steps to parallelisation (3)

- Code development
  - Parallelisation may be influenced by your machine's architecture
    - But try to have a flexible design - you won't use this machine for ever!
  - Decompose key data structures
  - Add new data structures to describe and control the decomposition (eg. offsets, mapping to/from global data, neighbour identification)
  - Identify data dependencies and add the necessary communications
- And finally, the fun bit: CAT & DOG
  - Compile And Test
  - Debug, Optimise and Google!

Some questions to think about...

- Which do you think is easier to understand?
  - Distributed memory parallelism (message passing) or shared memory parallelism?
- Which do you think is easier to implement?
- Which do you think might be easier to debug?
  - Can you imagine the kind of errors that you might make, and how you might be able to find them?
- Do you think one may be more scalable than the other? Why?
- Why should we have to do all this work anyway? Why can't the compiler do it all for us?

© ECMWF 2015