Intro Parallel Programming 2015
Paul Burton
April 2015
[Diagram: the serial model, with a single Processor running a Program against a single Memory]
[Diagram: a shared memory machine, with several processors (P) connected through a Network to shared memory (M)]
Advantages
- Easier to program
  - No communications
  - No need to decompose data
Disadvantages
- Memory contention?
- How do we split an algorithm?
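Taking the E=A+B loop from the next example as an illustration: on a shared memory machine the usual answer is to let each thread take a share of the loop iterations. A minimal sketch, assuming OpenMP directives (which the slides themselves do not prescribe):

! Each thread executes a different subset of i; the data stays in the one
! shared memory, so nothing has to be decomposed or communicated.
!$OMP PARALLEL DO PRIVATE(i) SHARED(A,B,E)
DO i = 1 , SIZE
   E(i) = A(i) + B(i)
ENDDO
!$OMP END PARALLEL DO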
! Calculate E=A+B
DO i = 1 , SIZE
   E(i) = A(i) + B(i)
ENDDO

! Calculate F=C*D
DO i = 1 , SIZE
   F(i) = C(i) * D(i)
ENDDO

! Write results (we'll ignore this for now…)
CALL WRITE_DATA( E , F , 100 )
Processor 1
- Local A(1) .. A(25) corresponds to A(1) .. A(25) in the original (single processor) code
Processor 2
- Local A(1) .. A(25) corresponds to A(26) .. A(50) in the original (single processor) code
Processor 3
- Local A(1) .. A(25) corresponds to A(51) .. A(75) in the original (single processor) code
Processor 4
- Local A(1) .. A(25) corresponds to A(76) .. A(100) in the original (single processor) code
[Diagram: the global array A(1:100) divided into four equal pieces, each stored as a local A(1:25): P1 holds global elements 1-25, P2 holds 26-50, P3 holds 51-75 and P4 holds 76-100]
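The same correspondence can be written as a small helper (a sketch only; the names GLOBAL_INDEX, MYPE and CHUNK are illustrative and do not come from the slides):

! Hypothetical helper: map a local index on this processor back to the index
! it had in the original, single processor array.
! MYPE is the processor number counted from 0, CHUNK is the local size (25 here).
INTEGER FUNCTION GLOBAL_INDEX( i_local , MYPE , CHUNK )
   INTEGER, INTENT(IN) :: i_local , MYPE , CHUNK
   GLOBAL_INDEX = MYPE * CHUNK + i_local
END FUNCTION GLOBAL_INDEX

! Example: on Processor 3 (MYPE = 2), local A(1) is global A( 2*25 + 1 ) = A(51)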
! Calculate E=A+B
DO i = 1 , SIZE
   E(i) = A(i) + B(i)
ENDDO

! Calculate F=C*D
DO i = 1 , SIZE
   F(i) = C(i) * D(i)
ENDDO

! Write results
! We'll ignore this for now… but it is very important and will need attention!
CALL WRITE_DATA( E , F , 100 )
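Why the write needs attention: each processor now holds only 25 of the 100 elements, so the results have to be brought back together (or written in parallel) before the full arrays can be output. A minimal sketch, assuming MPI and a gather onto processor 0 (the slides do not fix the mechanism):

! Gather the four local 25-element results onto processor 0, which then
! calls WRITE_DATA exactly as the original single processor code did.
USE MPI
REAL    :: E(25) , F(25)               ! local results held by every processor
REAL    :: E_FULL(100) , F_FULL(100)   ! complete arrays, filled only on processor 0
INTEGER :: MYPE , IERR

CALL MPI_COMM_RANK( MPI_COMM_WORLD , MYPE , IERR )
CALL MPI_GATHER( E , 25 , MPI_REAL , E_FULL , 25 , MPI_REAL , 0 , MPI_COMM_WORLD , IERR )
CALL MPI_GATHER( F , 25 , MPI_REAL , F_FULL , 25 , MPI_REAL , 0 , MPI_COMM_WORLD , IERR )

IF ( MYPE == 0 ) THEN
   CALL WRITE_DATA( E_FULL , F_FULL , 100 )
ENDIF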
[Diagram: a 2-D grid with i = 1..12 and j = 1..4, decomposed in the i-direction across P1-P4, three columns of i per processor]

NEW(i,j) = 0.5 * ( OLD(i-1,j) + OLD(i+1,j) )

At the edge of each processor's sub-domain this stencil needs OLD values that are owned by the neighbouring processor, so they must be communicated into "halo" copies before the update can be done.
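A minimal sketch of this halo exchange, assuming MPI and a 1-D view of one row j (the halo layout, variable names and tags are illustrative, not from the slides; the physical domain boundaries are ignored, as in the stencil above):

! Each processor owns points i = 1..3 of this row and keeps halo copies of its
! neighbours' edge points in OLD(0) and OLD(4).
USE MPI
REAL    :: OLD(0:4) , NEW(1:3)
INTEGER :: MYPE , NPROC , LEFT , RIGHT , IERR , STATUS(MPI_STATUS_SIZE) , i

CALL MPI_COMM_RANK( MPI_COMM_WORLD , MYPE , IERR )
CALL MPI_COMM_SIZE( MPI_COMM_WORLD , NPROC , IERR )
LEFT  = MYPE - 1
RIGHT = MYPE + 1
IF ( MYPE == 0 )         LEFT  = MPI_PROC_NULL    ! no neighbour beyond the domain edge
IF ( MYPE == NPROC - 1 ) RIGHT = MPI_PROC_NULL

! Send my right edge OLD(3) to RIGHT and receive my left halo OLD(0) from LEFT,
! then send my left edge OLD(1) to LEFT and receive my right halo OLD(4) from RIGHT.
CALL MPI_SENDRECV( OLD(3) , 1 , MPI_REAL , RIGHT , 1 , &
                   OLD(0) , 1 , MPI_REAL , LEFT  , 1 , MPI_COMM_WORLD , STATUS , IERR )
CALL MPI_SENDRECV( OLD(1) , 1 , MPI_REAL , LEFT  , 2 , &
                   OLD(4) , 1 , MPI_REAL , RIGHT , 2 , MPI_COMM_WORLD , STATUS , IERR )

! With the halos filled, every locally owned point can be updated.
DO i = 1 , 3
   NEW(i) = 0.5 * ( OLD(i-1) + OLD(i+1) )
ENDDO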
[Diagram: a distributed memory machine, with nodes of several processors (P), each node having its own local memory (M), connected to each other by a Network]
[Diagram: timeline of P1-P4 running up to a synchronisation point: each processor does its Work and then sits Idle (wasted time) until the slowest processor finishes]
Causes of Load Imbalance
- Different sized data on different processors
  - Array dimensions and NPROC may mean it's impossible to decompose the data equally between processors
    - Change the dimensions, or collapse the loop: A(13,7) -> A(13*7) (see the sketch after this list)
  - Regular geographical decomposition may not have equal work points (eg. land/sea not uniformly distributed around the globe)
    - Different decompositions required
- Different load for different data points
  - Physical parameterisations such as convection, short wave radiation
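As an illustration of the collapsed loop idea: a 13x7 array cannot be split evenly over 4 processors along either dimension, but its 91 collapsed elements can be shared out almost evenly (23, 23, 23, 22). A sketch, assuming the common "first few processors get one extra element" bounds arithmetic (the names A_FLAT, MYPE, NPROC and NTOTAL are illustrative):

! Treat A(13,7) as a 1-D array of 13*7 = 91 elements and give each of the
! NPROC processors a nearly equal contiguous share of them.
! MYPE is this processor's number counted from 0, NPROC the number of processors.
REAL    :: A_FLAT(13*7)          ! the collapsed view of A(13,7)
INTEGER :: MYPE , NPROC , NTOTAL , ISTART , IEND , k

NTOTAL = 13 * 7
! The first MOD(NTOTAL,NPROC) processors each take one extra element.
ISTART = MYPE * ( NTOTAL / NPROC ) + MIN( MYPE , MOD(NTOTAL,NPROC) ) + 1
IEND   = ISTART + ( NTOTAL / NPROC ) - 1
IF ( MYPE < MOD(NTOTAL,NPROC) ) IEND = IEND + 1

DO k = ISTART , IEND
   A_FLAT(k) = 2.0 * A_FLAT(k)   ! placeholder for the real work on element k
ENDDO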
Solutions to load imbalance:
- Transpose data
  - Change the decomposition so as to minimize load imbalance
  - A good solution if we can predict the load per point (eg. land/sea)
- Implement a master/slave solution
  - If we don't know the load per point
IF (L_MASTER) THEN
   DO chunk = 1 , nchunks
      Wait for message from a slave
      Send DATA(offset(chunk)) to that slave
   ENDDO
   Send "Finished" message to all slaves
ELSEIF (L_SLAVE) THEN
   Send message to MASTER to say I'm ready to start
   WHILE ("Finished" message not received) DO
      Receive DATA(chunk_size) from MASTER processor
      Compute DATA
      Send DATA back to MASTER
   ENDWHILE
ENDIF
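A rough MPI rendering of the same pattern (a sketch only: the tag values, CHUNK_SIZE, NCHUNKS and the other names are illustrative, and it assumes there are at least as many chunks as slaves):

! Master/slave work sharing with MPI. Processor 0 hands out chunks of DATA on
! demand; every other processor repeatedly asks for work until told to stop.
USE MPI
INTEGER, PARAMETER :: TAG_WORK = 1 , TAG_DONE = 2
INTEGER, PARAMETER :: CHUNK_SIZE = 1000 , NCHUNKS = 40
REAL    :: DATA(CHUNK_SIZE*NCHUNKS)     ! the full work array (used on the master)
REAL    :: BUF(CHUNK_SIZE)              ! one chunk in transit
INTEGER :: MYPE , NPROC , IERR , STATUS(MPI_STATUS_SIZE) , chunk , pe

CALL MPI_COMM_RANK( MPI_COMM_WORLD , MYPE , IERR )
CALL MPI_COMM_SIZE( MPI_COMM_WORLD , NPROC , IERR )

IF ( MYPE == 0 ) THEN                                     ! the master
   DO chunk = 1 , NCHUNKS
      ! Wait for any slave to report in: its first "ready" message, or a result
      CALL MPI_RECV( BUF , CHUNK_SIZE , MPI_REAL , MPI_ANY_SOURCE , MPI_ANY_TAG , &
                     MPI_COMM_WORLD , STATUS , IERR )
      ! ... store the result carried in BUF (if any), then send the next chunk ...
      CALL MPI_SEND( DATA( (chunk-1)*CHUNK_SIZE + 1 ) , CHUNK_SIZE , MPI_REAL , &
                     STATUS(MPI_SOURCE) , TAG_WORK , MPI_COMM_WORLD , IERR )
   ENDDO
   ! Collect each slave's final result and tell it to finish
   DO pe = 1 , NPROC - 1
      CALL MPI_RECV( BUF , CHUNK_SIZE , MPI_REAL , MPI_ANY_SOURCE , MPI_ANY_TAG , &
                     MPI_COMM_WORLD , STATUS , IERR )
      CALL MPI_SEND( BUF , 0 , MPI_REAL , STATUS(MPI_SOURCE) , TAG_DONE , &
                     MPI_COMM_WORLD , IERR )
   ENDDO
ELSE                                                      ! a slave
   CALL MPI_SEND( BUF , 0 , MPI_REAL , 0 , TAG_WORK , MPI_COMM_WORLD , IERR )   ! "I'm ready"
   DO
      CALL MPI_RECV( BUF , CHUNK_SIZE , MPI_REAL , 0 , MPI_ANY_TAG , &
                     MPI_COMM_WORLD , STATUS , IERR )
      IF ( STATUS(MPI_TAG) == TAG_DONE ) EXIT
      ! ... compute on BUF ...
      CALL MPI_SEND( BUF , CHUNK_SIZE , MPI_REAL , 0 , TAG_WORK , MPI_COMM_WORLD , IERR )
   ENDDO
ENDIF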
“Coarse-grain” parallelism
- Long computations between communications
- Probably requires changes to your algorithm
- May get "natural" load balancing when there are many more pieces of work than processors, since pieces with different inherent loads tend to average out