
Introduction to Parallel Computing (CMSC416 / CMSC616)

Designing Parallel Programs


Abhinav Bhatele, Alan Sussman
Reminders / Announcements

• If you do not have a zaratan account, email: [email protected]

• When emailing, please mention your course and section number:


• Example: 416 / Section 0201

• Accommodations: please get the letters to the respective instructors soon

• Join piazza: https://piazza.com/umd/fall2024/cmsc416cmsc616

• Assignment 0 will be posted tonight (Sep 3, 11:59 pm), due on Sep 10, 11:59 pm

• Office hours have been posted on the website

Writing parallel programs

SPMD (Single Program Multiple Data) model
• Decide on the serial algorithm first

• Data: how to distribute data among threads/processes?


• Data locality: assignment of data to specific processes to minimize data movement

• Computation: how to divide work among threads/processes?

• Figure out how often communication will be needed
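
The sketch below is a minimal SPMD example in MPI (illustrative only; the variable names and block sizes are assumptions, not from the slides). Every process runs the same program and uses its rank to pick the portion of the data and computation it owns:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  // how many processes are there?

    // Same program on every process; the rank decides which block of a
    // hypothetical N-element problem this process works on.
    int N = 1000000;
    int begin = rank * (N / nprocs);
    int end   = (rank == nprocs - 1) ? N : begin + (N / nprocs);
    printf("Rank %d of %d owns elements [%d, %d)\n", rank, nprocs, begin, end);

    MPI_Finalize();
    return 0;
}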

Conway’s Game of Life

• Two-dimensional grid of (square) cells

• Each cell can be in one of two states: live or dead

• Every cell only interacts with its eight nearest neighbors

• In every generation (or iteration or time step), there are some rules that decide if a cell will continue to live, die, or be born (dead ➜ live)
[Figure credit: By Lev Kalmykov - Own work, CC BY-SA 4.0,
https://en.wikipedia.org/wiki/Conway's_Game_of_Life
https://commons.wikimedia.org/w/index.php?curid=43448735]
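
As a concrete illustration of these rules, here is a minimal sketch of one generation (a serial version; the array and function names are hypothetical, not from the slides). A dead cell with exactly three live neighbors is born, a live cell with two or three live neighbors survives, and every other cell dies or stays dead:

#define N 64   // hypothetical grid dimension

// Count the live cells among the eight nearest neighbors of (i, j).
int count_live_neighbors(int grid[N][N], int i, int j) {
    int count = 0;
    for (int di = -1; di <= 1; di++)
        for (int dj = -1; dj <= 1; dj++)
            if (di != 0 || dj != 0)
                count += grid[i + di][j + dj];
    return count;
}

// Compute the next generation into `next` from the current `grid`
// (interior cells only, to keep the sketch short).
void step(int grid[N][N], int next[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++) {
            int n = count_live_neighbors(grid, i, j);
            if (grid[i][j])
                next[i][j] = (n == 2 || n == 3);  // live cell survives
            else
                next[i][j] = (n == 3);            // dead cell is born
        }
}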



Two-dimensional stencil computation
2D 5-point Stencil

• Commonly found kernel in computational codes

• Heat diffusion, Jacobi method, Gauss-Seidel method

A[i, j] = (A[i, j] + A[i−1, j] + A[i+1, j] + A[i, j−1] + A[i, j+1]) / 5
[Figure: 3D 7-point stencil]
Serial code

Why do we keep two copies of A?

for (int t = 0; t < num_steps; t++) {
    ...
    for (int i = 1; i < n - 1; i++)       // n x n grid; update interior cells
        for (int j = 1; j < n - 1; j++)
            A_new[i][j] = (A[i][j] + A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.2;

    // copy contents of A_new into A before the next time step
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            A[i][j] = A_new[i][j];
    ...
}

For correctness, we have to ensure that elements of A are not written into before they are read in the same time step / iteration.
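
A common alternative to the copy loop (not shown on the slide, so treat this as an assumption about one possible implementation) is to allocate A and A_new on the heap and simply swap the two pointers at the end of each time step, avoiding the full-grid copy:

// A and A_new are heap-allocated arrays accessed through pointers.
double *tmp = A;
A = A_new;      // the values just computed become the "old" grid
A_new = tmp;    // the previous grid becomes scratch space for the next step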



2D stencil computation in parallel

• 1D decomposition
• Divide rows (or columns) among processes

• Each process has to communicate with two neighbors (above and below)

[Figure: 1D decomposition with ghost cells]

• 2D decomposition
• Divide both rows and columns (2D blocks) among processes

• Each process has to communicate with four neighbors
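
Below is a minimal sketch of the ghost-cell exchange for the 1D row decomposition, assuming each process stores its local_rows owned rows plus one ghost row above and one below in a flat (local_rows + 2) x n array (the function and variable names are illustrative, not from the slides):

#include <mpi.h>

// Exchange ghost rows with the neighbors above and below.
// Row 0 and row local_rows + 1 of A are the ghost rows.
void exchange_ghost_rows(double *A, int local_rows, int n,
                         int rank, int nprocs, MPI_Comm comm) {
    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    // Send my first owned row up; receive my bottom ghost row from below.
    MPI_Sendrecv(&A[1 * n],                n, MPI_DOUBLE, up,   0,
                 &A[(local_rows + 1) * n], n, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);

    // Send my last owned row down; receive my top ghost row from above.
    MPI_Sendrecv(&A[local_rows * n],       n, MPI_DOUBLE, down, 1,
                 &A[0 * n],                n, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}

With a 2D decomposition the same idea applies, but each process exchanges ghost rows and ghost columns with its four neighbors.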



Prefix sum
• Calculate sums of prefixes (running totals) of elements (numbers) in an array

• Also called a “scan” sometimes

pSum[0] = A[0];

for (int i = 1; i < N; i++) {
    pSum[i] = pSum[i-1] + A[i];
}

A:      1   2   3    4    5    6   …
pSum:   1   3   6   10   15   21   …

Parallel prefix sum
Processes/threads:   0    1    2    3    4    5    6    7
Input:               2    8    3    5    7    4    1    6

Stride 1:            2   10   11    8   12   11    5    7

Stride 2:            2   10   13   18   23   19   17   18

Stride 4:            2   10   13   18   25   29   30   36
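
The stride-doubling pattern above can be written compactly as follows (a serial sketch of the parallel algorithm; in a real shared-memory version each process/thread owns one element, and a barrier plus a copy of the previous round's values is needed so that reads at each stride finish before the writes):

#include <stdlib.h>
#include <string.h>

// Stride-doubling (Hillis-Steele) inclusive scan over pSum[0..n-1].
// At stride s, element i (for i >= s) adds the value s positions to its
// left, taken from the previous round's snapshot in `prev`.
void stride_doubling_scan(int *pSum, int n) {
    int *prev = malloc(n * sizeof(int));
    for (int s = 1; s < n; s *= 2) {
        memcpy(prev, pSum, n * sizeof(int));  // previous round's values
        for (int i = s; i < n; i++)
            pSum[i] = prev[i] + prev[i - s];
    }
    free(prev);
}

Running this on the input 2 8 3 5 7 4 1 6 reproduces the rows shown above for strides 1, 2, and 4.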



In practice

• You have N numbers and p processes, N >> p

• Assign an N/p block to each process


• Do the serial prefix sum calculation locally on the block owned by each process

• Then do the parallel algorithm with partial prefix sums (using the last element from each local block)

• The last element from the sending process is added to all elements in the receiving process' sub-block
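
One way to implement this with MPI is sketched below, using the MPI_Exscan collective in place of the hand-written stride-doubling exchange on the previous slide (the function and variable names are illustrative, not from the slides):

#include <mpi.h>

// Distributed prefix sum: every process holds its N/p block in local[].
void distributed_prefix_sum(int *local, int count, MPI_Comm comm) {
    // Step 1: serial prefix sum on the local block.
    for (int i = 1; i < count; i++)
        local[i] += local[i - 1];

    // Step 2: exclusive scan of the block totals (the last local element).
    // After MPI_Exscan, offset on rank r holds the sum of the blocks on
    // ranks 0..r-1 (the result is undefined on rank 0, so we zero it).
    int block_total = local[count - 1];
    int offset = 0;
    MPI_Exscan(&block_total, &offset, 1, MPI_INT, MPI_SUM, comm);

    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) offset = 0;

    // Step 3: add the offset to every element of the local block.
    for (int i = 0; i < count; i++)
        local[i] += offset;
}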

Load balance and grain size

• Load balance: try to balance the amount of work (computation) assigned to different threads/processes
• Bring ratio of maximum to average load as close to 1.0 as possible

• Secondary consideration: also load balance amount of communication

• Grain size: ratio of computation-to-communication


• Coarse-grained (more computation) vs. fine-grained (more communication)
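
As a small illustration of the load-balance metric above (illustrative code, not from the slides), the imbalance ratio can be computed from per-process load measurements; a value of 1.0 means perfectly balanced:

// Load imbalance ratio = maximum load / average load (ideal value: 1.0).
double imbalance(const double *load, int nprocs) {
    double max = load[0], sum = 0.0;
    for (int p = 0; p < nprocs; p++) {
        if (load[p] > max) max = load[p];
        sum += load[p];
    }
    return max / (sum / nprocs);
}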
