Parallel Computing
High-Performance Computing
(with Machine-Learning applications)
Microprocessor trends: 1972-2015 (figure)
Highest level of parallelism:
• Compute nodes on an interconnection network
• Possibly thousands of nodes
• Distributed memory
• Distributed or shared address space
• Scalability analysis crucial
Parallel computing platforms (cont.)
(Figure: hardware hierarchy, from the node ensemble down through nodes, CPUs/GPUs, GPCs, SMXs, cores, and CUDA cores)
Parallel program hierarchy
(Figure: software hierarchy, from process groups down through processes, thread pools, threads, GPU thread grids, GPU thread blocks, and GPU threads)
Parallel programming paradigms
Distributed Memory
• Multiple processes
• Distributed address space
• Explicit data movement (see the sketch below)
• Locality!
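To make the explicit-data-movement point concrete, here is a minimal MPI sketch; the two-process setup and buffer contents are illustrative only, not from the slides:

/* Minimal distributed-memory sketch: two processes, explicit message passing.
 * Each rank owns its own address space; data moves only via send/receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double buf[4] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 4; i++) buf[i] = i + 1;           /* local data */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* explicit move */
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %.0f ... %.0f\n", buf[0], buf[3]);
    }
    MPI_Finalize();
    return 0;
}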
Parallel programming paradigms (cont.)
Task Decomposition (example: sparse matrix factorization)
Task decomposition example: Chess Program
(Figure: a task tree with one task per piece: R1, K1, B1, Q, P1, P2, ...)
• Each task evaluates all moves of a single piece (branch-and-bound)
• Small data (the board position) can be replicated
• Dynamic load balancing is required (see the manager-worker sketch below)
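One standard way to obtain the dynamic load balancing the last bullet calls for is a centralized manager-worker scheme. The sketch below is illustrative: the task count, tags, and the evaluate_piece stub are hypothetical, not from the slides.

/* Sketch: centralized dynamic load balancing with MPI.
 * Rank 0 hands out "piece" task IDs on demand; workers evaluate moves. */
#include <mpi.h>

#define NUM_TASKS 16   /* e.g., one task per piece */
#define TAG_WORK  1
#define TAG_STOP  2

static int evaluate_piece(int piece) { return piece * piece; } /* stub scoring */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* manager */
        int next = 0, done = 0, result;
        MPI_Status st;
        while (done < size - 1) {
            /* Any incoming message doubles as a work request. */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < NUM_TASKS) {        /* more work: send next task */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {                       /* no work left: tell worker to stop */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                done++;
            }
        }
    } else {                               /* worker */
        int task, score = 0;               /* first send is a dummy request */
        MPI_Status st;
        for (;;) {
            MPI_Send(&score, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            score = evaluate_piece(task);  /* evaluate all moves of one piece */
        }
    }
    MPI_Finalize();
    return 0;
}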
Data decomposition example
(Figure: matrix-vector product A · x = y with the rows of A partitioned among processes P1-P9; a row-wise distribution sketch follows below)
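As a concrete setup for such a row-wise data decomposition (the matrix size and the divisibility assumption are choices made here, not given on the slides), the root process can distribute contiguous row blocks with MPI_Scatter:

/* Sketch: 1-D row-wise data decomposition of an n x n matrix.
 * Assumes n is divisible by the number of processes p. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, p, n = 512;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int rows = n / p;                       /* rows owned by each process */
    double *A = NULL;
    if (rank == 0) {                        /* full matrix lives only on root */
        A = malloc((size_t)n * n * sizeof *A);
        for (size_t k = 0; k < (size_t)n * n; k++) A[k] = 1.0;  /* dummy fill */
    }
    double *Aloc = malloc((size_t)rows * n * sizeof *Aloc);     /* local block */

    /* Process i receives rows [i*rows, (i+1)*rows) of A. */
    MPI_Scatter(A, rows * n, MPI_DOUBLE,
                Aloc, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... local computation on Aloc ... */

    free(Aloc);
    if (rank == 0) free(A);
    MPI_Finalize();
    return 0;
}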
Parallel application design guidelines
Scalability analysis
Efficiency: $E = S/p = T_S/(p\,T_P)$
Isoefficiency relation: $W = K\,T_O(W, p)$, where $K = E/(1-E)$
To maintain a fixed efficiency, the problem size must grow at the rate of the overhead: $W \sim T_O$
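Where the relation comes from, as a short derivation (it assumes the standard definition of total overhead, $T_O = p\,T_P - T_S$, which the slides use implicitly):

\[
E \;=\; \frac{T_S}{p\,T_P} \;=\; \frac{T_S}{T_S + T_O} \;=\; \frac{1}{1 + T_O/T_S}.
\]

Identifying the problem size $W$ with the serial work $T_S$ and solving for $W$:

\[
W \;=\; \frac{E}{1-E}\,T_O(W, p) \;=\; K\,T_O(W, p).
\]

Holding $E$ constant as $p$ grows therefore forces $W$ to grow at the same rate as $T_O$.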
Scalability analysis: $W = O(n^3)$
Algorithm A: efficiency is maintained when $n \sim \sqrt{p}$, so $W \sim p^{1.5}$
Algorithm B: efficiency is maintained when $n^{1.5} \sim p$, i.e. $n \sim p^{2/3}$, so $W \sim p^{2}$
Algorithm A is the more scalable: its isoefficiency function grows more slowly with $p$.
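A worked instance with illustrative numbers (not from the slides): at $p = 64$,

\[
W_A \sim 64^{1.5} = 512, \qquad W_B \sim 64^{2} = 4096,
\]

so algorithm B must solve a problem roughly $8\times$ larger than algorithm A to sustain the same efficiency.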
Parallel algorithm design and analysis
(Figure: A · x = y with A block-distributed over a 3×3 grid of processes P1-P9)
Parallel matrix-vector multiplication: 1-D decomposition
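A minimal C/MPI sketch of the 1-D row-wise algorithm (the function name, the divisibility assumption n % p == 0, and the use of MPI_Allgather to replicate x are choices made here, not fixed by the slides):

/* Sketch: matrix-vector multiplication, 1-D row-wise decomposition.
 * Each process owns n/p rows of A and n/p entries of x and y.
 * Step 1: all-gather the distributed x so every process has the full vector.
 * Step 2: multiply the local row block. */
#include <mpi.h>
#include <stdlib.h>

void matvec_1d(const double *Aloc, const double *xloc, double *yloc,
               int n, int p, MPI_Comm comm) {
    int rows = n / p;
    double *x = malloc((size_t)n * sizeof *x);

    /* Communication: replicate all local pieces of x on every process. */
    MPI_Allgather(xloc, rows, MPI_DOUBLE, x, rows, MPI_DOUBLE, comm);

    /* Computation: y_i = sum_j A_ij * x_j for the locally owned rows. */
    for (int i = 0; i < rows; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += Aloc[(size_t)i * n + j] * x[j];
        yloc[i] = s;
    }
    free(x);
}

The all-gather is the entire communication cost of this algorithm, and it is what drives the isoefficiency analysis below.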
Parallel matrix-vector multiplication: 2-D decomposition
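A corresponding sketch of the 2-D block algorithm, again under assumed conventions (a square q × q process grid created beforehand with MPI_Cart_create, x blocks starting on grid row 0, y blocks collected in grid column 0):

/* Sketch: matrix-vector multiplication, 2-D block decomposition.
 * p = q*q processes form a q x q grid; each owns an (n/q) x (n/q) block of A.
 * Step 1: broadcast x blocks down each process column.
 * Step 2: multiply the local block.
 * Step 3: sum-reduce partial results across each process row. */
#include <mpi.h>
#include <stdlib.h>

void matvec_2d(const double *Aloc, double *xblk, double *yblk,
               int n, int q, MPI_Comm grid) {
    MPI_Comm rowcomm, colcomm;
    int b = n / q;                          /* block dimension, assumes n % q == 0 */

    int keep_row[2] = {0, 1};               /* keep dim 1: processes in my grid row */
    int keep_col[2] = {1, 0};               /* keep dim 0: processes in my grid column */
    MPI_Cart_sub(grid, keep_row, &rowcomm);
    MPI_Cart_sub(grid, keep_col, &colcomm);

    /* 1. Broadcast the x block from grid row 0 down each column. */
    MPI_Bcast(xblk, b, MPI_DOUBLE, 0, colcomm);

    /* 2. Local block times local piece of x. */
    double *part = calloc((size_t)b, sizeof *part);
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++)
            part[i] += Aloc[(size_t)i * b + j] * xblk[j];

    /* 3. Sum partial y blocks across each row; result lands in grid column 0. */
    MPI_Reduce(part, yblk, b, MPI_DOUBLE, MPI_SUM, 0, rowcomm);

    free(part);
    MPI_Comm_free(&rowcomm);
    MPI_Comm_free(&colcomm);
}

Compared with the 1-D version, each process now communicates blocks of size n/√p instead of a full vector of size n, which is the source of the better scalability shown next.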
Scalability analysis of matrix-vector multiplication
• 1-D decomposition: $W \sim p^2$
• 2-D decomposition: $W \sim p\,\log^2 p$
The 2-D decomposition is therefore the more scalable formulation.
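A sketch of where these isoefficiency functions come from, using the standard message-passing cost model ($t_s$ = message startup time, $t_w$ = per-word transfer time); the cost model is an assumption, not stated on the slides. For the 1-D algorithm, the all-gather of $x$ gives

\[
T_O \;=\; t_s\,p\log p + t_w\,n\,p
\;\Rightarrow\; n^2 \sim K\,t_w\,n\,p
\;\Rightarrow\; n \sim p, \quad W = n^2 \sim p^2 .
\]

For the 2-D algorithm, the column broadcast and row reduction of blocks of size $n/\sqrt{p}$ give

\[
T_O \;=\; t_s\,p\log p + t_w\,n\sqrt{p}\,\log p
\;\Rightarrow\; n \sim \sqrt{p}\,\log p, \quad W \sim p\log^2 p .
\]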