Parallel Computing

This document introduces parallel and high-performance computing concepts with a focus on machine learning applications. It covers parallel computing basics including why parallelism is needed, computing platforms, algorithm design principles like decomposition, and programming models like distributed memory, shared memory and accelerators. The document is split into two parts, with part one focusing on parallel computing fundamentals and part two covering parallel algorithms for machine learning tasks like classification.


Introduction to Parallel and

High-Performance Computing
(with Machine-Learning applications)

Anshul Gupta, IBM Research


Prabhanjan (Anju) Kambadur, Bloomberg L.P.
Introduction to Parallel and High-Performance
Computing (with Machine-Learning applications)

Part 1: Parallel computing basics and parallel algorithm analysis
Part 2: Parallel algorithms for building a classifier
Part 1: Parallel and High-Performance Computing

• Why parallel computing?
• Parallel computing platforms
• Parallel algorithm basics
• Decomposition for parallelism
• Parallel programming models
• Parallel algorithm analysis
Why parallel computing?

[Figure: Microprocessor trends, 1972–2015]

Kirk M. Bresniker, Sharad Singhal, R. Stanley Williams, "Adapting to Thrive in a New Economy of Memory Abundance", Computer, vol. 48, no. 12, pp. 44-53, Dec. 2015, doi:10.1109/MC.2015.368
Parallel computing platforms

Highest level of parallelism:

• Compute nodes on an interconnection network
• Possibly, thousands of nodes
• Distributed memory
• Distributed or shared address space
• Scalability analysis crucial
Parallel computing platforms (cont.)

A generalized compute node

• (Possibly) multiple CPUs
• (Possibly) multiple GPUs
Parallel computing platforms (cont.)

Shared memory on a node, but Non-Uniform Memory Access (NUMA).
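
On a NUMA node, a core reaches its locally attached memory faster than memory on another socket, so data placement matters even though the address space is shared. A minimal sketch in C with OpenMP, assuming a first-touch page-placement policy (common on Linux); the array size is illustrative:

```c
#include <stdlib.h>

/* First-touch initialization: each thread writes the chunk it will
 * later compute on, so the OS places those pages in that thread's
 * local memory (assumes a first-touch placement policy). */
void numa_friendly_init(double *a, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;                 /* first write places the page */
}

int main(void) {
    long n = 1L << 24;
    double *a = malloc(n * sizeof *a);  /* pages not yet placed */
    numa_friendly_init(a, n);
    /* Later parallel loops using the same static schedule now
     * touch mostly node-local memory. */
    free(a);
    return 0;
}
```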
Parallel computing platforms (cont.)

Nvidia GeForce GTX 680 (Kepler)

• 4 GPCs (graphics processing clusters)
• 8 SMXs (streaming multiprocessors)
• 192 × 8 = 1536 CUDA cores
Parallel hardware hierarchy

Node ensemble → Nodes
  Nodes → CPUs and GPUs
  CPUs → Cores
  GPUs → GPCs → SMXs → CUDA cores
Parallel program hierarchy

Process groups → Processes
  Processes → Thread pools → Threads
  On GPUs: thread grids → thread blocks → threads
Parallel programming paradigms

Distributed Memory
• Multiple processes
• Distributed address space
• Explicit data movement
• Locality!

Shared Memory
• Multiple threads
• Shared address space
• Implicit data movement
• Locality!

Accelerator
• Host memory ↔ PCIe ↔ Device memory
• Explicit data movement
• Locality!
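
To make the paradigms concrete, here is a minimal hedged sketch in C combining MPI (distributed memory: explicit messages between processes) and OpenMP (shared memory: implicit sharing among threads). The accelerator model follows the distributed pattern in spirit: data is explicitly staged from host to device memory over PCIe. Array sizes and values are illustrative only.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Distributed memory: each process has its own copy of 'scale';
     * the value crosses address spaces only via explicit
     * communication (here, a broadcast from rank 0). */
    double scale = (rank == 0) ? 2.0 : 0.0;
    MPI_Bcast(&scale, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Shared memory: the threads of this process all see 'data';
     * no explicit movement, but locality still matters. */
    double data[1000], sum = 0.0;
    for (int i = 0; i < 1000; i++) data[i] = 1.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++)
        sum += scale * data[i];

    printf("rank %d: sum = %.1f\n", rank, sum);
    MPI_Finalize();
    return 0;
}
```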
Algorithm design

• Algorithm design is critical to devising a computational solution
• A serial algorithm is a recipe: a sequence of basic steps or operations
• A parallel algorithm is a recipe for solving the given problem using an ensemble of hardware resources
• Specifying a parallel algorithm involves a lot more than specifying a sequence of basic steps
Parallel algorithm design steps

• Identifying portions of work that can be performed concurrently – Decomposition
• Mapping concurrent pieces of work onto computing agents running in parallel
• Making the input, output, and intermediate data available to the right computing agent at the right time – Data Dependencies
• Managing simultaneous requests for shared data
• Synchronizing computing agents for correct program execution – Task Dependencies
Decomposition for concurrency

Task Decomposition
• Concurrent tasks are identified and mapped onto threads or processes
• Tasks share or exchange data as needed
• May be static or dynamic

Data Decomposition
• Data is partitioned (input, output, or intermediate)
• Partitions are assigned to computing agents
• "Owner computes" rule
• Usually static
Decomposition for concurrency

Data Decomposition + Task Decomposition = Hybrid Decomposition
(Example: sparse matrix factorization)
Task decomposition example

Chess program: one task per piece (R1, K1, B1, Q, P1, P2, ...)

• Each task evaluates all moves of a single piece (branch-and-bound)
• Small data (the board position) can be replicated
• Dynamic load balancing required (see the sketch below)
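
A hedged sketch of this pattern in C with OpenMP: one task per piece, with dynamic scheduling doing the load balancing, since subtree sizes are unpredictable. evaluate_piece and NUM_PIECES are hypothetical placeholders, not part of the original program.

```c
#define NUM_PIECES 16  /* hypothetical: movable pieces of one side */

/* Hypothetical stand-in for the branch-and-bound evaluation of all
 * moves of one piece; real costs vary widely from piece to piece. */
static double evaluate_piece(int piece, const int *board) {
    return (double)(piece + board[0]);   /* dummy score */
}

double best_move_score(const int *board) {
    double best = -1.0e300;
    /* Task decomposition: one concurrent task per piece. The board
     * (small data) is read-only, so it is effectively replicated.
     * schedule(dynamic) rebalances because task costs differ. */
    #pragma omp parallel for schedule(dynamic) reduction(max:best)
    for (int piece = 0; piece < NUM_PIECES; piece++) {
        double s = evaluate_piece(piece, board);
        if (s > best) best = s;
    }
    return best;
}
```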
Data decomposition example

Dense Matrix-Vector Multiplication (y = A·x)

[Figure: the rows of A are partitioned into blocks owned by processes P1 through P9; each process computes the corresponding block of y]
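
Concretely, the owner-computes rule here says that the agent owning row block i produces the corresponding entries of y. A minimal shared-memory sketch in C (row-major A; names and sizes illustrative):

```c
/* Data decomposition of y = A*x by rows ("owner computes"):
 * each thread owns a static block of rows and produces exactly
 * the entries of y for the rows it owns. A is n x n, row-major. */
void matvec_rows(const double *A, const double *x, double *y, int n) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double dot = 0.0;
        for (int j = 0; j < n; j++)
            dot += A[(long)i * n + j] * x[j];
        y[i] = dot;   /* one owner per entry: no write conflicts */
    }
}
```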
Parallel application design guidelines

• Focus on one level of the hierarchy at a time, from top to bottom
• Devise the best decomposition strategy at the given level
• Computing agents are likely to be parallel themselves
• Minimize interactions, synchronization, and data movement among computing agents
• Minimize load imbalance and idling among computing agents
Parallel algorithm analysis

• Serial run time T_S: time required by the best known method on a single computing agent
• Problem size W, the total amount of work: T_S = k·W
• Parallel run time T_P: time elapsed from the start of the computation until the last of the p computing agents finishes
• Overhead, the sum of all wasted compute resources: T_O = p·T_P − T_S
• Speedup, the ratio of serial to parallel time: S = T_S/T_P = p·T_S/(T_S + T_O)
• Efficiency, the fraction of overall time spent doing useful work: E = S/p = T_S/(p·T_P) = T_S/(T_S + T_O)
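
A quick worked example with illustrative numbers (not from the original slides): suppose T_S = 100 s and p = 8 agents give T_P = 16 s. Then

T_O = p·T_P − T_S = 8·16 − 100 = 28 s
S = T_S/T_P = 100/16 = 6.25
E = S/p = 6.25/8 = T_S/(T_S + T_O) = 100/128 ≈ 0.78

so roughly 22% of the aggregate compute time is overhead.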
Isoefficiency function

• The function f_E(p) of the number of computing agents p by which the problem size W must grow in order to maintain a given efficiency E
• Captures the effect of communication, load imbalance, contention, serial bottlenecks, etc.
Scalability analysis

E = S/p = T_S/(p·T_P)

Since T_O = p·T_P − T_S, we have p·T_P = T_S + T_O.

Therefore E = T_S/(T_S + T_O) = k·W/(k·W + T_O), because T_S = k·W.

Solving for W:  W = (E/(k·(1 − E)))·T_O

For a fixed efficiency E, this means W ~ T_O: the problem size must grow in proportion to the overhead.
Scalability analysis: W = O(n^3)

Algorithm A:
  T_P = O(n^3/p) + O(n^2/√p)
  W = O(n^3) ⇒ n^3 ~ n^2·√p ⇒ n ~ √p
  W = O(n^3) = O(p^1.5)

Algorithm B:
  T_P = O(n^3/p) + O(n·√n)
  W = O(n^3) ⇒ n^3 ~ n^1.5·p ⇒ n^1.5 ~ p
  W = O(n^3) = O(p^2)
Parallel algorithm design and analysis

Dense Matrix-Vector Multiplication (1-D decomposition)

[Figure: A is partitioned into row blocks owned by processes P1 through P9; y = A·x is computed block by block]
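
A hedged MPI sketch of the 1-D scheme in C, matching the scalability analysis below: each of the p ranks owns n/p rows of A and n/p entries of x, all-gathers the full x (the t_s·log(p) + t_w·n communication term), then computes its block of y. Assumes n is divisible by p; buffer names are illustrative.

```c
#include <mpi.h>
#include <stdlib.h>

/* 1-D (row-block) parallel y = A*x.
 * A_loc: the (n/p) x n row block owned by this rank (row-major).
 * x_loc: the n/p locally owned entries of x.
 * y_loc: the n/p entries of y this rank computes (owner computes).
 * Assumes n % p == 0. */
void matvec_1d(const double *A_loc, const double *x_loc,
               double *y_loc, int n, MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;

    /* Gather the distributed vector x on every rank: this is the
     * t_s*log(p) + t_w*n communication cost in the analysis. */
    double *x = malloc(n * sizeof *x);
    MPI_Allgather(x_loc, nloc, MPI_DOUBLE,
                  x, nloc, MPI_DOUBLE, comm);

    /* Local computation: n^2/p multiply-adds. */
    for (int i = 0; i < nloc; i++) {
        double dot = 0.0;
        for (int j = 0; j < n; j++)
            dot += A_loc[(long)i * n + j] * x[j];
        y_loc[i] = dot;
    }
    free(x);
}
```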
Parallel algorithm design and analysis

Dense Matrix-Vector Multiplication (2-D decomposition)

[Figure: A is partitioned into a √p × √p grid of blocks among processes P1 through P9; x and y are distributed accordingly]
Scalability analysis of matrix-vector multiplication (1-D decomposition)

T_P = n^2/p + t_s·log(p) + t_w·n
T_O = t_s·p·log(p) + t_w·p·n
W = O(n^2)

Term 1: n^2 ~ p·log(p)
Term 2: n^2 ~ p·n, or n ~ p, so W = O(n^2) = O(p^2)
Scalability analysis of matrix-vector multiplication (2-D decomposition)

T_P = n^2/p + t_s·log(p) + t_w·(n/√p)·log(p)
T_O = t_s·p·log(p) + t_w·n·√p·log(p)
W = O(n^2)

Term 1: n^2 ~ p·log(p)
Term 2: n^2 ~ n·√p·log(p), or n ~ √p·log(p), so W = O(n^2) = O(p·log^2(p))
Isoefficiency function of dense matrix-vector multiplication

1-D decomposition: W ~ p^2
2-D decomposition: W ~ p·log^2(p)

The 2-D decomposition is likely to yield higher speedups, require smaller problems to deliver those speedups, and scale more readily to a larger number of computing agents.
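
For illustration with made-up numbers (log base 2, constants ignored): at p = 64, the 1-D scheme needs W to grow like p^2 = 4096, while the 2-D scheme needs only p·log^2(p) = 64·36 = 2304; at p = 1024 the gap is 1,048,576 versus 102,400, and it keeps widening with p.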
Concluding remarks

• Parallelism is necessary for continued performance improvement
• Complex hierarchy of parallel computing hardware and programming paradigms
• Systematic top-down parallel application design
• Decomposition strategy is critical
• Analysis is important to understand scalability
Thank you!
