High Performance Computing (HPC)
Lecture 2
Shared Memory
All processors access all memory as a single global address
space.
Data sharing is fast.
Lack of scalability between memory and CPUs
Parallel Computer Memory Architectures
(Contd.)
Shared Memory
Advantages:
Global address space provides a user-friendly programming perspective to
memory
Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs
Disadvantages:
Lack of scalability between memory and CPUs
Programmer responsibility for synchronization constructs that ensure "correct"
access of global memory.
Expense: it becomes increasingly difficult and expensive to design and produce
shared memory machines with ever increasing numbers of processors.
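A minimal OpenMP sketch of this model (illustrative, not from the lecture; it assumes a C compiler with OpenMP support, e.g. gcc -fopenmp). All threads see the same array in one global address space, and the reduction clause is the synchronization construct that keeps the shared sum correct:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];          /* lives in the single global address space */
    double sum = 0.0;            /* shared variable updated by every thread  */

    for (int i = 0; i < N; i++)  /* initialize the shared data               */
        a[i] = 1.0;

    /* Every thread sees the same array 'a'; no explicit data movement is
       needed. The reduction clause synchronizes the concurrent updates to
       'sum' so the result is correct.                                       */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```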
Parallel Computer Memory Architectures
(Contd.)
Distributed Memory
Each processor has its own memory.
Memory is scalable, and there is no overhead for maintaining cache coherency.
Programmer is responsible for many details of
communication between processors.
Parallel Computer Memory Architectures
(Contd.)
Distributed Memory
Advantages:
Memory is scalable with number of processors
Each processor can rapidly access its own memory
without interference and without the overhead
incurred with trying to maintain cache coherency.
Cost effectiveness: can use commodity, off-the-shelf
processors and networking.
Disadvantages:
The programmer is responsible for many of the details
associated with data communication between
processors.
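A minimal MPI sketch of the distributed-memory model (illustrative; it assumes an MPI library and mpicc). Each process owns its own memory, so data must be moved with explicit send/receive calls written by the programmer:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank * 10.0;   /* each process has its own private memory */

    if (rank == 0 && size > 1) {
        double remote;
        /* Rank 0 cannot see rank 1's memory; the value must be communicated. */
        MPI_Recv(&remote, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 received %f from rank 1\n", remote);
    } else if (rank == 1) {
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```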
Agenda
Know where most of the real work is being done. The majority of
scientific and technical programs usually accomplish most of their
work in a few places (functions).
Profilers and performance analysis tools can help here
Focus on parallelizing the hotspots and ignore those sections of the
program that account for little CPU usage.
Identify bottlenecks in the program
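As a hedged illustration of the hotspot idea, the toy program below spends essentially all of its time in one function; a profiler such as gprof (compile with -pg, run, then inspect the flat profile) would point there, and only that loop would be worth parallelizing. The function names and sizes are made up for the example:

```c
#include <stdio.h>

/* Hotspot: a tight numerical loop that dominates the runtime. */
static double heavy_kernel(long n) {
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / ((double)i * (double)i);
    return s;
}

/* Cheap bookkeeping: a negligible share of the runtime. */
static void report(double value) {
    printf("result = %.12f\n", value);
}

int main(void) {
    /* A profiler's flat profile would attribute almost all cycles to
       heavy_kernel(), so parallelization effort should focus there.
       Assumed toolchain: gcc -pg hotspot.c -o hotspot && ./hotspot
                          && gprof ./hotspot gmon.out                  */
    report(heavy_kernel(200000000L));
    return 0;
}
```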
Functional Decomposition
Partitioning-Domain Decomposition
Domain Decomposition
In this type of partitioning, the
data associated with a
problem is decomposed. Each
parallel task then works on a
portion of the data.
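A small sketch of domain decomposition (array size and task count are illustrative): the data array is split into contiguous blocks and each task operates only on its own block:

```c
#include <stdio.h>

#define N 100        /* total number of data points             */
#define NTASKS 4     /* number of parallel tasks (illustrative) */

/* Compute the contiguous block [start, end) owned by one task. */
static void my_block(int task, int ntasks, int n, int *start, int *end) {
    int chunk = n / ntasks, rem = n % ntasks;
    *start = task * chunk + (task < rem ? task : rem);
    *end   = *start + chunk + (task < rem ? 1 : 0);
}

int main(void) {
    for (int task = 0; task < NTASKS; task++) {
        int start, end;
        my_block(task, NTASKS, N, &start, &end);
        /* Each task would apply the same operations to its own slice only. */
        printf("task %d works on elements [%d, %d)\n", task, start, end);
    }
    return 0;
}
```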
Partitioning-Domain Decomposition
(contd.)
Ecosystem Modeling
Each task calculates the population of a given group, where each group's growth depends on that of its neighbors. As time progresses, each task calculates its current state, then exchanges information with its neighbors. All tasks then progress to calculate the state at the next time step.
Designing Parallel Programs
No Communication Needed
Some problems can be executed in parallel with minimal data
sharing. These are known as embarrassingly parallel problems due
to their simplicity and minimal inter-task communication.
Communication Required
Most parallel applications are not quite so simple, and do require
tasks to share data with each other. For instance, in a 3-D heat
diffusion problem, each task needs temperature information from
neighboring tasks, as changes in neighboring data directly impact
its own results.
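A hedged 1-D sketch of that neighbor exchange using MPI (the real problem is 3-D; one slab per process is kept here for brevity, and all sizes are illustrative). Each process stores its temperatures plus two halo cells, swaps boundary values with its neighbors, then updates its interior points:

```c
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 8   /* interior points owned by each process (illustrative) */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[LOCAL_N+1] are halo cells filled from the neighbors. */
    double u[LOCAL_N + 2], unew[LOCAL_N + 2];
    for (int i = 0; i < LOCAL_N + 2; i++) u[i] = (double)rank;

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Exchange boundary temperatures with both neighbors. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Each interior update needs the neighbor values just received. */
    for (int i = 1; i <= LOCAL_N; i++)
        unew[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1];

    if (rank == 0) printf("one time step completed on %d processes\n", size);
    MPI_Finalize();
    return 0;
}
```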
Communications- Factors to Consider
1- Cost of communications
Overhead: Inter-task communication consumes
machine cycles and resources that could be used for
computation.
Synchronization: Communication often requires
synchronization, causing tasks to spend time waiting
instead of working.
Bandwidth Saturation: Competing communication
traffic can saturate network bandwidth, worsening
performance issues.
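One way to see this overhead directly is a ping-pong timing sketch (illustrative; MPI assumed, run with at least two processes): two processes bounce a message back and forth and time it with MPI_Wtime, and every cycle spent here is communication cost rather than computation:

```c
#include <mpi.h>
#include <stdio.h>

#define NBYTES 1048576   /* 1 MB message (illustrative size) */
#define REPS   100

int main(int argc, char **argv) {
    int rank;
    static char buf[NBYTES];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    /* Time spent here is pure communication overhead: no useful computation. */
    if (rank == 0)
        printf("avg round trip: %.6f s for %d bytes\n", elapsed / REPS, NBYTES);

    MPI_Finalize();
    return 0;
}
```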
Communications- Factors to Consider
(contd.)
3-Communication Types
Synchronous Communication: Requires "handshaking" between tasks,
either explicitly coded or handled at a lower level. It is called blocking
communication because other work must wait until the communication has completed.
Asynchronous Communication: Allows tasks to transfer data
independently. For instance, task 1 can send a message to task 2 and
continue working without waiting for the data to be received. This is
known as non-blocking communication, as other work can proceed
simultaneously.
The main advantage of asynchronous communication is the ability to
interleave computation with communication, maximizing efficiency.
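A hedged MPI sketch of the two styles: MPI_Recv blocks until the data has arrived, while MPI_Isend returns immediately so computation can be interleaved before MPI_Wait confirms completion:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, msg = 42, recv_buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Request req;
        double work = 0.0;
        /* Asynchronous (non-blocking) send: the call returns immediately.  */
        MPI_Isend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* Useful computation interleaved with the message transfer.        */
        for (int i = 1; i <= 1000000; i++) work += 1.0 / i;
        /* The send buffer may only be reused after the request completes.  */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 0 overlapped work = %f\n", work);
    } else if (rank == 1) {
        /* Blocking receive: does not return until the data has arrived.    */
        MPI_Recv(&recv_buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", recv_buf);
    }

    MPI_Finalize();
    return 0;
}
```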
Communications- Factors to Consider
(contd.)
4-Scope of communications
Identifying which tasks need to communicate is crucial during
the design of parallel code. The two types of communication
can be implemented either synchronously or asynchronously:
Point-to-Point: Involves two tasks, with one acting as the
sender (producer) and the other as the receiver (consumer).
Collective: Involves data sharing among multiple tasks,
typically organized into a common group or collective.
Collective Communications Example
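An illustrative MPI sketch of two common collective operations: a broadcast sends one parameter from the root to every task in the group, and a reduction combines each task's partial result back on the root:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) n = 1000;                        /* parameter known only to root */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* now every task has it        */

    /* Each task computes a partial sum over its own share of 1..n. */
    long partial = 0, total = 0;
    for (int i = rank + 1; i <= n; i += size) partial += i;

    /* Collective reduction combines all partial sums on rank 0. */
    MPI_Reduce(&partial, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum 1..%d = %ld\n", n, total);
    MPI_Finalize();
    return 0;
}
```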
Designing Parallel Programs - Data
Dependencies
Updating a matrix in place, [B] = [A][B], creates a data dependency: each new element needs the old values of [B], which would be overwritten as the computation proceeds. The product is therefore computed into a temporary first:

$Temp_{i,j} = \sum_{k=1}^{n} A_{i,k}\, B_{k,j}$   (Temp = [A][B], using only the old values of B)

$B^{new}_{i,j} = \sum_{k=1}^{n} A_{i,k}\, B^{old}_{k,j}$   (equivalently, [B new] = [A][B old])

[B new] = Temp, and [B new] then replaces [B] for the next step.
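A serial C sketch of this update (sizes illustrative): the product is accumulated into Temp while [B] is only read, so every element uses the old values, and [B] is overwritten only afterwards:

```c
#include <stdio.h>

#define N 3

int main(void) {
    double A[N][N], B[N][N], Temp[N][N];

    /* Fill A and B with some illustrative values. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (i == j) ? 2.0 : 0.0;
            B[i][j] = i + j;
        }

    /* Temp(i,j) = sum_k A(i,k) * B_old(k,j): B is only read here, so the
       old values are used everywhere.                                     */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            Temp[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                Temp[i][j] += A[i][k] * B[k][j];
        }

    /* B_new = Temp: only now is B overwritten. Updating B in place inside
       the loop above would mix old and new values (a data dependency).    */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            B[i][j] = Temp[i][j];

    printf("B[0][0] = %f\n", B[0][0]);
    return 0;
}
```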
Designing Parallel Programs - Data
Dependencies (contd.)
[Figure: 2-D grid along the i and j axes; the value at point (i,j) depends on its neighboring points (i-1,j), (i+1,j), (i,j-1) and (i,j+1).]
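A serial sketch of that dependency pattern (grid size illustrative): each new value at (i,j) reads the old values of its four neighbors, which is why tasks owning adjacent subdomains must exchange boundary data in a parallel version:

```c
#include <stdio.h>

#define NX 6
#define NY 6

int main(void) {
    double u[NX][NY] = {{0}}, unew[NX][NY] = {{0}};
    u[NX / 2][NY / 2] = 1.0;          /* an initial hot spot (illustrative) */

    /* Jacobi-style update: (i,j) depends on (i-1,j), (i+1,j), (i,j-1), (i,j+1). */
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                 u[i][j - 1] + u[i][j + 1]);

    printf("unew next to the hot spot: %f\n", unew[NX / 2 - 1][NY / 2]);
    return 0;
}
```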
Designing Parallel Programs - Load Balancing
1- Load balancing
Load balancing is used to distribute computations fairly across processors in order to obtain the highest possible execution speed.
It means distributing work among tasks so that all tasks are kept busy all of the time (see the sketch below).
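A small OpenMP sketch of one load-balancing technique, dynamic scheduling (work sizes are made up): the items have very uneven cost, and schedule(dynamic) hands them out as threads become free so that no thread sits idle:

```c
#include <stdio.h>
#include <omp.h>

#define NITEMS 64

/* Work items with very uneven cost: a static split would leave some
   threads idle while others are still busy.                          */
static double uneven_work(int item) {
    long iters = 1000L * (item % 8 + 1) * (item % 8 + 1);
    double s = 0.0;
    for (long i = 1; i <= iters; i++) s += 1.0 / i;
    return s;
}

int main(void) {
    double total = 0.0;

    /* schedule(dynamic) hands out items one at a time, so threads that
       finish early immediately pick up more work: a simple form of
       load balancing that keeps all tasks busy.                        */
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int item = 0; item < NITEMS; item++)
        total += uneven_work(item);

    printf("total = %f (threads: %d)\n", total, omp_get_max_threads());
    return 0;
}
```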
Mapping- Load balancing
Rank:   1
System: Frontier - HPE Cray EX235a, HPE
Site:   DOE/SC/Oak Ridge National Laboratory, United States
Cores:  8,699,904
Rmax:   1,206.00 PFlop/s
Rpeak:  1,714.81 PFlop/s
Power:  22,786 kW
HPC Cluster Architecture
HPC cluster components