
High Performance Computing (HPC)
Lecture 2

By: Dr. Maha Dessokey


Agenda

• Parallel Computer Memory Architectures
• Multithreading vs. Multiprocessing
• Designing Parallel Programs
• HPC Cluster Architecture

Parallel Computer Memory Architectures

• Shared Memory
  All processors access all memory as a single global address space.
  Data sharing is fast.
  There is a lack of scalability between memory and CPUs.
Parallel Computer Memory Architectures (Contd.)

• Shared Memory
  Advantages:
  - Global address space provides a user-friendly programming perspective to memory.
  - Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs.
  Disadvantages:
  - Lack of scalability between memory and CPUs.
  - The programmer is responsible for the synchronization constructs that ensure "correct" access of global memory.
  - Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.
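A rough illustration of the shared-memory model is sketched below using OpenMP (assuming a compiler with OpenMP support, e.g. gcc -fopenmp): every thread sees the same array, and the reduction clause supplies the synchronization the programmer is responsible for.

    /* Minimal shared-memory sketch with OpenMP: all threads share a[],
       and reduction(+:sum) synchronizes the concurrent updates to sum. */
    #include <stdio.h>

    int main(void)
    {
        double a[1000], sum = 0.0;

        for (int i = 0; i < 1000; i++)          /* initialize shared data */
            a[i] = 0.5 * i;

        #pragma omp parallel for reduction(+:sum)   /* threads share a[] */
        for (int i = 0; i < 1000; i++)
            sum += a[i];

        printf("sum = %f\n", sum);
        return 0;
    }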
Parallel Computer Memory Architectures (Contd.)

• Distributed Memory
  Each processor has its own memory.
  It is scalable, with no overhead for cache coherency.
  The programmer is responsible for many details of communication between processors.
Parallel Computer Memory Architectures (Contd.)

• Distributed Memory
  Advantages:
  - Memory is scalable with the number of processors.
  - Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency.
  - Cost effectiveness: can use commodity, off-the-shelf processors and networking.
  Disadvantages:
  - The programmer is responsible for many of the details associated with data communication between processors.
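A minimal sketch of the distributed-memory model, assuming an MPI installation and at least two ranks (e.g. mpirun -np 2): each rank owns its own memory, so data must be moved with explicit messages. The value 42 and the ranks used are illustrative.

    /* Each MPI rank has private memory; rank 0 sends one value to rank 1. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }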
Agenda

• Parallel Computer Memory Architectures
• Multithreading vs. Multiprocessing
• Designing Parallel Programs
• HPC Cluster Architecture

Multithreading vs. Multiprocessing

• Threads share the same process memory space and global variables between routines.
• A process is "heavyweight": a completely separate program with its own variables, stack, and memory allocation.
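A small POSIX sketch (assuming a Unix-like system, compiled with -pthread; error handling omitted) contrasting the two: a thread's write to a global variable is visible to the whole process, while a fork()ed child only changes its own copy.

    /* Thread vs. process: shared memory vs. separate copies of a global. */
    #include <stdio.h>
    #include <pthread.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int shared = 0;                               /* global variable */

    static void *bump(void *arg) { shared = 1; return NULL; }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, bump, NULL);
        pthread_join(t, NULL);
        printf("after thread: shared = %d\n", shared);  /* prints 1 */

        shared = 0;
        pid_t pid = fork();
        if (pid == 0) { shared = 1; _exit(0); }         /* child changes its own copy */
        waitpid(pid, NULL, 0);
        printf("after fork:   shared = %d\n", shared);  /* still 0 in the parent */
        return 0;
    }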
Agenda

• Parallel Computer Memory Architectures
• Multithreading vs. Multiprocessing
• Designing Parallel Programs
• HPC Cluster Architecture

Designing Parallel Programs

1. Understand the Problem and the Program
2. Partitioning
3. Communication and Data Dependencies
4. Mapping
1- Understand the Problem and the Program

• Understand the problem you want to solve in parallel, including any existing serial code, if applicable.
• Before developing a parallel solution, confirm that the problem can actually be parallelized.
Examples of non-parallelizable problems

• Sequential Dependency Problems
  Calculation of the Fibonacci series (1,1,2,3,5,8,13,21,...) by use of the formula:
  F(k + 2) = F(k + 1) + F(k)
  (see the sketch after this list)
• Input/Output Bound Tasks
  File Compression/Decompression: if the process requires sequentially reading or writing data, it cannot be effectively parallelized.
• Dynamic Programming Problems with Dependencies
  Knapsack Problem: the optimal solution for one subproblem may depend on the solutions to other subproblems in a specific order.
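A minimal sketch of the sequential dependency in the Fibonacci example: each iteration needs the two previous results, so the iterations cannot run independently in parallel. (The array size and printout are illustrative.)

    /* Loop-carried dependence: iteration k+2 needs the results of k and k+1. */
    #include <stdio.h>

    int main(void)
    {
        long f[30] = {1, 1};
        for (int k = 0; k + 2 < 30; k++)
            f[k + 2] = f[k + 1] + f[k];    /* depends on earlier iterations */
        printf("F(30) = %ld\n", f[29]);
        return 0;
    }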
Embarrassingly Parallel Computations

• A computation that can obviously be divided into a number of completely independent parts, each of which can be executed by a separate process(or).
• No communication, or very little communication, between processes: each process can do its tasks without any interaction with other processes.
Embarrassingly Parallel Computations (Contd.)

• Practical embarrassingly parallel computation with static process creation and a master-slave approach (the MPI approach).
Embarrassingly Parallel Computation Examples

• Low-level image processing
  x and y are the original coordinates; x' and y' are the new coordinates.
  Shifting: object shifted by Dx in the x-dimension and Dy in the y-dimension:
    x' = x + Dx,  y' = y + Dy
  Scaling: object scaled by a factor Sx in the x-direction and Sy in the y-direction:
    x' = x * Sx,  y' = y * Sy
  Rotation: object rotated through an angle θ about the origin of the coordinate system:
    x' = x cos θ + y sin θ,  y' = -x sin θ + y cos θ
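A sketch of why such low-level image operations are embarrassingly parallel, here using the shifting formula with illustrative image dimensions and an OpenMP parallel loop: each pixel's new coordinates depend only on its own old coordinates, so no communication is needed.

    /* Every (x, y) is transformed independently; the loops can be split
       across any number of threads or processes with no interaction. */
    #include <stdio.h>

    #define WIDTH  640
    #define HEIGHT 480

    int main(void)
    {
        int dx = 10, dy = 5;                    /* illustrative shift */

        #pragma omp parallel for collapse(2)
        for (int y = 0; y < HEIGHT; y++)
            for (int x = 0; x < WIDTH; x++) {
                int xp = x + dx;                /* x' = x + Dx */
                int yp = y + dy;                /* y' = y + Dy */
                (void)xp; (void)yp;             /* a real code would write the output image */
            }

        printf("done\n");
        return 0;
    }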
Identify the program's hotspots

• Know where most of the real work is being done. The majority of scientific and technical programs usually accomplish most of their work in a few places (functions).
• Profilers and performance analysis tools can help here.
• Focus on parallelizing the hotspots and ignore those sections of the program that account for little CPU usage.
Identify bottlenecks in the program

• Are there areas that are disproportionately slow, or that cause parallelizable work to halt or be deferred? For example, I/O is usually something that slows a program down.
• It may be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessarily slow areas.
Other considerations

• Identify inhibitors to parallelism. One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence above.
• Investigate other algorithms if possible. This may be the single most important consideration when designing a parallel application.
Designing Parallel Programs

1. Understand the Problem and the Program
2. Partitioning
3. Communication and Data Dependencies
4. Mapping
2- Designing Parallel Programs - Partitioning

• Breaking the problem into discrete "chunks" of work that can be distributed to multiple tasks. This is known as decomposition or partitioning.
• There are two basic ways to partition computational work among parallel tasks:
  - Domain Decomposition
  - Functional Decomposition
Partitioning - Domain Decomposition

• Domain Decomposition: in this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data.
Partitioning - Domain Decomposition (contd.)

• There are different ways to partition the data.
Partitioning - Functional Decomposition

• The focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.
Partitioning Examples

• Operations on sequences of numbers, such as simply adding them together (n = number of elements, p = number of processors); a sketch follows below.
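A sketch of the domain-decomposed sum, assuming MPI: each of the p ranks sums its own block of roughly n/p elements, and the partial results are combined with MPI_Reduce. The data values are placeholders.

    /* Block decomposition of a sum of N numbers over p MPI ranks. */
    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* Rank r owns indices [lo, hi). */
        long lo = (long)N * rank / p;
        long hi = (long)N * (rank + 1) / p;

        double local = 0.0, total = 0.0;
        for (long i = lo; i < hi; i++)
            local += (double)i;                  /* stand-in for the real data */

        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %.0f\n", total);

        MPI_Finalize();
        return 0;
    }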
Partitioning Examples (contd.)

• Ecosystem Modeling
  Each program calculates the population of a given group, where each group's growth depends on that of its neighbors. As time progresses, each process calculates its current state, then exchanges information with the neighbor populations. All tasks then progress to calculate the state at the next time step.
Designing Parallel Programs

1. Understand the Problem and the Program
2. Partitioning
3. Communication and Data Dependencies
4. Mapping
3- Designing Parallel Programs - Communications

• No Communication Needed
  Some problems can be executed in parallel with minimal data sharing. These are known as embarrassingly parallel problems due to their simplicity and minimal inter-task communication.
• Communication Required
  Most parallel applications are not quite so simple, and do require tasks to share data with each other. For instance, in a 3-D heat diffusion problem, each task needs temperature information from neighboring tasks, as changes in neighboring data directly impact its own results.
Communications - Factors to Consider

1- Cost of communications
• Overhead: inter-task communication consumes machine cycles and resources that could be used for computation.
• Synchronization: communication often requires synchronization, causing tasks to spend time waiting instead of working.
• Bandwidth Saturation: competing communication traffic can saturate network bandwidth, worsening performance issues.
Communications - Factors to Consider (contd.)

2- Key Communication Metrics
• Latency: the time to send a minimal message (0 bytes) from point A to point B, typically measured in microseconds.
• Bandwidth: the amount of data transmitted per unit of time, commonly expressed in megabytes per second.
Sending many small messages makes latency dominate, so it is more efficient to combine them into larger messages to increase the effective communication bandwidth.
Communications - Factors to Consider (contd.)

3- Communication Types
• Synchronous Communication: requires "handshaking" between tasks, either explicitly coded or handled at a lower level. It is called blocking communication because other work must wait until the communication has completed.
• Asynchronous Communication: allows tasks to transfer data independently. For instance, task 1 can send a message to task 2 and continue working without waiting for the data to be received. This is known as non-blocking communication, since other work can proceed in the meantime.
The main advantage of asynchronous communication is the ability to interleave computation with communication, maximizing efficiency; a sketch contrasting the two styles follows below.
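A sketch of the two styles in MPI (ranks, tag, and payload are illustrative): rank 1 uses a blocking MPI_Recv, while rank 0 posts a non-blocking MPI_Isend so it can keep computing before completing the transfer with MPI_Wait.

    /* Blocking vs. non-blocking point-to-point communication in MPI. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, data = 0;
        MPI_Request req;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            data = 7;
            MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            /* ... rank 0 can keep computing while the message is in flight ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);          /* complete the send */
        } else if (rank == 1) {
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* blocking */
            printf("rank 1 got %d\n", data);
        }

        MPI_Finalize();
        return 0;
    }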
Communications - Factors to Consider (contd.)

4- Scope of communications
Identifying which tasks need to communicate is crucial during the design of parallel code. Both scopes below can be implemented either synchronously or asynchronously:
• Point-to-Point: involves two tasks, with one acting as the sender (producer) and the other as the receiver (consumer).
• Collective: involves data sharing among multiple tasks, typically organized into a common group or collective.

Collective Communications Example
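A minimal MPI sketch of collective communication (the broadcast value and the per-rank work are illustrative): rank 0 broadcasts a value to every task, and the results are then combined with MPI_Reduce.

    /* Collective operations: MPI_Bcast to all ranks, MPI_Reduce back to rank 0. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, n = 0, sum = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) n = 10;
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);        /* every rank gets n */

        int mine = rank * n;                                 /* stand-in work */
        MPI_Reduce(&mine, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("reduced sum = %d\n", sum);

        MPI_Finalize();
        return 0;
    }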
Designing Parallel Programs - Data Dependencies

• A dependence exists between program statements when the order of statement execution affects the results of the program.
• A data dependence results from multiple uses of the same location(s) in storage by different tasks.
• Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism.
Designing Parallel Programs - Data Dependencies

Example: computing [B] = [A][B]

  [B new] = [A][B old]:   B_new(i,j) = Σ_{k=1..n} A(i,k) * B_old(k,j)

Every new element of [B] depends on the old values of [B], so the old matrix must be preserved while the new one is computed, for example by accumulating into a temporary:

  Temp(i,j) = Σ_{k=1..n} A(i,k) * B(k,j)
  [B new] = Temp
Designing Parallel Programs - Data Dependencies (contd.)

[Figure: dependence on a 2-D grid — the value at element (i,j) depends on neighboring elements such as (i-1,j), (i,j-1), and (i,j+1).]
Designing Parallel Programs - Data Dependencies (contd.)

How to Handle Data Dependencies?
• Distributed memory architectures: communicate required data at synchronization points.
• Shared memory architectures: synchronize read/write operations between tasks (see the sketch below).
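A minimal shared-memory sketch using OpenMP (assuming gcc -fopenmp): without the atomic directive, the concurrent updates to the shared counter would race, so the read/write is synchronized.

    /* Synchronizing conflicting updates to shared data on a shared-memory machine. */
    #include <stdio.h>

    int main(void)
    {
        long hits = 0;

        #pragma omp parallel for
        for (long i = 0; i < 1000000; i++) {
            #pragma omp atomic          /* serialize the conflicting update */
            hits++;
        }

        printf("hits = %ld\n", hits);
        return 0;
    }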
Designing Parallel Programs

1. Understand the Problem and the Program
2. Partitioning
3. Communication and Data Dependencies
4. Mapping
4- Designing Parallel Programs - Mapping

1- Load balancing
• Used to distribute computations fairly across processors in order to obtain the highest possible execution speed.
• Means distributing work among tasks so that all tasks are kept busy all of the time.
Mapping - Load balancing

• Imperfect load balancing leads to increased execution time.
• Perfect load balancing keeps all tasks busy for the full execution.
Mapping - Load balancing

How to Achieve Load Balance?
(1) Equally partition the work each task receives.
(2) Use dynamic work assignment (see the sketch below).
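A sketch of dynamic work assignment in the master/worker style with MPI (the task count, tags, and the -1 "stop" sentinel are illustrative choices): rank 0 hands out one task index at a time, so faster workers automatically receive more work.

    /* Master/worker dynamic load balancing over MPI. */
    #include <stdio.h>
    #include <mpi.h>

    #define NTASKS 100

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                                    /* master */
            int next = 0, done = 0;
            MPI_Status st;
            while (done < size - 1) {
                int dummy;
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
                int task = (next < NTASKS) ? next++ : -1;   /* -1 = no more work */
                if (task < 0) done++;
                MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
            }
        } else {                                            /* worker */
            int task, request = 0;
            for (;;) {
                MPI_Send(&request, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* ask for work */
                MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (task < 0) break;
                /* ... process task here ... */
            }
        }

        MPI_Finalize();
        return 0;
    }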
Agenda

• Parallel Computer Memory Architectures
• Multithreading vs. Multiprocessing
• Designing Parallel Programs
• HPC Cluster Architecture

HPC Platforms

• Vertical Scaling (scale up): installing more processors, more memory, and faster hardware in a single machine (e.g., supercomputers, FPGA-accelerated systems).
• Horizontal Scaling (scale out): multiple independent machines are added together (e.g., MPI-based clusters).
Vertical vs. Horizontal HPC platforms

Vertical HPC Platforms
• Integration: components (CPUs, memory, storage) are tightly integrated within a single system, which can lead to lower latency and higher bandwidth for data transfers.
• Scalability: performance is achieved by adding more powerful components (e.g., more CPUs or GPUs) within the same system, allowing for significant performance gains without the complexities of inter-node communication.
• Efficiency: often optimized for specific tasks, which can lead to better overall performance for those tasks due to reduced overhead in communication and resource management.
Vertical vs. Horizontal HPC platforms

Horizontal HPC Platforms
• Distributed Architecture: comprises many individual nodes (often commodity hardware) connected via a network. Each node operates independently, which can introduce latency in communication.
• Scalability: can scale out by adding more nodes, allowing for potentially limitless growth in computational power, but performance gains may be limited by network bandwidth and latency.
• Flexibility: more adaptable to different workloads and able to use a wider range of hardware, but may require more complex resource management and optimization.
How to measure computer performance?

• What do we mean by "performance"?
  For scientific and technical programming, use FLOPS: FLoating point OPerations per Second.
  Examples of floating-point operations: 1.324398404 + 3.6287414 = ?   2.365873534 * 2443.3147 = ?
• Modern supercomputers are measured in PFLOPS (PetaFLOPS).
  Kilo, Mega, Giga, Tera, Peta, Exa = 10^3, 10^6, 10^9, 10^12, 10^15, 10^18
How to measure computer performance?

• Floating-point operations per second:
  FLOPS = nodes × cores/node × cycles/second × FLOPs/cycle
• The 3rd term, clock cycles per second, is the clock frequency, typically 2–3 GHz.
• The 4th term, FLOPs per cycle, is how many floating-point operations are done in one clock cycle.
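For example, a hypothetical cluster with 100 nodes, 32 cores per node, a 2.5 GHz clock, and 16 FLOPs per cycle (illustrative numbers, not a specific machine) has a theoretical peak of 100 × 32 × 2.5×10^9 × 16 = 1.28×10^14 FLOPS, i.e. 128 TFLOPS.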
HPC - Benchmarking

• The LINPACK Benchmarks are a measure of a system's floating-point computing power.
• The aim is to approximate how fast a computer will perform when solving real problems.
• The peak performance is the maximal theoretical performance a computer can achieve. The actual performance will always be lower than the peak performance.
HPC - Benchmarking

• HPL is a portable implementation of LINPACK written in C. Originally intended as a guideline, it is now widely used to provide data for the TOP500 list, though other technologies and packages can be used. HPL generates a linear system of equations of order n and solves it using LU decomposition with partial row pivoting. It requires installed implementations of MPI and either BLAS or VSIPL to run.
• Rmax - maximal LINPACK performance achieved (actual)
• Rpeak - theoretical peak performance


Top 500 Supercomputers

June 2024 | TOP500

Rank: 1
System: Frontier - HPE Cray EX235a (HPE)
Site: DOE/SC/Oak Ridge National Laboratory, United States
Cores: 8,699,904
Rmax: 1,206.00 PFlop/s
Rpeak: 1,714.81 PFlop/s
Power: 22,786 kW
HPC Cluster Architecture

HPC cluster components
• Nodes: individual computers in the cluster.
• Cores (threads): individual processing units available within each CPU of each node.
  e.g., a node with eight "quad"-core CPUs = 32 cores for that node.
• Shared disk: storage that can be shared (and accessed) by all nodes.
Questions?
