Module 1-Topic 1

hpc

Uploaded by

krishna teja

HIGH PERFORMANCE COMPUTING
Module 1-Contents
Topics :
1. High-performance Computing Disciplines
2. Impact of Supercomputing in Science and Security
3. Anatomy of a Supercomputer
4. Computer Performance
5. A Brief History of Supercomputing
Topic 1: High Performance Computing Disciplines

1.1 Definition
1.2 Application Programs
1.3 Performance and Metrics
1.4 High Performance Computing Systems
1.5 Supercomputing Problems
1.6 Application Programming

1.1 High Performance Computing (HPC) or Supercomputers -
Definition
• To solve large, complex problems by:
• Breaking down the computational portions of the algorithm into concurrent
instructions and deploying a number of processors to work in parallel.

• Parallel computers (multicore) -> cluster -> a group of clusters -> HPC
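As a minimal sketch of this decomposition (using Python's standard library; the chunk size and worker count are illustrative choices, not properties of any particular system), a large sum can be broken into concurrent partial sums:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Worker task: sum one slice of the data independently.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Break the computation into concurrent pieces: split the data into
    # roughly equal chunks, sum each chunk in its own worker, then
    # combine the partial results.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum, chunks)
    return sum(partials)
```

The final combining step mirrors the "reduction" stage found in many parallel algorithms: independent workers compute in parallel, then a cheap serial step merges their results.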

• Parallel computing runs multiple tasks simultaneously on
numerous computer servers or processors.
• HPC uses massively parallel computing, which uses tens of
thousands to millions of processors or processor cores.
• An HPC cluster comprises multiple high-speed computer
servers networked with a centralized scheduler that manages
the parallel computing workload.
• The computers, called nodes, use either high-performance
multi-core CPUs or—more likely today—GPUs, which are well
suited for rigorous mathematical calculations, machine
learning (ML) models and graphics-intensive tasks.
• A single HPC cluster can include 100,000 or more nodes.

• All the other computing resources in an HPC cluster—such
as networking, memory, storage and file systems—are high speed and
high throughput. They are also low-latency components that can keep
pace with the nodes and optimize the computing power and
performance of the cluster.
• HPC workloads rely on a message passing interface (MPI), a standard
library and protocol for parallel programming that allows
processes to communicate between nodes in a cluster or across a
network.
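Production MPI codes link against an MPI implementation such as MPICH or Open MPI. As a dependency-free stand-in, this sketch mimics MPI's blocking send/receive pattern between two "ranks" with Python threads and queues; the rank functions, message values, and queue transport are all illustrative:

```python
import queue
import threading

def rank0(to_peer, from_peer, results):
    # Loosely analogous to MPI_Send followed by MPI_Recv on rank 0.
    to_peer.put(21)                      # send a value to rank 1
    results["reply"] = from_peer.get()   # block until rank 1 replies

def rank1(to_peer, from_peer):
    # Rank 1 receives, transforms, and sends the result back.
    msg = from_peer.get()
    to_peer.put(msg * 2)

def run_ping_pong():
    # One queue per direction stands in for the cluster interconnect.
    q01, q10 = queue.Queue(), queue.Queue()
    results = {}
    t0 = threading.Thread(target=rank0, args=(q01, q10, results))
    t1 = threading.Thread(target=rank1, args=(q10, q01))
    t0.start(); t1.start()
    t0.join(); t1.join()
    return results["reply"]
```

The blocking `get()` calls play the role of MPI's synchronous receives: each rank waits until its peer's message arrives before proceeding.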

1.2 Application Programs - SAMPLE APPLICATION AREAS OF HPC

(Figure slides showing sample application areas of HPC)
Addressing the Big Questions
• What grand challenge applications demand these capabilities?
• How do users program such systems?
– What languages and in what environments?
– What are the semantics and strategies?
• How to manage supercomputer resources to deliver useful
computing capabilities?
– What are the hardware mechanisms?
– What are the software policies?
• What are the computational models and algorithms that can map the
innate application properties to the physical medium of the
machine?
• How to integrate enabling technologies into computing engines?
• How to push the performance to extremes?
– What are the enabling conditions?
– What are the inhibiting factors?
1.3 Performance Metrics

• The most widely used metric in HPC is FLOPS (floating-point operations per second).
• Gigaflops (GFLOPS) = 1E9 flops (10^9): one billion operations per second.
• Teraflops (TFLOPS) = 1E12 flops (10^12): one trillion (1,000,000,000,000) operations
per second.
• Petaflops (PFLOPS) = 1E15 flops (10^15): one quadrillion operations per second.
1 PFLOPS is 1,000 TFLOPS.
• Exaflops (EFLOPS) = 1E18 flops (10^18): one quintillion (1,000,000,000,000,000,000)
operations per second. 1 EFLOPS is 1,000 PFLOPS.
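These prefixes are plain powers of ten, which a small helper makes concrete (the table and function names here are only for illustration):

```python
# Powers of ten for the common FLOPS prefixes.
FLOPS_UNITS = {
    "GFLOPS": 1e9,   # gigaflops
    "TFLOPS": 1e12,  # teraflops
    "PFLOPS": 1e15,  # petaflops
    "EFLOPS": 1e18,  # exaflops
}

def to_flops(value, unit):
    # Convert a prefixed rate such as 17.6 PFLOPS into plain FLOPS.
    return value * FLOPS_UNITS[unit]
```

Each step down the table is a factor of 1,000, matching the 1 PFLOPS = 1,000 TFLOPS and 1 EFLOPS = 1,000 PFLOPS relationships above.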

1.3 Performance Metrics

• A laptop with a 3 GHz processor can perform around 3 billion calculations per
second.
• High-performance computers can perform quadrillions of calculations per second.
A quadrillion is a number with 15 zeros (1,000,000,000,000,000), so this rate is
written as PFLOPS (petaflops) for floating-point operations, or PetaOPS more
generally.
• An HPC system is a supercomputer containing thousands of nodes (servers) working
together to complete one or more tasks. This is called "parallel processing."
• High-performance computing architecture comprises several servers networked
together, also known as an HPC cluster. Each server in this cluster is called a node,
and the nodes work together to increase processing speed.
HPC systems consume more power and space, and generate noise

• Space: HPC machines appear as rows upon rows of racks, taking up thousands of
square feet.
• Power consumption: potentially multiple megawatts of electrical power.
• Noise and heat: the machines generate a lot of noise and rapidly shifting
temperature gradients.
Biggest Supercomputer in the US

The TITAN petaflops machine was fully deployed at Oak Ridge National Laboratory in
2013. It takes up more than 4000 square feet and consumes approximately 8
megawatts of electrical power. It has a theoretical peak performance of over 27
petaflops and delivers 17.6 petaflops Rmax sustained performance on the HPL
(Linpack) benchmark. The architecture includes NVIDIA GPU accelerators.
1.4 High Performance Computing Systems :
Conventional Heterogeneous Multicore System Architecture

(Diagram: nodes 1 through N, each containing multicore sockets with a core array,
scratchpad memory, memory banks, an accelerator, and a NIC, all connected by a
global interconnection network.)
1.4 High Performance Computing Systems : Conventional
Heterogeneous Multicore System Architecture
1. Core Array: A structured arrangement of multiple processor cores
designed for parallel processing.
2. Multicore Socket: A single CPU package containing multiple cores,
allowing for simultaneous processing of multiple tasks.
3. NIC (Network Interface Card): A hardware component enabling
high-speed network communication between nodes in an HPC cluster.
4. Scratchpad Memory: Fast, local memory used to store frequently
accessed temporary data, enhancing computational efficiency.

1.4 High Performance Computing Systems : Conventional Heterogeneous
Multicore System Architecture

1. Core Array
• A core array in HPC refers to a configuration of multiple processor cores
arranged in a structured, interconnected grid or array.
• This setup is designed to maximize computational efficiency and
parallelism.
• Each core in the array can execute instructions and perform calculations
independently, and they often work together to handle large-scale
computations.
• The core array consists of either homogeneous cores or heterogeneous
cores.
• The interconnect network between these cores is optimized to ensure high
data transfer bandwidth and low latency, facilitating efficient parallel
processing.
Homogeneous Cores: Identical cores with uniform capabilities, e.g., Intel Xeon
processors.

• Homogeneous Cores
• Homogeneous cores are identical processing units within a system.
Each core has the same architecture, capabilities, and performance
characteristics. This uniformity simplifies the design of parallel
applications, as all tasks can be distributed equally among the cores.
• Example:
• Intel Xeon Processors: These processors feature multiple identical cores, all
capable of executing the same instructions at the same speed. They are often
used in data centers and HPC environments where uniform performance is
desired.

Heterogeneous Cores: Different types of cores optimized for
specific tasks, e.g., NVIDIA Tegra X1 and Apple M1.

• Heterogeneous Cores
• Heterogeneous cores refer to systems that contain different types of processing units,
each optimized for specific tasks. This configuration can provide better overall
performance and energy efficiency by using the most appropriate core for each task.
• Example:
• NVIDIA Tegra X1: This SoC (System on Chip) combines ARM Cortex-A57 and Cortex-A53
CPU cores (for general-purpose processing) with a 256-core Maxwell GPU (for graphics
and parallel computations). The different cores handle different types of workloads,
providing a balance between performance and power efficiency.
• Apple M1: This processor includes high-performance cores (for demanding tasks) and
high-efficiency cores (for less intensive tasks) along with an integrated GPU and a Neural
Engine for machine learning tasks. This design allows the system to optimize
performance and power consumption dynamically.
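The scheduling idea behind such designs can be shown as a toy dispatcher. The cost threshold, task names, and core labels below are invented for illustration; the real schedulers in these chips are far more sophisticated:

```python
def schedule(tasks, heavy_threshold=100):
    # Toy dispatcher: route demanding tasks to performance cores and
    # light tasks to efficiency cores, in the spirit of heterogeneous
    # designs like the Apple M1. Each task is a (name, cost) pair, and
    # the cost units and threshold are arbitrary illustrative numbers.
    placement = {"performance": [], "efficiency": []}
    for name, cost in tasks:
        core = "performance" if cost >= heavy_threshold else "efficiency"
        placement[core].append(name)
    return placement
```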

Conventional Heterogeneous Multicore System Architecture

2. Multicore Socket
• A multicore socket is a single physical CPU package that contains
multiple processor cores. Modern processors, especially those used in
HPC systems, integrate several cores within a single socket to increase
computational power and enable parallel processing.
• Multicore CPU: A CPU with multiple cores within one physical
package. Each core can independently execute its own thread or
process, allowing multiple tasks to be processed simultaneously.
• Socket: The physical interface on the motherboard that houses the
CPU. An HPC system can have multiple sockets, each with a multicore
CPU, increasing the total number of cores available for computation.
Conventional Heterogeneous Multicore System Architecture

3. Network Interface Card (NIC)


• A Network Interface Card (NIC) is a hardware component that
connects a computer to a network. In HPC, NICs are crucial for
enabling high-speed data transfer between nodes in a cluster.
• Function: NICs handle the input and output of data to and from the
network, facilitating communication between different nodes
(computers) in an HPC cluster.
• High-Speed Interconnects: In HPC, NICs often support high-speed
interconnect technologies such as InfiniBand, Ethernet, or Omni-Path,
which provide low-latency and high-bandwidth communication
essential for distributed computing tasks.
Conventional Heterogeneous Multicore System Architecture

4. Scratchpad Memory
• Scratchpad memory is a type of fast, local memory used in HPC systems to
store temporary data that is frequently accessed by the processor cores. It
is designed to be faster and more efficient than regular main memory
(DRAM).
• Purpose: Scratchpad memory is used to reduce latency and increase the
speed of data access for certain computations. It is often employed in
specialized processors like GPUs or TPUs to store intermediate results and
data that require quick read/write access.
• Characteristics: It is usually smaller in size compared to main memory but
much faster, providing a dedicated space for critical data during
computation-intensive tasks.
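The access pattern that scratchpads support can be sketched in plain Python: stage one small tile of data into a local buffer, do all the arithmetic on the buffer, then move to the next tile. Here ordinary lists merely model "main memory" and the "scratchpad", and the tile size is an illustrative choice:

```python
def blocked_sum_of_squares(data, tile=4):
    # Process the data one tile at a time: copy a small block into a
    # scratchpad-like local buffer, compute on that buffer, then move
    # on -- the same blocking pattern real kernels use to keep hot data
    # in fast local memory instead of repeatedly touching DRAM.
    total = 0
    for start in range(0, len(data), tile):
        scratch = data[start:start + tile]    # fill the local buffer
        total += sum(x * x for x in scratch)  # compute on the buffer
    return total
```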
Conventional Heterogeneous Multicore System Architecture
5. Accelerator:
A co-processor: a special hardware component used to speed up computations that
are compute-intensive or time-consuming. Example: Google's TPU.
• A Tensor Processing Unit (TPU) is an AI accelerator application-specific
integrated circuit (ASIC) developed by Google for neural network machine
learning, using Google's own TensorFlow software.

What distinguishes an HPC system from a conventional computer?

• Purpose & Usage: Designed for complex computations and large-scale simulations.
• Architecture: Composed of a cluster of many nodes (each node being a powerful
computer) connected through high-speed networks.
• Performance: Delivers significantly higher performance in terms of processing
speed, memory capacity, and data throughput.
• Scalability: Highly scalable; more nodes can be added to increase computational
power.
• Software and Applications: Runs specialized software optimized for parallel
processing and large-scale computations.
• Cost and Maintenance: Significantly more expensive to purchase, operate, and
maintain due to the complexity and scale of the hardware.
1.5 Supercomputing Problems

Supercomputing Problems

• The main benchmark currently used to measure a supercomputer's peak
performance is a dense linear algebra problem.
• Dense linear algebra (DLA) problems are problems that can be solved using a
relatively small set of standard mathematical operations, such as matrix
multiplication, LU factorization, or the symmetric eigenvalue problem. DLA
problems include: solving dense systems of linear equations, least-squares
problems, eigenvalue and singular value problems, and other related
computational tasks.
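At toy scale, the heart of this problem class (solving a dense system Ax = b) can be written as pure-Python Gaussian elimination with partial pivoting. HPL itself uses a highly tuned LU factorization, so this is a sketch of the problem, not of benchmark-grade code:

```python
def solve_dense(a, b):
    # Solve the dense linear system a x = b by Gaussian elimination
    # with partial pivoting (row swaps for numerical stability).
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Pivot: bring the largest remaining entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back-substitution from the last row upward.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x
```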

(Figure: A Particle-In-Cell simulation from the Gyrokinetic Toroidal Code (Princeton
Plasma Physics Laboratory) simulating a plasma within a Tokamak fusion device. A
sampling of particles within the toroid is shown, colored according to velocity, with
supercomputing processor boundaries delineated by the toroidal subdivisions.)
1.6 APPLICATION PROGRAMMING
• What are the requirements and characteristics of application programming in the
context of HPC?
• The principal view the user has of an HPC system is through one or more
programming interfaces, which take the form of programming languages,
libraries, or other services. Key concerns include:
1. Correctness
2. Reliability
3. Performance is the driving requirement that differentiates HPC programming
from other domains.
• Performance is most significantly represented by the need for representation and
exploitation of computational parallelism: the ability to perform multiple tasks
simultaneously.
4. Parallel processing involves the definition of parallel tasks; establishing the
criteria that determine when a task is performed; synchronization among tasks,
in part to coordinate sharing; and allocation to computing resources.
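The synchronization requirement in point 4 can be sketched with Python threads: several tasks update shared state, and a lock coordinates the sharing so concurrent read-modify-write updates cannot interleave. The counter and the worker/iteration counts are illustrative:

```python
import threading

def tally(counts, key, lock, times):
    # Each parallel task repeatedly updates shared state; the lock
    # synchronizes the tasks so their read-modify-write updates on the
    # shared dictionary cannot interleave and lose increments.
    for _ in range(times):
        with lock:
            counts[key] = counts.get(key, 0) + 1

def run_tasks(workers=4, times=1000):
    counts, lock = {}, threading.Lock()
    threads = [
        threading.Thread(target=tally, args=(counts, "hits", lock, times))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts["hits"]
```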

APPLICATION PROGRAMMING

5. Control of the relationship of allocations of data and tasks to the physical
resources of the parallel and distributed systems.
• The nature of the parallelism may vary significantly depending on the form of
computer system architecture targeted by the application program.
6. Also of concern are issues of determinism, correctness, performance debugging,
and performance portability.
Parallel Programming Models used in HPC determine the nature of parallelism

Depending on the class of parallel system architecture, different programming
models are employed. One dimension of differentiation is the granularity of the
parallel workflow.
• Coarse-grained parallelism
• Very coarse-grained workloads with no interactivity, sometimes referred to as
"embarrassingly parallel" or "job-stream" workflow
• Fine-grained parallelism
• Multiple-thread shared-memory system programming interfaces such as
OpenMP and Cilk++
• Medium-grained parallelism
• Highly scaled massively parallel processors (MPPs) and clusters, primarily
represented by communicating sequential processes such as the
message-passing interface (MPI)
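The coarse-grained, "embarrassingly parallel" case above can be sketched with the standard concurrent.futures module: each job is fully independent, so the executor simply farms jobs out and collects results, much as a batch scheduler dispatches a job stream. The job function and its parameters are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_job(params):
    # Stand-in for one independent batch job: it never communicates
    # with other jobs, which is what makes the workload
    # "embarrassingly parallel".
    return params ** 2

def run_job_stream(all_params, workers=4):
    # Farm the independent jobs out to a pool of workers and gather
    # the results in submission order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate_job, all_params))
```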

Backup slides - Only for reference

Intel Xeon processor - An example of homogeneous cores

NVIDIA and Apple integrated on a chip - Examples of heterogeneous cores

NVIDIA Tegra X1    Apple M1