HPC Intro Ad OS
Introduction
• What is HPC?
• HPC (High-Performance Computing) refers to aggregating computing power in a
way that delivers much higher performance than typical desktop or workstation
systems. It is commonly used for scientific simulations, large-scale data
analysis, and complex mathematical calculations.
• Key Benefits:
• Speed: HPC systems can perform trillions of floating-point calculations per
second (teraflops) or even quadrillions (petaflops).
• Capacity: They handle large datasets that normal computers
cannot process.
• Efficiency: HPC systems use multiple processors in parallel,
leading to faster completion of tasks.
Components of HPC
[Cluster diagram: a master node connected to the compute nodes through a switch]
Compute Nodes
• These are the core components of the cluster responsible for performing calculations and
running applications.
• They typically consist of multiple CPU cores and can include GPUs or other accelerators for
specific workloads.
• CPU Nodes: General-purpose processors that handle the majority of tasks.
• GPU Nodes: Specialized processors designed for highly parallel tasks, such as
rendering graphics or scientific simulations.
• Parallelism: Tasks are divided into smaller sub-tasks and distributed across
multiple nodes to be processed simultaneously. This is called parallel computing.
• Typically compute nodes don’t do any cluster management functions; they
just compute.
• Users generally do not have direct access to the compute nodes; access to
these resources is controlled by a scheduler or batch system.
• Compute nodes are usually systems that run the bare minimum OS –
meaning that unneeded daemons are turned off and unneeded packages
are not installed – and have the bare minimum hardware.
Interconnect (Networking)
• The nodes are connected by high-speed networks called
interconnects, which allow data to move between
nodes quickly.
• Common interconnect technologies include InfiniBand
and Ethernet.
• The faster the interconnect, the quicker nodes can share
data and work together.
Storage Systems
• High-speed storage systems hold the input data that jobs read and the results
they write, making them available for post-processing, analysis, or visualization.
Moore's Law and HPC
• Moore's Law:
• Predicted by Gordon Moore in 1965.
• States that the number of transistors on a microchip doubles
approximately every two years, leading to a corresponding increase in
computing power.
• Relevance to HPC:
• Increasing Power: As processors get more powerful, HPC systems
become faster and can handle larger datasets and more complex
computations.
• Parallelism: Even though individual processors are more powerful, HPC
leverages many processors (nodes) working together to solve
problems in parallel, taking full advantage of Moore’s Law.
• Challenges: As we approach the physical limits of miniaturizing
transistors, the focus shifts to parallel computing and specialized
processors (e.g., GPUs) in HPC to continue improving performance.
Importance of FPU (Floating Point Unit)
The FPU is a specialized part of the CPU designed to handle floating-point arithmetic, which
involves numbers with fractional parts as well as very large or very small values.
• Essential for scientific and engineering tasks where precise calculations involving decimals
are common.
• The term "FLOPS" stands for "floating-point operations per second," and
"tera" represents a trillion (10^12). Therefore, one teraflop is equivalent to
one trillion floating-point operations per second.
• Theoretical performance, often expressed in TFLOPS, represents the maximum
computational power a computer system or processor is capable of achieving
under ideal conditions. It is a measure of peak performance.
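To make this concrete, theoretical peak FLOPS is typically estimated as cores × clock rate × floating-point operations per cycle. A minimal sketch with illustrative numbers (teaching assumptions, not the specs of any real machine):

# Peak FLOPS = cores x clock rate (Hz) x FLOPs per cycle.
# All values below are illustrative assumptions, not real hardware specs.
cores = 64               # total CPU cores in a node
clock_hz = 2.5e9         # 2.5 GHz clock rate
flops_per_cycle = 16     # e.g., wide vector units with fused multiply-add

peak = cores * clock_hz * flops_per_cycle
print(f"Theoretical peak: {peak / 1e12:.2f} TFLOPS")   # -> 2.56 TFLOPS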
What is CUDA?
• CUDA (Compute Unified Device Architecture) is a parallel
computing platform and programming model developed by NVIDIA.
• It allows developers to use GPUs (Graphics Processing Units) for
general-purpose processing, not just graphics rendering.
Why CUDA is Important for HPC
• Accelerates Performance: CUDA enables massive parallelism,
allowing thousands of threads to execute simultaneously on a GPU.
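As a minimal sketch of what CUDA programming looks like, the example below uses Numba's Python CUDA bindings (rather than CUDA C++) to add two vectors on the GPU; the array size and launch configuration are arbitrary illustrative choices, and a CUDA-capable GPU plus the numba package are assumed.

# GPU vector addition: thousands of threads each handle one element.
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)            # this thread's global index
    if i < out.size:            # guard threads past the end of the array
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)   # launch ~1M parallel threads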
• Streaming Multiprocessors (SMs): The GPU is divided into several Streaming Multiprocessors,
each of which contains many cores. These cores execute instructions in parallel, making the SM the
building block of parallel computation.
• Cores (ALUs - Arithmetic Logic Units):
• Each SM contains multiple ALUs, which are responsible for executing arithmetic operations such as addition,
subtraction, and multiplication.
• The large number of ALUs allows the GPU to handle many operations at once, ideal for tasks like matrix operations
in machine learning or scientific simulations.
• Warp:
• A warp is a group of 32 threads that execute the same instruction simultaneously on different data.
• The GPU schedules and executes warps to achieve high throughput, ensuring all cores are utilized.
• Memory Hierarchy in a GPU
1. Global Memory:
1. Largest and slowest memory on the GPU.
2. Accessible to all cores, but with a higher latency than other types of memory.
3. Often used for transferring data between the CPU and GPU.
2. Shared Memory:
1. Fast, low-latency memory shared among all cores within an SM.
2. Useful for communication between threads and for speeding up certain computations by reducing
access to slower global memory.
3. Registers:
1. Each thread in a warp has its own set of registers for storing temporary variables.
2. Registers provide the fastest access to data but are limited in number.
4. L1 and L2 Cache:
1. L1 Cache: Small, fast memory cache located close to the SM. Helps reduce latency when accessing global
memory.
2. L2 Cache: Larger than L1, shared across the entire GPU to reduce traffic to global memory. Stores frequently
accessed data.
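A sketch of how shared memory is used in practice, again via Numba's CUDA bindings: each block stages a tile of the input in fast per-SM shared memory before computing on it. The tile size and the trivial "scale by 2" computation are illustrative assumptions.

# Stage data in shared memory, synchronize, then compute from the fast copy.
import numpy as np
from numba import cuda, float32

TILE = 128   # threads (and elements) per block; an arbitrary choice

@cuda.jit
def scale_with_shared(data, out):
    tile = cuda.shared.array(TILE, float32)   # fast per-block shared memory
    i = cuda.grid(1)
    t = cuda.threadIdx.x
    if i < data.size:
        tile[t] = data[i]        # one read from slow global memory
    cuda.syncthreads()           # wait until the whole tile is staged
    if i < data.size:
        out[i] = tile[t] * 2.0   # operate on the fast shared copy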
SIMD Execution Model (Single Instruction, Multiple Data)
• GPUs operate on the SIMD model, meaning that a single instruction is executed on multiple data points at once.
• Each core in an SM executes the same instruction in parallel across different data (e.g., processing many pixels or data
points at the same time).
• This is ideal for tasks where the same computation needs to be applied to large datasets, like matrix operations or
vector processing.
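The same idea can be seen from Python with NumPy, where a single vectorized expression applies one operation across an entire array (a CPU-side analogy to SIMD, using made-up temperature data):

# One instruction, many data points: convert a whole array at once.
import numpy as np

temps_c = np.array([12.0, 18.5, 23.1, 30.4])   # many data points
temps_f = temps_c * 9.0 / 5.0 + 32.0           # single expression applied to all
print(temps_f)                                 # [53.6  65.3  73.58 86.72]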
Interconnect:
• The GPU communicates with the CPU and memory using high-speed interfaces like PCIe. Data is transferred from the
CPU's memory to the GPU’s global memory, processed, and then sent back.
Thermal Management:
• GPUs often handle intensive workloads, leading to heat generation. Modern GPUs are equipped with advanced
cooling systems (heat sinks, fans, liquid cooling) to maintain optimal performance and avoid thermal throttling.
Typical HPC Process Steps
[Flow diagram: Computation → Result Aggregation → Storage and Output]
• Step 1: Task Breakdown (Parallelism)
The task is broken into multiple smaller sub-tasks, which are then executed in
parallel, often in SIMD (Single Instruction, Multiple Data) fashion.
Compiler optimizations such as auto-vectorization help expose this parallelism.
For instance, if you're simulating weather, different parts of the simulation (temperature,
pressure, humidity) can be processed in parallel across different regions of the map.
• Step 2: Task Scheduling
The job scheduler assigns each sub-task to a node. The scheduling can be batch
processing (where jobs wait in a queue) or real-time processing depending on the
priority.
Software like Slurm, Torque, or Moab is used to manage and allocate resources, schedule jobs, and ensure
efficient utilization of compute nodes.
Batch processing systems like Slurm handle job submission, scheduling, and resource allocation based on
user-defined policies.
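As an illustration of batch submission, the sketch below generates a minimal Slurm job script and submits it with sbatch; the job name, partition, resource amounts, and the simulate_weather program are all placeholder assumptions.

# Write a minimal Slurm batch script and submit it to the queue.
import subprocess

job_script = """#!/bin/bash
#SBATCH --job-name=weather_sim
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
#SBATCH --time=02:00:00
#SBATCH --partition=compute

srun ./simulate_weather --region all
"""

with open("job.sbatch", "w") as f:
    f.write(job_script)

subprocess.run(["sbatch", "job.sbatch"], check=True)   # job now waits in the queue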
• Step 3: Data Transfer
The data each sub-task needs is moved to the node that will run it. Load balancing
ensures that each node is assigned the right amount of work, optimizing system
performance.
• Step 4: Computation
Each node processes its assigned task independently, making use of its CPUs and
GPUs.
For example, in a weather simulation, Node 1 might be calculating the temperature for
Region A, while Node 2 handles pressure for Region B.
• Step 5: Result Aggregation
Once all nodes have completed their tasks, the results are combined (aggregated) into
a final output.
The aggregation process also includes error checking and validation to ensure
accurate results.
• Step 6: Storage and Output
The computed results are then stored in high-speed storage systems and made
available for post-processing, analysis, or visualization.
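Putting the steps together on a single machine, here is a minimal sketch of the break-down → parallel compute → aggregate pattern using Python's multiprocessing; simulate_region is a stand-in for a real per-region computation.

# Scatter sub-tasks to workers, compute in parallel, gather the results.
from multiprocessing import Pool

def simulate_region(region_id):
    # Placeholder for a real computation on one region (Step 4).
    return region_id, sum(i * i for i in range(100_000))

if __name__ == "__main__":
    regions = range(8)                  # Step 1: break the task into sub-tasks
    with Pool(processes=4) as pool:     # Steps 2-3: distribute work to workers
        results = pool.map(simulate_region, regions)
    combined = dict(results)            # Step 5: aggregate per-region results
    print(combined)                     # Step 6: store or output the final result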
Important terms in HPC
Clock Speed
• Definition: The speed at which a CPU or GPU can
execute instructions, measured in GHz.
• Impact: A higher clock speed allows the processor to
perform more operations in a given time.
• In HPC: Critical for quick data processing, though
performance depends on other factors like cores and
memory.
Number of Cores
• Definition: The number of processing units within a
CPU or GPU.
• Impact: More cores enable parallelism — multiple
tasks running simultaneously.
• In HPC: More cores mean higher processing power,
vital for handling complex workloads efficiently.
SIMD Units
• Definition: Single Instruction, Multiple Data (SIMD)
units process a single instruction on multiple data points
at once.
• Impact: Increases efficiency in parallel data
processing.
• In HPC: Enhances computational throughput,
particularly for tasks like matrix operations and
scientific simulations.
Pipeline Efficiency
• Definition: Refers to how efficiently the CPU or GPU
processes instructions in the pipeline.
• Impact: Higher efficiency means fewer delays in
instruction execution, leading to faster overall
performance.
• In HPC: Important for minimizing bottlenecks and
maximizing throughput.
Memory Bandwidth
• Definition: The speed at which data is transferred
between memory and processors, measured in GB/s.
• Impact: Higher bandwidth allows faster access to data,
improving processing speeds.
• In HPC: Essential for large-scale data processing,
reducing latency in memory-heavy operations.
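A rough way to observe memory bandwidth from user code is to time a large in-memory copy and divide the bytes moved by the elapsed time; this is a crude estimate, not a calibrated benchmark.

# Crude bandwidth estimate: bytes moved per second during a big array copy.
import time
import numpy as np

a = np.ones(100_000_000, dtype=np.float64)   # ~0.8 GB of data
start = time.perf_counter()
b = a.copy()                                 # one full read plus one full write
elapsed = time.perf_counter() - start

bytes_moved = 2 * a.nbytes                   # read 0.8 GB + write 0.8 GB
print(f"~{bytes_moved / elapsed / 1e9:.1f} GB/s")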
Precision
• Definition: The accuracy of floating-point calculations,
commonly measured as single precision (32-bit) or
double precision (64-bit).
• Impact: Higher precision (e.g., double precision)
provides more accurate calculations but requires more
computational power.
• In HPC: Critical for scientific calculations where high
accuracy is necessary.
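A quick way to see the difference: near 1.0, float32 resolves only about 7 significant digits while float64 resolves about 15-16, so a tiny increment survives in double precision but vanishes in single precision.

# Single vs. double precision: adding 1e-8 to 1.0.
import numpy as np

x32 = np.float32(1.0) + np.float32(1e-8)   # lost: below float32 resolution
x64 = np.float64(1.0) + np.float64(1e-8)   # kept: within float64 resolution
print(x32 == 1.0)   # True  -> the increment disappeared in 32-bit
print(x64 == 1.0)   # False -> the increment survived in 64-bit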
Parallelism
• Definition: The ability to perform multiple calculations
at the same time.
• Impact: Greatly increases efficiency by dividing tasks
among multiple processors or cores.
• In HPC: Vital for handling massive datasets and
complex simulations in less time.
Software Optimization
• Definition: The efficiency of software in utilizing the
hardware resources available.
• Impact: Well-optimized software ensures that the
hardware's full potential is used.
• In HPC: Proper optimization can dramatically improve
performance, especially on specialized hardware.
Typical List of Software
• AMBER 18
• CUDA 11.6
• CUDNN
• FFMPEG 5.0.1
• FREESURFER
• GROMACS
• LEPTONICA
• MATLAB 2023
• NAMD
• QUANTUM ESPRESSO
• R
• OPENMPI
• SINGULARITY
Example Applications of HPC
• Deep Learning: Training large neural networks for image recognition, natural language processing, and more.
• Data Analysis: Analyzing vast datasets for insights and pattern recognition.
• Particle Physics: At the Large Hadron Collider (LHC), analyzing vast amounts of data to discover new particles and understand the fundamental structure of matter.
Ada Cluster Overview
• The cluster consists of 123 nodes, each featuring dual Intel Xeon 2640 v4, Xeon Gold
5317, or AMD EPYC 9124 processors and 128 to 256 GB of RAM. Each node is also
equipped with four NVIDIA 1080 Ti, 2080 Ti, 3080 Ti, 3090, RTX 6000, or L40S GPUs.
• The nodes are connected to each other via a Gigabit Ethernet network. All compute
nodes have a 1.8 TB local scratch and a 960 GB local SSD scratch. The compute
nodes are running Ubuntu 18.04 LTS. SLURM software is used as a job scheduler and
resource manager.
• No. of nodes: 123
• No. of CPU cores: 3408
• No. of GPUs: 484
• Total memory: 20,352 GB
• Storage : ~ 320 TB
• TFLOPS: 98.2 (CPU) + 9987 (FP32 GPU)
• Installed Libraries and Software: CUDA, CuDNN, FFMPEG, FFTW, LAPACK, MKL,
OpenMPI, TensorRT, AMBER 20, Caffe, GROMACS, Leptonica, MATLAB, PyTorch, R,
Quantum Espresso, Singularity, VASP.
Ada Layout
[Diagram of the Ada cluster layout]