
Introduction to HPC

Introduction

• What is HPC?
• HPC refers to aggregating computing power in a way that
delivers much higher performance than typical desktop or
workstation systems. It is commonly used for scientific
simulations, large-scale data analysis, and complex
mathematical calculations.
• Key Benefits:
• Speed: HPC systems can perform trillions of calculations per
second (teraflops or petaflops).
• Capacity: They handle large datasets that normal computers
cannot process.
• Efficiency: HPC systems use multiple processors in parallel,
leading to faster completion of tasks.
Components of HPC

A typical cluster layout: a master node connected through a switch to several compute nodes.

Master Node
• The "controller" or "management" node of the cluster.
• The node where you log in to interact with the HPC system.
• Controls the cluster and performs its housekeeping.
• In smaller clusters, it is used for computation as well as management.
• As the cluster grows larger, it becomes specialized and is no longer used for computation.
Compute Node

• These are the core components of the cluster responsible for performing calculations and
running applications.
• They typically consist of multiple CPU cores and can include GPUs or other accelerators for
specific workloads.
• CPU Nodes: General-purpose processors that handle the majority of tasks.
• GPU Nodes: Specialized processors designed for highly parallel tasks, such as
rendering graphics or scientific simulations.
• Parallelism: Tasks are divided into smaller sub-tasks and distributed across
multiple nodes to be processed simultaneously. This is called parallel computing.
• Typically compute nodes don't perform any cluster-management functions; they just compute.
• Users generally do not have direct access to the compute nodes; access to these resources is controlled by a scheduler or batch system.

• Compute nodes are usually systems that run a bare-minimum OS – meaning that unneeded daemons are turned off and unneeded packages are not installed – and carry only the bare minimum of hardware.
Interconnect (Networking)
• The nodes are connected by high-speed networks called
interconnects, which allow data to move between
nodes quickly.
• Common interconnect technologies include InfiniBand
and Ethernet.
• The faster the interconnect, the quicker nodes can share
data and work together.
Storage Systems

• HPC systems require fast, large-capacity storage to handle the massive amounts of data generated during computations.
• Local Storage: Each node may have its own storage for temporary data.
• Central Storage: High-speed, centralized storage systems (e.g., parallel file systems like Lustre or GPFS) store and manage data across the whole system.
Additional Nodes
• As the cluster grows, other roles typically arise,
requiring that nodes be added.
• For example,
• data servers can be added to the cluster. These nodes don’t
run applications; rather, they store and serve data to the rest
of the cluster.
• Additional nodes can provide data visualization capabilities within the cluster (usually remote visualization).
• Very large clusters might need nodes dedicated to monitoring the cluster.
• Login nodes let users log in to the cluster and launch applications.
Software and Middleware

• Operating Systems: HPC systems often run specialized versions of Linux (e.g., CentOS, RHEL) because of their flexibility and performance tuning for large-scale systems.
• Job Schedulers/Resource Managers: Software like Slurm, Torque, Moab, or PBS manages the distribution of tasks (jobs) to available nodes, optimizing workload efficiency.
• Parallel Libraries and Tools: Tools like MPI (Message Passing Interface) and OpenMP (Open Multi-Processing) allow the distribution of tasks across multiple processors or nodes; a minimal MPI sketch follows below.
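The sketch below is a minimal, illustrative MPI program (ours, not from the slides): each process (rank) reports its identity, the same pattern a real HPC code uses to decide which piece of the problem to work on.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                 /* start the MPI runtime */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        /* In a real job, rank would select this process's share of the work. */
        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                         /* shut down the MPI runtime */
        return 0;
    }

Built with mpicc and launched with, e.g., mpirun -np 4 ./a.out; on a cluster, the scheduler typically starts one rank per allocated core or node.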
Other HPC Components
Management and Monitoring Tools: These tools help administrators
manage and monitor the cluster's health, performance, and utilization.
Examples include Ganglia, Nagios, and OpenHPC.
Backup and Data Replication: Data protection mechanisms, such as
backups and data replication across multiple nodes or sites, ensure data
integrity and availability.
User Interfaces: User interfaces provide a way for users to interact with
the cluster, submit jobs, monitor progress, and access cluster resources.
Remote Access: Remote access mechanisms, such as SSH and VPNs,
enable administrators and users to manage and access the cluster from
remote locations securely.
Scalability: HPC clusters are designed with scalability in mind, allowing
for the addition of more compute nodes, storage, and network resources
as computational needs grow.
Maintaining and Monitoring HPC Infrastructure
• Cooling and Power: Proper cooling and power management are
essential to maintain the cluster's stability and prevent
overheating or power failures. This includes cooling solutions like
HVAC systems and redundant power supplies.
• Backup Power: Uninterruptible Power Supplies (UPS) or backup
generators can provide power redundancy to protect against
power outages.
• Environmental Monitoring: Sensors and monitoring systems
help track temperature, humidity, and other environmental
factors that can impact the cluster's performance and longevity.
Moore's Law and HPC

• Moore's Law:
• Predicted by Gordon Moore in 1965.
• States that the number of transistors on a microchip doubles
approximately every two years, leading to a corresponding increase in
computing power.
• Relevance to HPC:
• Increasing Power: As processors get more powerful, HPC systems
become faster and can handle larger datasets and more complex
computations.
• Parallelism: Even though individual processors are more powerful, HPC
leverages many processors (nodes) working together to solve
problems in parallel, taking full advantage of Moore’s Law.
• Challenges: As we approach the physical limits of miniaturizing
transistors, the focus shifts to parallel computing and specialized
processors (e.g., GPUs) in HPC to continue improving performance.
Importance of FPU (Floating Point Unit)
FPU is a specialized part of the CPU designed to handle floating-point arithmetic, which
involves decimal numbers and very large or small values.
• Essential for scientific and engineering tasks where precise calculations involving decimals
are common.

FPU and FLOPS (Floating Point Operations Per Second):


• HPC performance is often measured in FLOPS, which quantifies how many floating-point
operations a system can perform per second.
• The more efficient the FPU, the higher the FLOPS, leading to faster computation in HPC tasks.
Parallelism and Efficiency:
• HPC systems use many FPUs in parallel, enabling large numbers of calculations to happen
simultaneously.
• This parallelism helps HPC systems solve complex problems much faster by distributing tasks
across multiple FPUs.
Precision in Calculations:
• FPUs support both single and double-precision floating-point operations, depending on the
accuracy needed.
• In HPC, double-precision is often required for scientific accuracy, making the FPU essential
for handling these high-precision tasks.
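A tiny sketch (ours; plain C++ host code, loop count illustrative) makes the trade-off concrete: repeatedly adding 0.1 in single precision drifts visibly from the exact answer, while double precision stays close.

    #include <cstdio>

    int main() {
        const int N = 10000000;      // ten million additions; exact sum = 1,000,000
        float  sum_f = 0.0f;         // 32-bit accumulator
        double sum_d = 0.0;          // 64-bit accumulator
        for (int i = 0; i < N; ++i) {
            sum_f += 0.1f;
            sum_d += 0.1;
        }
        printf("single precision: %f\n", sum_f);   // drifts well away from 1,000,000
        printf("double precision: %f\n", sum_d);   // stays close to 1,000,000
        return 0;
    }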
FLOPS vs MIPS
• FLOPS (Floating Point Operations Per Second):
Measures how many floating-point calculations a system can
perform in one second. Used to describe the power of HPC
systems.
• MIPS (Million Instructions Per Second): Measures how
many general instructions (handled by the CPU) can be
processed. CPUs are good at managing these general tasks.
• HPC systems focus on FLOPS since most of the
computational tasks are floating-point intensive, making FPUs
essential.
Performance: TFLOPS
• TFLOPS stands for "teraflops," a unit of measurement used to quantify the computational performance or processing speed of a computer system; it provides a useful benchmark for comparing the computational capabilities of different systems.

• The term "FLOPS" stands for "floating-point operations per second," and "tera" represents a trillion (10^12). Therefore, one teraflop is equivalent to one trillion floating-point operations per second.
• Theoretical performance, often expressed in TFLOPS, represents the maximum computational power a computer system or processor is capable of achieving under ideal conditions. It is a measure of peak performance.
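As a worked example (the numbers are hypothetical, not from the slides), theoretical peak is the product of core count, clock rate, and floating-point operations per cycle per core:

    Peak FLOPS = N_cores × f_clock × FLOPs per cycle per core

For instance, a hypothetical 64-core CPU at 2.5 GHz sustaining 16 double-precision FLOPs per cycle per core (e.g., wide vector fused multiply-add units) would peak at 64 × 2.5×10^9 × 16 = 2.56×10^12 FLOPS = 2.56 TFLOPS. Real applications typically reach only a fraction of this peak.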
What is CUDA?
• CUDA (Compute Unified Device Architecture) is a parallel
computing platform and programming model developed by NVIDIA.
• It allows developers to use GPUs (Graphics Processing Units) for
general-purpose processing, not just graphics rendering.
Why CUDA is Important for HPC
• Accelerates Performance: CUDA enables massive parallelism,
allowing thousands of threads to execute simultaneously on a GPU.

Real-World Applications in HPC


• Scientific Simulations: CUDA is used in fields like physics, molecular
biology, and climate modeling to run simulations faster.
• Big Data Processing: It helps process huge datasets quickly by
leveraging the GPU’s ability to handle parallel workloads.
GPU vs CPU:
CPU:
• Optimized for single-threaded performance.
• Fewer cores, but each optimized for individual task performance.
GPU:
• Excels at parallel tasks, making it ideal for large-scale computations like simulations or data processing in HPC.
• Thousands of cores that can handle many tasks at once, ideal for large-scale parallel computing in HPC.
How CUDA Works in HPC
Parallel Processing
• CUDA programs are written to run multiple tasks at once on
hundreds or thousands of GPU cores.
• Each core handles a small part of the computation, making it
possible to solve problems that require vast computational power.
Easy Integration with Existing Languages
• CUDA can be integrated into common programming languages like
C, C++, and Python, making it accessible for HPC programmers.
High Throughput for HPC
• CUDA increases throughput in HPC tasks like:
• Matrix operations
• Image and signal processing
• Machine learning algorithms.
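A minimal CUDA kernel sketch (ours; names and launch configuration are illustrative) shows the pattern: each GPU thread computes one element of a vector sum.

    // Each thread adds one pair of elements: c[i] = a[i] + b[i].
    __global__ void vector_add(const float *a, const float *b, float *c, int n) {
        // Global index built from the block and thread coordinates.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // guard: the final block may be only partly used
            c[i] = a[i] + b[i];
    }

    // Launch: 256 threads per block, enough blocks to cover all n elements.
    // vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

Here d_a, d_b, d_c are device pointers; the host-side steps that allocate and fill them appear in a later sketch.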
GPU Internals (Deep Dive) - Architecture of a GPU

• Streaming Multiprocessors (SMs): The GPU is divided into several Streaming Multiprocessors,
each of which contains many cores. These cores execute instructions in parallel, making the SM the
building block of parallel computation.
• Cores (ALUs - Arithmetic Logic Units):
• Each SM contains multiple ALUs, which are responsible for executing arithmetic operations such as addition,
subtraction, and multiplication.
• The large number of ALUs allows the GPU to handle many operations at once, ideal for tasks like matrix operations
in machine learning or scientific simulations.
• Warp:
• A warp is a group of 32 threads that execute the same instruction simultaneously on different data.
• The GPU schedules and executes warps to achieve high throughput, ensuring all cores are utilized.
• Memory Hierarchy in a GPU
1. Global Memory:
   • Largest and slowest memory on the GPU.
   • Accessible to all cores, but with higher latency than other types of memory.
   • Often used for transferring data between the CPU and GPU.
2. Shared Memory:
   • Fast, low-latency memory shared among all cores within an SM.
   • Useful for communication between threads and for speeding up certain computations by reducing access to slower global memory.
3. Registers:
   • Each thread in a warp has its own set of registers for storing temporary variables.
   • Registers provide the fastest access to data but are limited in number.
4. L1 and L2 Cache:
   • L1 Cache: Small, fast memory cache located close to the SM. Helps reduce latency when accessing global memory.
   • L2 Cache: Larger than L1, shared across the entire GPU to reduce traffic to global memory. Stores frequently accessed data.
SIMD Execution Model (Single Instruction, Multiple Data)
• GPUs operate on the SIMD model, meaning that a single instruction is executed on multiple data points at once.
• Each core in an SM executes the same instruction in parallel across different data (e.g., processing many pixels or data
points at the same time).
• This is ideal for tasks where the same computation needs to be applied to large datasets, like matrix operations or
vector processing.

Thread and Block Management


• Threads: A GPU can run thousands of threads in parallel, with each thread performing a part of a larger computation.
• Blocks: Threads are grouped into blocks. Each block is executed on a single SM, and threads within the block can
share data through shared memory and synchronize execution.
• Grid: A collection of thread blocks that are executed across multiple SMs. A grid represents the total workload
submitted to the GPU.
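To make the block/shared-memory idea concrete, here is a standard block-level sum-reduction sketch (ours, not from the slides). It assumes a block size of exactly 256 threads and an input length that is a multiple of 256; threads of one block cooperate through shared memory and stay in step via __syncthreads().

    __global__ void block_sum(const float *in, float *block_results) {
        __shared__ float tile[256];        // fast per-SM shared memory
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        tile[tid] = in[i];                 // each thread stages one element
        __syncthreads();                   // wait until the whole tile is loaded

        // Tree reduction: the active thread count halves each step.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                tile[tid] += tile[tid + stride];
            __syncthreads();               // keep the block in lockstep
        }
        if (tid == 0)                      // thread 0 writes the block's partial sum
            block_results[blockIdx.x] = tile[0];
    }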
Execution Flow
1. Launch Kernels: The CPU sends kernels (GPU functions) to the GPU to be executed in parallel.
2. Thread Scheduling: The GPU schedules threads in groups (warps) and assigns them to different SMs.
3. Memory Access: Threads fetch data from global memory, and cache/memory optimizations ensure quick access to
frequently used data.
4. Processing: Each core processes a part of the workload, executing the same instruction in parallel across multiple
data points.
5. Result: The results are stored back in memory and transferred to the CPU for further processing if needed.
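A host-side sketch of steps 1–5 (ours; error handling omitted for brevity), reusing the vector_add kernel from the earlier sketch:

    #include <cuda_runtime.h>

    void run_vector_add(const float *h_a, const float *h_b, float *h_c, int n) {
        size_t bytes = n * sizeof(float);
        float *d_a, *d_b, *d_c;

        cudaMalloc(&d_a, bytes);           // allocate GPU global memory
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);

        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    // CPU -> GPU
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n); // launch kernel

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);    // results back to CPU

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);            // release GPU memory
    }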

Interconnect:
• The GPU communicates with the CPU and memory using high-speed interfaces like PCIe. Data is transferred from the
CPU's memory to the GPU’s global memory, processed, and then sent back.
Thermal Management:
• GPUs often handle intensive workloads, leading to heat generation. Modern GPUs are equipped with advanced
cooling systems (heat sinks, fans, liquid cooling) to maintain optimal performance and avoid thermal throttling.
Typical HPC Process Steps

Task Breakdown → Task Scheduling → Data Transfer → Computation → Result Aggregation → Storage and Output
• Step 1: Task Breakdown (Parallelism)
 Single Instruction, Multiple Data (SIMD): The task is broken into multiple smaller tasks, which are then executed in parallel.
 Parallelism at the processor/instruction level: pipelining (overlap in execution: fetch, decode, execute).
 Typical programming techniques (see the unrolling sketch below):
 Code modifications: unrolling, cache reuse
 Compiler optimizations
For instance, if you're simulating weather, different parts of the simulation (temperature, pressure, humidity) can be processed in parallel across different regions of the map.
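A small sketch of the unrolling technique listed above (ours; plain C++, assumes n is a multiple of 4 for brevity):

    void scale(float *x, float s, int n) {
        for (int i = 0; i < n; i += 4) {   // unrolled by a factor of 4
            x[i]     *= s;
            x[i + 1] *= s;                 // four independent multiplies per
            x[i + 2] *= s;                 // iteration give the processor's
            x[i + 3] *= s;                 // pipeline more work to overlap
        }
    }

Fewer loop-control instructions run per useful operation, and the independent multiplies can overlap in the pipeline; compilers often apply this transformation automatically.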
• Step 2: Task Scheduling

 The job scheduler assigns each sub-task to a node. The scheduling can be batch
processing (where jobs wait in a queue) or real-time processing depending on the
priority.

 Software like Slurm, Torque, or Moab is used to manage and allocate resources, schedule jobs, and ensure
efficient utilization of compute nodes.

 Batch processing systems like Slurm handle job submission, scheduling, and resource allocation based on
user-defined policies.
• Step 3: Data Transfer
 The interconnect handles the communication between nodes, transferring necessary data between them. This ensures the nodes work together without bottlenecks.
 Load balancing ensures that each node is assigned the right amount of work, optimizing system performance.
• Step 4: Computation

 Each node processes its assigned task independently, making use of its CPUs and
GPUs.

 For example, in a weather simulation, Node 1 might be calculating the temperature for
Region A, while Node 2 handles pressure for Region B.
• Step 5: Result Aggregation

 Once all nodes have completed their tasks, the results are combined (aggregated) into
a final output.

 The aggregation process also includes error checking and validation to ensure
accurate results.
• Step 6: Storage and Output

 The computed results are then stored in high-speed storage systems and made
available for post-processing, analysis, or visualization.
Important terms in HPC
Clock Speed
• Definition: The speed at which a CPU or GPU can
execute instructions, measured in GHz.
• Impact: A higher clock speed allows the processor to
perform more operations in a given time.
• In HPC: Critical for quick data processing, though
performance depends on other factors like cores and
memory.
Number of Cores
• Definition: The number of processing units within a
CPU or GPU.
• Impact: More cores enable parallelism — multiple
tasks running simultaneously.
• In HPC: More cores mean higher processing power,
vital for handling complex workloads efficiently.
SIMD Units
• Definition: Single Instruction, Multiple Data (SIMD)
units process a single instruction on multiple data points
at once.
• Impact: Increases efficiency in parallel data
processing.
• In HPC: Enhances computational throughput,
particularly for tasks like matrix operations and
scientific simulations.
Pipeline Efficiency
• Definition: Refers to how efficiently the CPU or GPU
processes instructions in the pipeline.
• Impact: Higher efficiency means fewer delays in
instruction execution, leading to faster overall
performance.
• In HPC: Important for minimizing bottlenecks and
maximizing throughput.
Memory Bandwidth
• Definition: The speed at which data is transferred
between memory and processors, measured in GB/s.
• Impact: Higher bandwidth allows faster access to data,
improving processing speeds.
• In HPC: Essential for large-scale data processing,
reducing latency in memory-heavy operations.
Precision
• Definition: The accuracy of floating-point calculations,
commonly measured as single precision (32-bit) or
double precision (64-bit).
• Impact: Higher precision (e.g., double precision)
provides more accurate calculations but requires more
computational power.
• In HPC: Critical for scientific calculations where high
accuracy is necessary.
Parallelism
• Definition: The ability to perform multiple calculations
at the same time.
• Impact: Greatly increases efficiency by dividing tasks
among multiple processors or cores.
• In HPC: Vital for handling massive datasets and
complex simulations in less time.
Software Optimization
• Definition: The efficiency of software in utilizing the
hardware resources available.
• Impact: Well-optimized software ensures that the
hardware's full potential is used.
• In HPC: Proper optimization can dramatically improve
performance, especially on specialized hardware.
References
1. Introduction to High-Performance Computing
• Paper: An Introduction to High-Performance Computing (HPC)
• Authors: Thomas Sterling, Matthew Anderson
• Summary: This paper gives an overview of the core concepts of HPC, including parallel computing, clusters,
and cloud HPC. It's a beginner-friendly introduction to the architecture and uses of HPC.
• Why it helps: Good for understanding the basics of HPC and why it’s important in modern science and
industry.
2. HPC for Healthcare Applications
• Paper: High-Performance Computing in Biomedical Research
• Summary: Discusses the role of HPC in transforming biomedical research by enabling the analysis of
complex data, from genomics to drug discovery.
• Why it helps: Shows how HPC is applied to healthcare challenges like disease diagnosis and personalized
medicine, making it relatable and socially impactful.
3. HPC and Climate Change
• Paper: HPC for Climate and Weather Research: Current Status and Future Prospects
• Summary: This paper explores the use of HPC in climate modeling and weather forecasting. It highlights how
HPC helps in simulating complex environmental systems.
• Why it helps: Climate change is a socially relevant issue, and showing how HPC aids in its study can engage
students interested in environmental science.
4. HPC in Social Impact Projects
• Paper: High-Performance Computing for Humanitarian Assistance and Disaster Relief
• Summary: This paper looks at how HPC is used to simulate and respond to natural disasters,
helping with tasks like evacuation modeling and predicting the impact of earthquakes.
• Why it helps: Provides concrete examples of how HPC directly benefits society by improving
response to disasters.

5. HPC for Solving Energy Problems
• Paper: The Role of High-Performance Computing in Energy Efficiency and Sustainability
• Summary: Discusses how HPC is utilized in designing sustainable energy solutions, including improving efficiency in solar and wind power technologies.
• Why it helps: Energy sustainability is a highly relevant field. This paper connects HPC to impactful research in renewable energy, engaging students with the societal benefits.

6. Understanding Parallel Computing Basics
• Paper: An Overview of Parallel Computing for Beginners
• Summary: A simpler introduction to parallel computing, which is fundamental to HPC. It discusses how parallel processing helps in faster computations.
• Why it helps: It's a good starting point to understand how HPC works on a technical level, without diving into too much complexity.
SLURM Resource Manager

* SLURM (Simple Linux Utility for Resource Management)

* Scalable to the largest clusters (> 16,000 nodes)

* Allocates resources within a cluster to jobs:

– Nodes
– Processors
– Memory
– GPUs

* Launches and manages jobs

* Advance reservations

* Plugin support
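As an illustration (the program is ours; the environment variable names are standard SLURM ones), each task in a SLURM job can read its place in the allocation from the environment:

    #include <cstdio>
    #include <cstdlib>

    int main() {
        const char *job  = getenv("SLURM_JOB_ID");   // ID of the allocated job
        const char *rank = getenv("SLURM_PROCID");   // this task's index
        const char *n    = getenv("SLURM_NTASKS");   // total tasks in the job

        printf("job %s: task %s of %s\n",
               job ? job : "?", rank ? rank : "?", n ? n : "?");
        return 0;
    }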
Typical List of Software

* AMBER 18
* CUDA 11.6
* CUDNN
* FFMPEG 5.0.1
* FREESURFER
* GROMACS
* LEPTONICA
* MATLAB 2023
* NAMD
* QUANTUM ESPRESSO
* R
* OPENMPI
* SINGULARITY

Mainly used for DL / ML and Molecular Dynamics Simulations


HPC Applications
1. Scientific Research and Simulation:
 Climate Modeling: Simulating climate patterns and predicting climate change.
 Astrophysics: Modeling the behavior of galaxies, stars, and cosmic phenomena.
 Molecular Dynamics: Studying the behavior of molecules and proteins for drug discovery.
 Quantum Mechanics: Simulating quantum systems for materials science and physics research.
2. Engineering and Product Design:
 Aerospace: Aircraft and spacecraft design, aerodynamics, and structural analysis.
 Automotive: Vehicle crash simulations, engine design, and optimization.
 Civil Engineering: Structural analysis, earthquake simulations, and bridge design.
 Oil and Gas: Reservoir modeling, seismic data analysis, and exploration.
3. Computational Chemistry and Biology:
 Drug Discovery: Molecular modeling, virtual screening, and pharmacology research.
 Genomics: DNA sequencing and analysis for understanding genetic diseases.
 Proteomics: Analyzing protein structures and interactions.
4. Financial Modeling:
 Risk Assessment: Analyzing market data to assess financial risks and make investment decisions.
 Algorithmic Trading: Developing and testing trading strategies.
 Portfolio Optimization: Optimizing investment portfolios for maximum return and minimum risk.
5. Weather and Climate Prediction:
 Weather Forecasting: Providing accurate weather forecasts for short-term and long-term predictions.
 Climate Modeling: Simulating and predicting climate patterns for climate change studies.
HPC Applications
6. Energy and Nuclear Research:
 Nuclear Simulations: Studying nuclear reactions, fusion, and fission processes.
 Renewable Energy: Optimizing wind and solar farm designs for energy production.
7. Artificial Intelligence and Machine Learning:
 Deep Learning: Training large neural networks for image recognition, natural language processing, and more.
 Data Analysis: Analyzing vast datasets for insights and pattern recognition.
8. Computational Fluid Dynamics (CFD):
 Aeronautics: Studying airflow around aircraft and spacecraft.
 Automotive: Analyzing fluid dynamics in engine design and aerodynamics.
 Environmental Impact: Assessing environmental effects of fluid flow in rivers and oceans.
9. Particle Physics:
 Large Hadron Collider (LHC): Analyzing vast amounts of data to discover new particles and understand the fundamental structure of matter.
10. Cryptography and Security:
 Cryptography: Breaking codes and developing secure encryption methods.
 Cybersecurity: Identifying and mitigating cyber threats through pattern recognition and analysis.
IIIT – HPC Ada cluster

• The cluster consists of 123 nodes, each featuring dual Intel Xeon E5-2640 v4, Xeon Gold
5317, or AMD EPYC 9124 processors and 128 to 256 GB of RAM. Each node is also
equipped with four NVIDIA 1080 Ti, 2080 Ti, 3080 Ti, 3090, RTX 6000, or L40S GPUs.
• The nodes are connected to each other via a Gigabit Ethernet network. All compute
nodes have a 1.8 TB local scratch and a 960 GB local SSD scratch. The compute
nodes are running Ubuntu 18.04 LTS. SLURM software is used as a job scheduler and
resource manager.
• No. of nodes: 123
• No. of CPU cores: 3408
• No. of GPUs: 484
• Total memory: 20,352 GB
• Storage : ~ 320 TB
• TFLOPS: 98.2 (CPU) + 9987 (FP32 GPU)

• Installed Libraries and Software: CUDA, CuDNN, FFMPEG, FFTW, LAPACK, MKL,
OpenMPI, TensorRT, AMBER 20, Caffe, GROMACS, Leptonica, MATLAB, PyTorch, R,
Quantum Espresso, Singularity, VASP.
Ada Layout
(Diagram of the Ada cluster layout.)