
A guide to high-performance computing

Understanding cluster architectures, tooling and deployment


Contents
Executive summary
What is high-performance computing (HPC)?
HPC vs supercomputing
What are the main use cases for HPC?
Computational Fluid Dynamics (CFD)
High-performance data analytics (HPDA)
AI (Artificial Intelligence) and Machine Learning
What are HPC clusters?
Workstations and remote visualisation
Servers
Compute nodes
Head nodes
Storage nodes
Operating system
Linux in HPC
Cluster provisioning
Cluster provisioning solutions
Networks
Networking solutions
Storage
General-purpose storage
Clustered file system
Object storage solutions
Storage solutions
Scheduling, workloads and workload portability
Schedulers
Scheduling solutions
MPI libraries and libraries for parallel computation
What is MPI?
MPI Solutions
Workloads
Workload solutions
Containers
Container solutions
Auxiliary services
Monitoring and observability
Observability solutions
Where do you run HPC clusters?
HPC in public clouds
Dedicated private HPC clusters
Hybrid HPC
HPC at the edge
Take your next steps in HPC with Canonical

Executive summary
This ebook is an introductory guide to high-performance computing. It
summarises the different use cases, workload and processing types that exist in
HPC. It gives an overview of HPC clusters and their architecture while examining
where they can be deployed - whether on-premise (on your own hardware) or in
the public cloud. It also highlights the many different components involved in
HPC clusters.

Overall, you will find this guide useful to understand the inner workings of
HPC clusters, their architecture, typical use cases and associated tooling for
HPC implementations. After reading this ebook, you should have a sufficient
understanding of the world of HPC and be equipped to evaluate what you need
to get started.

What is high-performance computing (HPC)?


High-performance computing combines computational resources together
as a single resource. The combined resources are often referred to as a
supercomputer or a compute cluster. HPC makes it possible to deliver
computational intensity and process complex computational workloads and
applications at high speeds and in parallel. Before delving deeper into HPC, let’s
explore how it differs from supercomputing.

HPC vs supercomputing
These days, supercomputing has become a synonym for high-performance
computing. However, they are not exactly interchangeable: supercomputers
and supercomputing generally refer to the larger cluster deployments and
the computation that takes place there. HPC mainly refers to a computation
performed using extremely fast computers on clusters ranging from small-
scale HPC clusters to large supercomputers. Most often, HPC clusters and
supercomputers share the same architecture and are built out of commodity
servers.

Historically, supercomputing was a type of high-performance computing that took advantage of a special set of systems. Similar to the HPC clusters of today,
they worked on massively complex or data-heavy problems, but comparing the
two is a little bit like comparing apples to pears when it comes to computing
power. Even a mere mobile phone today is more powerful than the first
supercomputers.

For example, some mobile phones can reach a few gigaflops, whereas the CDC
6600, a supercomputer designed by Seymour Cray in the 1960s, was estimated to
deliver about three megaflops. At the time, supercomputers were more powerful
than anything else on the market and very expensive to build and develop. Their
architecture was far superior to the personal computers that were available. That
is why they were called supercomputers.

These early supercomputers were quite different from current HPC clusters in terms of architecture. Ultimately, they were huge multi-processor systems with
very specialised functionality and were generally reserved for the realm of
governments and research universities. Fast-forward to 2023, and you will find
that HPC systems are now applied in a vast array of industries.

What are the main use cases for HPC?
HPC is used to solve some of the most advanced and toughest computational
problems we have today. These problems exist in all sectors, such as science,
engineering, or business. Some of the most popular use cases for HPC include:

• Climate modelling and weather prediction
• Oil and gas exploration
• Automotive and aerospace engineering
• Financial analysis and risk assessment
• Drug discovery and protein folding
• Image and video processing
• Reverse encryption and intrusion detection for cybersecurity
• Genomics research and analysis

These use cases are solved with numerical equations, such as those in computational fluid dynamics (CFD), or by analysing and processing large data sets, as in high-performance data analytics (HPDA), artificial intelligence and machine learning.

Workloads for these different use cases can be classified into one or many
different types, depending on how they are executed or processed. Batch
processing, for instance, involves running a large number of similar jobs in
a sequence. Real-time processing involves processing data in real-time as it
arrives. Interactive processing involves running interactive applications such as
simulations or data visualisations.

Let’s explore some of these use cases in more detail, as they are closely related
to HPC.

Computational Fluid Dynamics (CFD)


Computational Fluid Dynamics (CFD) is a branch of science that uses numerical
methods and algorithms to solve and analyse fluid flows. It is used to study the
motion of solids, liquids and gases, and to analyse and predict the effects of
fluid flows on structures and systems. CFD is an important tool for engineers
and scientists, as it can be used to study the behaviour of complex systems in
a wide range of applications, including aerospace, automotive, and biomedical
engineering.
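
To give a feel for what "numerical methods" means in practice, here is a deliberately tiny sketch in C: an explicit finite-difference step for the 1D diffusion equation, one of the simplest building blocks behind CFD-style solvers. The grid size, coefficients and step count are all illustrative choices, not taken from any real solver.

    /* Finite-difference sketch: explicit time stepping of the 1D
     * diffusion equation du/dt = alpha * d2u/dx2 on a small grid. */
    #include <stdio.h>

    #define N 64

    int main(void)
    {
        double u[N], unew[N];
        const double alpha = 0.1, dx = 1.0, dt = 0.4; /* stable: alpha*dt/dx^2 <= 0.5 */

        for (int i = 0; i < N; i++)       /* initial condition: one hot cell */
            u[i] = (i == N / 2) ? 1.0 : 0.0;

        for (int step = 0; step < 100; step++) {
            for (int i = 1; i < N - 1; i++)
                unew[i] = u[i] + alpha * dt / (dx * dx)
                               * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
            unew[0] = unew[N - 1] = 0.0;  /* fixed boundary values */
            for (int i = 0; i < N; i++)
                u[i] = unew[i];
        }

        printf("centre value after 100 steps: %f\n", u[N / 2]);
        return 0;
    }

Production CFD codes solve far harder equations on enormous 3D meshes, which is exactly why they are split across many compute nodes.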

High-performance data analytics (HPDA)


High-performance data analytics (HPDA) is the process of analysing large
quantities of data quickly and efficiently in order to gain useful insights. It involves
using specialised techniques, hardware, and software to analyse data at scale,
identify patterns and trends and make real-time decisions. High-performance
data analytics can be used in various fields, such as finance, healthcare, and
marketing. The end goal is to improve efficiency and increase profits.

AI (artificial intelligence) and machine learning
AI (artificial intelligence) and machine learning are related fields of computer
science focusing on the development of computer systems that can learn, reason,
and make decisions. AI and machine learning (ML) involve the use of algorithms to
identify patterns and trends in data sets, and to make predictions and decisions
from the data. AI and ML are used in a variety of applications, including data
mining, natural language processing, autonomous vehicles, and more.

So how does HPC work? Let’s explore the components behind HPC cluster
architecture and common tools used to run HPC workloads. First, let’s define
what we mean by an HPC cluster.

What are HPC clusters?


HPC clusters are a collection of resources used for the primary purpose of
running computational workloads.

HPC clusters consist of:

• Workstations that interact with the workload for pre or post-processing.
• Servers that are often deployed as Compute, Head, and Storage Nodes.
• Compute nodes execute the workloads.
• Head nodes are for user access and cluster interaction.
• Storage nodes are for data, general or computational storage.
• An operating system used to operate the servers.
• A cluster provisioner that ensures node homogeneity and is used to deploy the
operating system the servers run on.
• A network for communication between nodes.
• Storage solutions:
• A general-purpose storage solution to store applications and user data.
• A high-speed, low-latency clustered file system generally used for
computational storage.
• Object storage, in some cases.
• Scheduling capabilities.
• Workloads and the libraries that those workloads depend on.
• Auxiliary services including:
• Identity management to keep user access consistent throughout a cluster.
• An observability and monitoring stack that provides insight into workload
resource utilisation.

HPC cluster architecture

In the following pages, we will explore these components in more detail.

Workstations and remote visualisation


Users should be able to define workloads and view their results before and after
computation. This generally takes place on workstations. Those can either be
local to the user or accessed remotely. It's common for HPC workloads to have pre- and post-processing steps, such as creating a CAD-based 3D model before computation, or viewing simulation output, graphical data or numerical results afterwards. This is often accomplished with graphical VDI resources, co-located with the HPC cluster resources when data transfer is a concern. Conversely, local workstations can be used, with data transferred to or from the HPC cluster as required for local viewing or further work.

Cluster management solutions for VDI with GPU or vGPU support include:

• MAAS and LXD for managing VDI resources
• OpenStack
• Solutions from Citrix

Organisations can also provision workstations remotely for HPC. Solutions that
enable a remote workstation experience include:

• Remote access software such as VNC, which provides access to the desktop
environment.
• Desktop environments such as Ubuntu Desktop running on a VM.
• Desktop workstations running in the cloud, such as Ubuntu in AWS workspaces.

Servers
A server is a computer or system that provides resources, data, services, or
programs to other computers, known as clients, over a network. Servers can
provide various functionalities, often called services, such as sharing data or
resources among multiple clients, or performing computations for a client.
Common examples of server types include web servers, application servers,
database servers, and file servers. In high-performance computing, servers are
used for two primary purposes: to compute mathematical models or process
data, and to serve data through file servers. Servers used for computation and
data processing are generally called compute nodes. Servers that serve data are
generally referred to as storage nodes.

Compute nodes
Compute nodes are the processing component of an HPC cluster. They execute
the workload using local resources, like CPU, GPU, FPGA and other processing
units. These workloads also use other resources on the compute node for
processing, such as the memory, storage and network adapter. Depending on how a workload uses those components, it can be limited by one or more of them during execution. For example, workloads that use a lot of memory might be limited by memory bandwidth or capacity. Workloads that read or write large amounts of data during computation might be limited by network bandwidth or storage performance. Other workloads simply need plenty of computational resources and are limited by the processing ability of the cluster.

When creating and designing these clusters, it’s important to understand the
resource utilisation of the workload and design the cluster with that in mind. The
best way to understand workload resource usage is to monitor the resources consumed, which reveals where the limitations lie.

Head nodes
A head node, or access node, acts as an entry point into an HPC cluster. It's where users interact with the input and output of their workloads, get access to the local storage systems available to that cluster, and schedule their workloads. The scheduler, in turn, executes processes on compute nodes.

Storage nodes
A storage node is a computer or server responsible for storing and providing
access to data over a network. Storage nodes are typically connected to other
storage nodes in a storage cluster and provide access to data stored on the
cluster. They are often connected to other storage or compute nodes via a high-
speed network, such as InfiniBand or Ethernet, providing access to data directly
or via a file system. Multiple protocols exist to provide storage access, from
traditional NFS share to other shared storage implementations such as Lustre or
BeeGFS.

Operating system
To operate the nodes, you need an operating system (OS). The OS is responsible for managing
the computer’s memory, processor, storage, and other components. It also provides an interface
between the user and the servers, allowing users to interact with the computer and execute
programs. While Windows and macOS see some use, Linux is by far the most common operating system in HPC.

Linux in HPC

The Linux operating system, probably one of the most recognised open-source projects, has
been both a driver for open-source software in HPC and been driven by HPC use cases. NASA was an early user of Linux, and Linux, in turn, was fundamental to the first Beowulf cluster. Beowulf clusters were essentially clusters created using commodity servers and high-speed interconnects, instead of more traditional mainframes or supercomputers. The first Beowulf cluster was deployed at NASA and went on to shape HPC as we know it today. It drove Linux adoption in government and expanded well beyond that sector. Today, this type
of cluster is used by enterprises as well.

HPC has driven a lot of development efforts in Linux, all focused heavily on driving down latency
and increasing performance across the stack - from networking to storage.

Ubuntu is the Linux OS preferred by 66% of developers and it is ideal for HPC. It can be used for
workstations, to access HPC clusters, or installed on servers, giving the user a uniform experience
across both.

Cluster provisioning
Node homogeneity is important in HPC to ensure workload consistency. That’s
why it's common to see HPC clusters provisioned with metal-as-a-service solutions that help organisations manage this infrastructure at scale.

Cluster provisioning solutions

MAAS
Metal as a Service, or MAAS, is an open source project developed and maintained
by Canonical. MAAS was created with one purpose: API-centric, bare-metal
provisioning. MAAS automates all aspects of hardware provisioning, from
detecting a racked machine to deploying a running, custom-configured operating
system. It makes management of large server clusters, such as those in HPC,
easy through abstraction and automation. It was created to be easy to use, has
a comprehensive UI - unlike many other tools in this space - and is highly scalable
thanks to its disaggregated design. MAAS is split into a region controller which
manages the overall state of the cluster, anywhere from keeping information
on the overall hardware specification to maintaining information about which
servers have been provisioned and which servers are available. Moreover, it makes
all of this available to the user. MAAS also comes with a stateless rack controller
that handles PXE booting and Power Control. Multiple rack controllers can be
deployed, allowing for easy scale out regardless of the environment’s size. It’s
notable that MAAS can be deployed in a highly available configuration, giving it
the fault tolerance that comparable projects in the industry don’t have.

xCAT

Extreme Cloud Administration Toolkit, or xCAT, is an open-source project developed by IBM. Its main focus is on the HPC space, with features primarily
catering to the creation and management of diskless clusters, parallel installation
and management of Linux cluster nodes. It's also suitable for setting up high-performance computing stacks such as batch job schedulers, and it can clone and image Linux and Windows machines. It has some features that
primarily cater to IBM and Lenovo servers. It’s used by many large governmental
HPC sites for the deployment of diskless HPC clusters.

Warewulf

Warewulf's stated purpose is to be a "stateless and diskless container operating system provisioning system for large clusters of bare metal and/or virtual
systems”. It has been used for HPC cluster provisioning for the last two decades.
Warewulf has recently been rewritten using Golang in its latest release, Warewulf
v4.

Networks
Parallel HPC workloads heavily depend on inter-process
communication. When that communication takes place within a compute
node, it’s just passed from one process to another through the memory of that
computational node. But when a process communicates with a process on another
computational node, that communication needs to go through the network.
This inter-process communication might be quite frequent. If that’s the case, it’s
important that the network has low latency to prevent communication delays
between processes. After all, you don’t want to spend valuable computation
time on processes awaiting message deliveries. In cases where data sizes are
large, it’s important to deliver that data as fast as possible. That’s enabled by high
throughput networks. The faster the network can deliver data, the sooner any
processes can start working on the workload. Frequent communication and large
message and data sizes are regular features of HPC workloads. This has led to the
creation of specialised networking solutions that often deliver low latency and
high throughput to meet HPC-specific demands.

Networking solutions

Ethernet

Ethernet is the most commonly used technology for providing network connectivity. To understand Ethernet, it helps to understand the OSI
model, where connectivity is described in seven layers:

1. Physical
2. Data Link
3. Network
4. Transport
5. Session
6. Presentation
7. Application

This model is comprehensive and caters to a need for reliable communication.


In HPC, where performance and latency are of ultimate importance, the transport layer protocols typically used over Ethernet are sometimes considered inefficient. For example, TCP, the reliable transport protocol commonly used over Ethernet networks, requires a lot of acknowledgement traffic, which adds extra overhead. This is less of an issue with UDP, which was not designed for the same reliability. There have been efforts to improve the
efficiency of Ethernet-based networks. RDMA over Converged Ethernet (RoCE)
is a networking protocol that allows remote, direct memory access (RDMA) over
an Ethernet network; it does this by encapsulating an InfiniBand (IB) transport
packet over Ethernet. This avoids many of the overheads associated with
traditional transport protocols, enabling lower latency, lower CPU load and higher
bandwidth.

Nvidia InfiniBand

InfiniBand is a high-speed networking technology used in HPC clusters and supercomputers for data communication between and within computers. It is also used to connect servers to
storage systems directly or via a switch, as well as to connect storage systems
together. InfiniBand provides very high speed and very low latency, making
it ideal for storage and high-performance computing applications such as
those that depend on MPI for parallel communication. For example, the latest
generation of InfiniBand offers 400Gb/s of connectivity per port. The latency of
an InfiniBand switch is around 100 ns vs about 230 ns for Ethernet switches.

This is what makes InfiniBand a popular option as a high-speed interconnect for HPC clusters.

HPE Cray Slingshot

Slingshot is compatible with Ethernet, while also delivering capabilities similar to InfiniBand in terms of throughput and latency. The latest generation offers
200Gb/s of connectivity per port. Being Ethernet-based, it offers convenient
features, such as direct switch-to-switch connectivity between HPE Cray
Slingshot switches and traditional Ethernet switches.

Cornelis OmniPath

Formerly Intel OmniPath, Cornelis OmniPath is a high-speed interconnect based on the combination of two technologies that Intel acquired: TrueScale InfiniBand and the Aries interconnect from the Cray XC supercomputer line. After acquiring Barefoot Networks in 2019, Intel decided to focus on the technology obtained from that acquisition over OmniPath, as it saw an opportunity for switches with programmable ASICs for high-speed interconnect use cases. Intel spun out its OmniPath-based product line into a new company, Cornelis Networks, which continues to develop and maintain the OmniPath product line outside of Intel.

Rockport switchless network

Rockport is an Ethernet-based high-speed interconnect solution that avoids switches: NICs are directly connected in a large grid and provide connectivity and routing between each other as needed. NICs that are not directly connected reach each other by routing through intermediate NICs. In simple terms, the NICs act as switches to provide connectivity.

Storage
Storage solutions in the HPC space are most commonly file-based, with POSIX
support. These file-based solutions generally fall into two categories: general-purpose and parallel storage solutions. Other solutions,
such as object storage, or Blob (Binary Large Objects) storage, as it’s sometimes
referred to in HPC, can be utilised by some workloads directly, but not all
workloads have that capability.

General-purpose storage
There are two main uses for general-purpose storage in an HPC cluster. One
would be for the storage of available application binaries and their libraries.
That’s because it’s important that all binaries and libraries are consistent across
the cluster when running an application, making central storage convenient.
The other would be for the user’s home directories and other user data, as it’s
important for the user to have consistent access to their data throughout the
HPC cluster. It’s common to use an NFS server for this purpose, but other storage
protocols do exist that enable POSIX-based file access.

Clustered file system


A clustered file system or parallel file system is a shared file system which serves
storage resources from multiple servers and can be mounted and used by
multiple clients at the same time. This gives clients direct access to stored data,
which, in turn, cuts out overheads by avoiding abstraction, resulting in low latency
and high performance. Some systems can even approach the aggregate performance of the underlying hardware.

Object Storage
Object storage solutions are often used in HPC clusters to archive past computational results or other related data. Alternatively, they may be used directly by workloads that support native object storage APIs.

Storage solutions
There are various storage solutions available, both proprietary and open source.
The ones that are most commonly used in HPC are detailed below.

Ceph

Ceph is an open-source software-defined storage solution based on object storage. It was originally created by Sage Weil for a doctoral dissertation
and has roots in supercomputing. Its creation was sponsored by the Advanced
Simulation and Computing Program (ASC) which includes supercomputing centres
such as Los Alamos National Laboratory (LANL), Sandia National Laboratories (SNL),
and Lawrence Livermore National Laboratory (LLNL). Its creation started through a
summer program at LLNL. After concluding his studies, Sage continued to develop
Ceph full time, and created a company called Inktank to further its development.
Inktank was eventually purchased by Red Hat. Ceph continues to be a strong open-
source project, and is maintained by multiple large companies, including members
of the Ceph Foundation like Canonical, Red Hat, Intel and others.

Ceph was meant to replace Lustre when it comes to supercomputing, and through significant development efforts it has added features like CephFS, which give it POSIX compatibility and make it a formidable file-based network storage system. Its foundations prioritise fault tolerance over performance, and its replication-based storage model carries significant performance overheads. Thus, it has not quite reached other solutions' level in terms of delivering close to underlying hardware performance. But at scale, Ceph is a formidable option: it scales quite well and can deliver a large share of the overall Ceph cluster's performance.

Lustre

Lustre is a parallel distributed file system used for large-scale cluster computing.
The word lustre is a blend of the words Linux and Cluster. It has consistently
ranked high on the IO500, a twice-yearly benchmark that compares storage solution performance for high-performance computing use cases, and has seen significant use throughout the TOP500 list, a twice-yearly benchmark publication focused on overall cluster performance. Lustre was originally created as a
research project by Peter J. Braam, who worked at Carnegie Mellon University,
and went on to found his own company (Cluster File Systems) to work on Lustre.
Like Ceph, Lustre was developed under the Advanced Simulation and Computing
Program (ASC) and its PathForward project, which received its funding through
the US Department of Energy (DoE), Hewlett-Packard and Intel. Sun Microsystems
eventually acquired Cluster File Systems; Sun itself was acquired shortly after by Oracle.

Oracle announced soon after the Sun acquisition that it would cease the
development of Lustre. Many of the original developers of Lustre had left Oracle
by that point and were interested in further maintaining and building Lustre
but this time under an open community model. A variety of organisations were
formed to do just that, including the Open Scalable File System (OpenSFS),
EUROPEAN Open File Systems (EOFS) and others. To join this effort by
OpenSFS and EOFS a startup called Whamcloud was founded by several of the
original developers. OpenSFS funded a lot of the work done by Whamcloud.
This significantly furthered the development of Lustre, which continued after
Whamcloud was eventually acquired by Intel. Through restructuring at Intel,
the department focused on Lustre development was eventually divested to the storage company DDN.

BeeGFS

A parallel file system developed for HPC, BeeGFS was originally developed at
the Fraunhofer Centre for High-Performance Computing by a team around
Sven Breuner. He became the CEO of ThinkParQ, a spin-off company created to
maintain and commercialise professional offerings around BeeGFS. It’s used by
quite a few European institutions whose clusters reside in the TOP500.

DAOS

Distributed Asynchronous Object Storage, or DAOS, is an open-source storage solution aiming to take advantage of the latest generation of storage
technologies, such as non-volatile memory or NVM. It uses both distributed Intel
Optane persistent memory and NVM express (NVMe) storage devices to expose
storage resources as a distributed storage solution. As a new contender, it did
relatively well in the IO500 10-node challenge announced during ISC High Performance 2022, where it took four places in the top 10. Intel created DAOS and actively
maintains it.

GPFS

IBM General Parallel File System (also known as IBM Spectrum Scale) is a high-
performance clustered file system used by many commercial HPC cluster
deployments as an accelerated storage solution. It can also be found in multiple
supercomputing clusters on the TOP500 list. GPFS started as the Tiger Shark file system, a research project at IBM's Almaden Research Center in 1993, initially designed for high-throughput multimedia applications. This throughput-focused
design proved to be an excellent fit for scientific computing.

VAST Data

VAST Data is a relatively new player in the storage market that offers storage
appliances that leverage some of the latest technologies. For example, they use
Intel Optane / 3D XPoint NVMe SSDs and 3D XPoint-based non-volatile memory
as part of their data architecture. These act as an accelerated data tier in front of
more cost-effective, higher-density NAND flash-based SSDs. VAST Data can be
connected through either NVMe-oF using Ethernet or InfiniBand and supports
RDMA for NFS version 3.

Weka

Weka is an appliance-based clustered storage solution, self-described as the "Data Platform for the Cloud & AI era". It is POSIX-compliant and provides
access through S3, NFS, and SMB-based backends. Like VAST Data, it is a relative
newcomer in the space and supports some relatively new technologies, such as
NVMe over Fabric and NVIDIA GPUDirect Storage.

PanFS

PanFS, created by Panasas, is a clustered file system that supports the DirectFlow, NFS and CIFS protocols for data access. Panasas was a key contributor to
Parallel NFS (pNFS), allowing clients to process file requests to multiple servers
or storage devices simultaneously instead of handling them one at a time. This
feature became part of the NFS 4.1 standard.

Scheduling, workloads and workload portability


So far we have discussed the fundamental components of HPC clusters such
as the hardware, storage solutions, provisioning and operating system. But
HPC clusters don’t just depend on these fundamentals; software is vital for the
operation of clusters. Schedulers are used to optimise cluster usage and ensure
workload execution. MPI libraries are used by workloads to enable parallel
communication, which is key for workloads to span clusters instead of running on
a single machine. The workloads themselves are also software that provides the
foundation for the computation to take place. And finally, workload portability is
becoming more important than ever, which is why we are seeing the increasing
use of containers in HPC.

Schedulers
In HPC, a scheduler queues up workloads against the resources of the cluster
in order to orchestrate its use. Schedulers act as the brain for the clusters.
They receive any requests for workloads that need to be scheduled from users
of the cluster, keep track of them and then run those workloads as needed
when resources are available. Schedulers are aware of any resource availability
and utilisation and do their best to consider any locality that might affect
performance. Their main purpose is to schedule compute jobs based on optimal
workload distribution. The schedule is often based on organisational needs.

The scheduler keeps track of the workloads and sends workloads over to another
integral component: an application process that runs on the compute nodes to
execute that workload.

Scheduling solutions
SLURM workload manager

Formerly known as the Simple Linux Utility for Resource Management, SLURM is an open-source job scheduler. Its development started as a collaborative effort
between Lawrence Livermore National Laboratory, SchedMD, HP and Bull.
SchedMD is currently the main maintainer and provides a commercially supported
offering for SLURM. It’s used on about 60% of the TOP500 clusters and is the
most frequently used job scheduler for large clusters. SLURM can currently be
installed from the Universe repositories on Ubuntu.

Open OnDemand

Not a scheduler per se, but it deserves an honourable mention alongside SLURM. Open OnDemand is a user interface for SLURM that eases the deployment of workloads via a simple web interface. It was created by the Ohio Supercomputer Center with a grant from the National Science Foundation.

Grid Engine

A batch scheduler that has had a complicated history, Grid Engine has been
known for being open source and also closed source. It started as a closed source
application released by Gridware but after their acquisition by Sun, it became Sun
Grid Engine (SGE). It was then open sourced and maintained until an acquisition
by Oracle took place, at which point they stopped releasing the source and it
was renamed Oracle Grid Engine. Forks of the last open source version soon
appeared. One, Son of Grid Engine, was maintained by the University of Liverpool but is now largely unmaintained. Another, the Grid Community Toolkit, is also available but not under active maintenance. A company called
Univa started another closed source fork after hiring many of the main engineers
of the Sun Grid Engine team. Univa Grid Engine is currently the only actively
maintained version of Grid Engine. It is closed source, and Univa was recently acquired by Altair. The Grid Community Toolkit is available on Ubuntu
under the Universe repositories.

OpenPBS

Portable Batch System (PBS) was originally developed for NASA, under a
contract by MRJ. It was made open source in 1998 and is actively developed.
Altair now owns PBS, and releases an open-source version called OpenPBS.
Another fork exists that used to be maintained as open source but has since gone
closed source. It’s called Terascale Open-source Resource and QUEue Manager
(TORQUE) and it was forked and maintained by Adaptive Computing. PBS is
currently not available as a package on Ubuntu.

HTCondor

HTCondor is a scheduler in its own right, but it differs from the others: it was written to make use of unused workstation resources rather than HPC clusters. It can execute workloads on idle systems and kills them
once it detects activity. HTCondor is available on Ubuntu in the Universe package
repository.

Kubernetes

Kubernetes is a container scheduler that has gained a loyal following for scheduling cloud-native workloads. Interest has grown in expanding the use of Kubernetes to more compute-focused workloads that depend on parallelisation. Some machine learning workloads have built up a substantial ecosystem around Kubernetes, sometimes driving a need to deploy Kubernetes as a temporary workload on a subset of cluster resources. There are also ongoing efforts to expand the scheduling capabilities of Kubernetes to better cater to the needs of computational workloads.

MPI libraries and libraries for parallel computation


While you can run HPC workloads on a single server or node, the real potential
of high-performance computing comes from running computationally
intensive tasks as processes across multiple nodes. These different processes
work together in parallel as a single application. You need a message passing
mechanism to ensure communication between processes across nodes. The
most common implementation of this in HPC is known as MPI (Message Passing
Interface).

What is MPI?
MPI is a communication protocol and a standard used to enable portable
message passing from the memory of one system to another on parallel
computers. Message-passing allows computational workloads to be run across
compute nodes connected via a high-speed networking link. This was vital to the
development of HPC, as it allowed an ever greater number of organisations to
solve their computational problems at a lower cost and at a greater scale than
ever before. Suddenly, they were no longer limited to the computational ability of
a single system.

MPI libraries provide abstractions that enable point-to-point and collective communication between processes. They are available for most programming
languages and are used by most parallel workloads to reach unparalleled scale
across large clusters.
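
To make this concrete, below is a minimal point-to-point MPI sketch in C. It assumes an MPI implementation such as OpenMPI or MPICH (both discussed below) is installed and that the code is compiled with the mpicc wrapper; the payload value is purely illustrative.

    /* Minimal MPI sketch: rank 0 sends one value to rank 1.
     * Build and run (assuming OpenMPI or MPICH):
     *   mpicc send_recv.c -o send_recv
     *   mpirun -np 2 ./send_recv
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */

        if (rank == 0 && size > 1) {
            double payload = 3.14159;          /* illustrative data */
            MPI_Send(&payload, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            printf("rank 0 of %d sent %f\n", size, payload);
        } else if (rank == 1) {
            double payload;
            MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", payload);
        }

        MPI_Finalize();
        return 0;
    }

The same source runs unchanged whether the two ranks land on one node or on two, which is precisely the portability the MPI standard provides.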

MPI solutions
OpenMP

OpenMP is an application programming interface (API) and library for parallel programming that supports shared-memory multiprocessing. When
programming with OpenMP, all threads share both memory and data. OpenMP is
highly portable and gives programmers a simple interface for developing parallel
applications that can run on anything from multi-core desktops to the largest
supercomputers. OpenMP enables threads to share work within a single node in an HPC cluster; communication between nodes requires a separate library and API. That's where MPI, or Message Passing Interface, comes in, as it allows processes to communicate across nodes. OpenMP is available on
Ubuntu through most compilers, such as GCC.
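
As a flavour of the programming model, here is a minimal OpenMP sketch in C: threads on one node share memory and split a loop between them. The loop body is purely illustrative; on Ubuntu it builds with GCC's -fopenmp flag.

    /* Minimal OpenMP sketch: a shared-memory parallel sum.
     * Build: gcc -fopenmp harmonic.c -o harmonic && ./harmonic
     */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;

        /* Threads each take a slice of the loop; reduction(+:sum)
         * merges the per-thread partial sums into one result. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= n; i++)
            sum += 1.0 / (double)i;

        printf("harmonic sum using up to %d threads: %f\n",
               omp_get_max_threads(), sum);
        return 0;
    }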

OpenMPI

OpenMPI is an open-source implementation of the MPI standard, developed and maintained by a consortium of academic, research and industry partners. It was
created through a merger of three well-known MPI implementations that are
no longer individually maintained. The implementations were FT-MPI from the
University of Tennessee, LA-MPI from Los Alamos National Laboratory and LAM/
MPI from Indiana University. Each of these MPI implementations was excellent
in one way or another. The project aimed to bring the best ideas and technologies
from each into a new world-class open-source implementation that excels overall
in an entirely new code base. OpenMPI is available on Ubuntu in the Universe
package repository.

MPICH

Formerly known as MPICH2, MPICH is a freely available open-source MPI implementation. Started by Argonne National Laboratory and Mississippi State
University, its name comes from the combination of MPI and CH. CH stands for
Chameleon, which was a parallel programming library developed by one of the
founders of MPICH. It is one of the most popular implementations of MPI, and is
used as the foundation of many MPI libraries available today, including Intel MPI,
IBM MPI, Cray MPI, Microsoft MPI and the open-source MVAPICH project. MPICH
is available on Ubuntu in the Universe package repository.

MVAPICH

Originally based on MPICH, MVAPICH is freely available and open source. The
implementation is being led by Ohio State University. Its goals are to “deliver
the best performance, scalability and fault tolerance for high-end computing
systems and servers" that use high-performance interconnects. Its development
is very active, and multiple versions are available that provide optimal hardware
compatibility and the best possible performance for the underlying fabric.
Notable developments include its support for DPU offloading, where MVAPICH
takes advantage of underlying SmartNICs to offload MPI processes. SmartNICs
and Data Processing Units (DPUs) are an advanced form of network card that carries the traditional components of a computer, such as a CPU. They can run their own operating system and process data or networking traffic passing through them, taking over some of the host's workload functions. With MVAPICH, this could mean handling the MPI communication, allowing the host's processors to focus entirely on the workload.

Workloads
Many HPC workloads come from in-house or open-source development, driven by
a strong community effort. These workloads frequently have a strong research background, initiated through university work or national interests, and often serve multiple institutes or countries. When it comes to open source, there are plenty
of workloads covering all sorts of scenarios - anything from weather research to
physics.

Workload solutions
BLAST

Basic Local Alignment Search Tool, or BLAST, is a bioinformatics algorithm for comparing biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA or RNA. It allows researchers to compare a
sequence with a library or database of known sequences, easing identification. It
can be used to compare sequences found in animals to those found in the human
genome, helping scientists identify connections between them and how they
might be expressed.

OpenFOAM

Open-source Field Operation And Manipulation, or OpenFOAM as it's better known, is an open-source toolbox used to develop numerical solvers for
computational fluid dynamics. OpenFOAM was originally sold commercially
as a program called FOAM. However, it was open-sourced under a GPL licence
and renamed to OpenFOAM. In 2018, a steering committee was formed to set
the direction of the OpenFOAM project; many of its members come from the
automotive sector. Notably, OpenFOAM is available in the Ubuntu package
repositories.

ParaView

ParaView is an open-source data analysis and visualisation platform built on a client-server architecture. It's often used to view results from programs such as
OpenFOAM and others. For optimal performance, the rendering or processing
needs of ParaView can be spun up as a scheduled cluster job allowing the use of
clustered computational resources to assist. ParaView can also be run as a single
application; it does not depend on being run exclusively on clusters through its
client-server architecture. ParaView started through a collaboration between
Kitware Inc and Los Alamos National Laboratory, with funding from the US Department of Energy. Since then, other national laboratories have joined the
development efforts. ParaView is available in the Ubuntu package repositories.

WRF

The Weather Research & Forecasting (WRF) Model is an open-source mesoscale numerical weather prediction system. It supports parallel computation and
is used by an extensive community for atmospheric research and operational
forecasting. It's used by most of the entities involved in weather forecasting
today. It was developed through a collaboration of the National Center
for Atmospheric Research (NCAR), the National Oceanic and Atmospheric
Administration (NOAA), the U.S. Air Force, the Naval Research Laboratory, the
University of Oklahoma, and the Federal Aviation Administration (FAA). It’s a truly
multidisciplinary and multi-organisational effort. It has an extensive community
of about 56,000 users in over 160 countries.

Fire Dynamics Simulator and Smokeview

Fire Dynamics Simulator (FDS) and Smokeview (SMV) are open-source applications created through efforts from the National Institute of Standards
and Technology (NIST). FDS is a computational fluid dynamics (CFD) model of
fire-driven fluid flow. It uses parallel computation to numerically solve a form of
the Navier-Stokes equations. This is appropriate for low-speed, thermal-driven
flow, which applies to the spread and transmission of smoke and heat from fires.
Smokeview (SMV) is the visualisation component of FDS and is used for analysing
the output from FDS. It allows users to better understand and view the spread of
smoke, heat and fire. It's often used to understand how large structures might be affected in such disaster scenarios.

Containers
HPC workloads often have complex dependency requirements. A lot of effort has been put into the development of module-based systems such as Lmod, which allow users to load applications or dependencies, like libraries, outside of normal system paths. This is often due to a need to compile applications against a specific set of library versions, whether numerical libraries or vendor builds. To avoid managing this complex set of dependencies directly, organisations can invest in containers. This effectively allows the user to bundle up an application
with all its dependencies into a single executable application container.

Container solutions
LXD

LXD is a next-generation system container and virtual machine manager. It offers a unified user experience around full Linux systems running inside containers
or virtual machines. Unlike most other container runtimes, it allows for the
management of virtual machines. Its ability to run full multi-application runtimes
is unique. One can effectively run a full HPC environment inside an LXD container
providing abstraction and isolation at no cost to performance.

Docker

Docker, the predominant container runtime for cloud-native applications, has seen some
usage in HPC environments. Its adoption has been limited in true multi-user
systems, such as large cluster environments, as Docker fundamentally requires
privileged access. Another downside often mentioned is the overall size of Docker
images, which is attributed to application dependencies, including MPI libraries.
This often creates large application containers that might easily duplicate
components of other application containers. However, when done right, Docker
can be quite effective for dependency management when it comes to developing
and enabling a specific hardware stack. It allows applications to be packaged against a unified base stack, which has some strengths: for example, layered container images avoid storing the same dependencies multiple times. You can see this to great effect in Nvidia NGC containers.

Singularity

Singularity, or Apptainer (the name of the most recent fork), is an application container effort that tries to address some of the perceived downsides of Docker containers. It avoids dependencies on privileged access, making it fit quite well into large multi-user environments. Instead of creating full application containers with all dependencies, containers can rely on system-level components such as the host's MPI libraries and implementations, creating leaner containers with more specific purposes and dependencies.

Charliecloud

Charliecloud is a containerisation effort that's in some ways similar to Singularity. It uses Docker to build images that can then be executed unprivileged by the
Charliecloud runtime. It’s an initiative of Los Alamos National Laboratory (LANL).

Auxiliary services
Many software components can be used to improve the usage of HPC
clusters. These include anything from identity management to monitoring and
observability software.

Identity management

Identity access managers are quite common in HPC clusters. They serve as the
single source of truth for identity and access management. Unified access makes
it easy for users to access any node in the cluster. This is often a prerequisite for
resource scheduling. For example, if you want to run a parallel job across multiple
nodes in the cluster via a batch scheduler, you need consistent access to compute
nodes and storage resources. An identity management solution can help you
ensure consistency. Without it, an administrator would need to keep user accounts, identities and storage configurations aligned across the cluster by configuring each node individually.

Identity management solutions

LDAP

LDAP (Lightweight Directory Access Protocol) is a standard for accessing directory services, such as those provided by Active Directory. It provides a way for clients to query and update directory information over a network, such as user accounts, passwords, and access rights.
The LDAP protocol is used by a variety of projects including GLAuth, OpenLDAP
and FreeIPA.

Active Directory

Active Directory is a directory services solution created by Microsoft for Windows networks. It provides a centralised location for storing, managing, and securing
user and computer information, including user accounts, passwords, and access
rights. Active Directory also provides tools for managing network resources, such
as shared folders, printers, and applications.

FreeIPA

FreeIPA is an open-source identity and access management solution created by Red Hat. It's an integrated identity and authentication solution for Linux
environments. It provides centralised authentication, authorisation and account
information. FreeIPA is built on top of a variety of open-source solutions, such
as the 389 Directory Server, which provides an LDAP server. Authentication and single sign-on are provided through the MIT Kerberos KDC, and a certificate authority is provided by the Dogtag certificate system. All in all, it's a comprehensive solution
built on top of an extensive list of open-source solutions.

Monitoring and observability

Monitoring and observability tools provide deeper insight into workload resource
utilisation and are thus key to solving any performance issues or detecting issues
with overall cluster health. Metrics that are often observed in HPC clusters
include CPU and memory utilisation, network and memory bandwidth, and
scheduler metrics such as workload throughput - which measures the number of jobs completed in a given period. Job wait times and job completion times, as well as scheduler queue utilisation, are also key metrics.

Monitoring and observability in HPC used to be limited to monitoring tools that would have as little impact on workloads as possible. These days, that is less important due to increased hardware efficiency. But of course, it's always a good idea to verify workload run time with and without any extra tooling to establish baselines for workload job runtimes.

These changes have made the modern monitoring stack more relevant for HPC
cluster monitoring. Modern solutions like Prometheus and Grafana are becoming
more visible in these clusters.

Observability solutions

Prometheus

Prometheus is an open-source monitoring system used to collect, store, and analyse metrics from applications and services. Prometheus provides a query
language and metric storage, as well as alerting and other features. It is used
to monitor applications and services in data centres, cloud environments, and
Kubernetes clusters.
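
Prometheus works by scraping a plain-text metrics endpoint over HTTP. Purely to illustrate that exposition format (real exporters use the official client libraries), here is a toy C exporter that answers every request on an arbitrarily chosen port 8000 with one hard-coded gauge; the metric name and value are invented.

    /* Toy metrics endpoint: serves one gauge in the Prometheus text
     * format. POSIX sockets; port and metric value are illustrative. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        int on = 1;
        setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8000);
        bind(srv, (struct sockaddr *)&addr, sizeof(addr));
        listen(srv, 8);

        for (;;) {
            int cli = accept(srv, NULL, NULL);
            char req[1024], body[256], resp[512];
            read(cli, req, sizeof(req));        /* discard the request */
            /* HELP/TYPE comments, then "name value" samples. */
            snprintf(body, sizeof(body),
                     "# HELP node_jobs_running Jobs on this node.\n"
                     "# TYPE node_jobs_running gauge\n"
                     "node_jobs_running 42\n");
            snprintf(resp, sizeof(resp),
                     "HTTP/1.1 200 OK\r\n"
                     "Content-Type: text/plain; version=0.0.4\r\n"
                     "Content-Length: %zu\r\n\r\n%s",
                     strlen(body), body);
            write(cli, resp, strlen(resp));
            close(cli);
        }
        return 0;
    }

Pointing a Prometheus scrape job at such an endpoint is all it takes for the metric to become queryable.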

Grafana

Grafana is a popular open-source monitoring platform used to visualise, analyse, and alert on metrics from systems and services. It supports a variety of data
sources, including Prometheus, and provides a rich set of features for building
dashboards and alerting on metrics. Grafana is used to monitor and alert on
system performance, application performance, and other metrics.

Grafana Loki

Loki is an open-source log aggregation system designed to be used in cloud-native environments. It handles logs from services such as Kubernetes, Prometheus, and Grafana, and provides a way to store, query, and analyse log
data. Loki provides powerful search capabilities, and can be used to troubleshoot
issues, monitor system performance, and more.

Canonical Observability Stack

The COS stack combines Prometheus, Grafana and Loki into a single deployable solution for cluster monitoring, giving a comprehensive overview of metrics.

Now that we have covered the different components of HPC clusters, you might
be wondering where to run them: on public or private clouds?

Where do you run HPC clusters?


HPC clusters can now be deployed almost anywhere. Thanks to ever-growing
technology improvements, running HPC clusters in the cloud has grown hugely in popularity in recent years. Some organisations even combine the options, running a private cost-optimised cluster and bursting into the cloud as needed,
taking advantage of hybrid cloud methodologies for clustered computing. With
growing compute power, some organisations are even able to run HPC clusters
at the edge.

HPC in public clouds

Many public cloud providers offer specialised resources with deep foundations
in the HPC space that are available for consumption to organisations of all sizes.
Cloud computing has made HPC possible for organisations that might require
bursting or scaling beyond what is reasonable with dedicated clusters. It’s now
also possible to run small experimental clusters aimed at those getting started
with HPC, who may not have the capacity to maintain the infrastructure required
for a private cluster. Resources for experimentation or testing, such as GPU, FPGA
or other architectures that might be in the early phase of adoption, are also
available. Let’s explore what different public cloud vendors offer in the area of
HPC.

Amazon Web Services

AWS has been one of the key players when it comes to driving innovation in
providing public cloud services for HPC. Their implementation of the AWS Nitro
System was key for them to eliminate virtualisation overhead and enable direct
access to underlying host hardware. This drove down latency and increased
performance, vital to running HPC clusters and workloads in a public cloud. In
order to be able to deliver on the demands of HPC workloads when it comes
to inter-node communication, they developed the Elastic Fabric Adapter, which
was key to reducing latency and increasing the performance for workloads that
communicate across nodes and require a high-performance interconnect.

To cover the storage needs of HPC users, Amazon added a specialised storage
offering based on Lustre, called Amazon FSx for Lustre. Alongside that, they have
scheduling solutions such as AWS ParallelCluster and AWS Batch.

Azure

Azure is a key player when it comes to driving HPC in the public cloud and has
provided strong instance types that use traditional HPC technologies such
as InfiniBand, which provides RDMA functionality for optimal latency and
performance. They also have instance types that reduce the number of cores exposed to the workload, suiting workloads primarily limited by memory bandwidth rather than available cores. They even have an
offering that delivers supercomputers as a service, their Cray solution. Along with
that, they offer HPC-focused storage with Cray ClusterStor.

Google Cloud Platform

Google Cloud Platform offers pre-configured HPC VMs. Their offerings also
consist of automation and scripting, making it easy to generate Terraform-based
scripts that handle the provisioning of a Google Cloud-based HPC environment.
Google makes it easy to spin up an environment that fits the user’s needs.
They also provide documentation that lets users replicate manually what the Terraform-based infrastructure automation offers, with clear guides covering everything from MPI workloads to HPC images. They give users clear and practical
information on how to get the most out of their usage of the cloud for HPC
workloads.

Oracle Cloud Infrastructure

Oracle was an early player when it came to the enablement of HPC in public
clouds. They take a bare metal approach to HPC in the public cloud, offering
instance types with ultra-low latency RDMA networking. The resulting solution is
close to what one might expect from a dedicated private HPC cluster.

Dedicated private HPC clusters
Private clusters are a solid option in HPC for those looking to optimise for cost and control, or to meet particular data ownership or security requirements. There
are solutions that give users cloud-like management capabilities for local on-
premise resources. The main challenge with private HPC clusters is the high
upfront investment and required expertise. This can be mitigated by working
with partners such as Canonical who give you access to expert knowledge and
solutions that make adoption more feasible. The foundations for such clusters
rely on cluster provisioning solutions such as MAAS, which we covered in the
cluster provisioning section above.

Hybrid HPC
Hybrid usage of private and public cloud-based resources has been very
popular in the HPC space. Hybrid clouds give users the best of both worlds:
the cost optimisation and control offered by on-premise servers, along with
the extreme scalability of public cloud clusters. In a way, hybrid clouds deliver
a complementary solution where the negatives of one get mitigated by the
positives of the other. The main additional challenge of such a setup is increased complexity, but it can bring greater overall resiliency. With solutions for both public and private clouds, Canonical can
help you simplify the increased complexity of the setup.

HPC at the edge


Many of the various HPC workloads, especially those that require real-time
processing or are extremely latency-sensitive, are now being deployed at the edge. They often run as small clusters or even as a single, very focused computer, often referred to as a high-performance computer (HPC).

Take your next steps in HPC with Canonical
Canonical can help you take the next steps in your HPC journey. Our solutions can
help you meet your HPC needs from the operating system layer to infrastructure
automation and more, across clouds and on-premise. Ubuntu is the ultimate
Linux distribution for high-performance computing. Some of the benefits Ubuntu
provides include:

• Recent kernel
• Extensive package repositories
• 2-year fixed release cadence for LTS releases with 5-year support
• Maintenance and bug fixes extendable to 10 years with Ubuntu Pro

This makes Ubuntu perfect for long-running environments: with Ubuntu Pro, you can make sure your environment is supported throughout its lifetime.

For on-premise deployments, you can level up your server provisioning process
using MAAS, trusted by a number of organisations that depend on on-premise
HPC clusters. Its highly available architecture makes MAAS fault-tolerant and
makes sure it can be deployed at scale. No matter the size of your cluster, you can
trust MAAS to provide provisioning capabilities and the ultimate cloud experience in bare metal cluster management, delivering performance and flexibility.

Juju, our solution for infrastructure automation, can help you get a SLURM-
based cluster up and running and ready for users, and thanks to Juju your day 2
operations are taken care of. Juju can be used across public cloud endpoints and
can be used for on-premise deployments with MAAS. To take advantage of cloud-
native deployments, consider Charmed Kubernetes from Canonical.

For more complex infrastructure deployments, we have Charmed OpenStack, which allows you to have your own cost-optimised cloud, delivering the most
value for performance with full sovereignty. Charmed Ceph, a solution for those
looking for a fault-tolerant storage solution, helps you build on open-source
software delivering either file-based shared storage or object storage.

You can use a combination of solutions for a proper hybrid cloud strategy, and run
your workloads depending on your needs. Whatever your requirements are, and
no matter the computation size, Canonical has the solutions for you.

Learn more at ubuntu.com/hpc.


Contact us, and we’ll help you map out your needs.

© Canonical Limited 2023. Ubuntu, Kubuntu, Canonical and their associated logos are the registered trademarks of Canonical Ltd. All other trademarks are the properties of their respective owners. Any information referred to in this document may change without notice and Canonical will not be held responsible for any such changes.

Canonical Limited, Registered in Isle of Man, Company number 110334C, Registered Office: 2nd Floor, Clarendon House, Victoria Street, Douglas IM1 2LN, Isle of Man, VAT Registration: GB 003 2322 47
