HPC - IBM Architecture
Tobias Elpelt
Rico Franke
Yanil Zeledón Miranda
Redpaper
International Technical Support Organization
April 2018
REDP-5478-00
Note: Before using this information and the product it supports, read the information in “Notices” on page v.
This edition applies to high performance computing and artificial intelligence networking running on IBM
Power Systems servers.
Notices
Trademarks
Preface
Authors
Now you can become a published author, too!
Comments welcome
Stay connected to IBM Redbooks
Related publications
IBM Redbooks
Online resources
Help from IBM
This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.
The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
GPFS™, IBM®, IBM Elastic Storage™, IBM Spectrum™, IBM Spectrum Archive™, IBM Spectrum Protect™, IBM Spectrum Scale™, OpenCAPI™, POWER®, Power Systems™, Redbooks®, and Redbooks (logo)®.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft and the Windows logo are trademarks of Microsoft Corporation in the United States, other
countries, or both.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its
affiliates.
Other company, product, or service names may be trademarks or service marks of others.
This paper helps you understand the basic requirements of designing a solution, the
components in an infrastructure for HPC and AI systems, the designing of interconnect and
data networks with use cases based in real life scenarios, and the administration and the
out-of-band (OOB) management networks.
This paper covers all the requirements, provides a good understanding of the technology, and
includes examples for small, medium, and large cluster environments. This paper is intended
for IT architects, system designers, data center planners, and system administrators who
must design or provide a solution for the infrastructure of an HPC cluster.
Authors
This paper was produced by a team of specialists from around the world working at the
International Technical Support Organization, Austin Center.
Tobias Elpelt is an IT Specialist for HPC in IBM Germany. He joined IBM in 2010 and has
implemented and operated several small and large HPC environments. He holds a degree in
Applied Computer Science. His background includes work with Linux and networks. He is
Mellanox Professional Certified.
Rico Franke works as an IT Specialist for large Linux and HPC environments in Germany. He has more than 15 years of experience in supporting IBM products, and provides account support in the context of open source solutions. He leads the IBM operational support team at the Leibniz Supercomputing Centre, which services the warm-water-cooled HPC cluster SuperMUC. He holds a degree in information technology engineering.
Yanil Zeledón Miranda is a Solutions Architect at IBM Costa Rica. He joined IBM in 2016.
His background includes network and data center architecture with over 15 years of
experience. He holds certifications for Cisco CCNA, CCDA, CCAI, CCNP, CCDP, CCIP,
Alcatel-Lucent AQPS and ACPS, Juniper JNSS, EMC SE, BlueCat SE, CMNA, and Certified
Trainer.
Thanks to the following people for their contributions to this project:
Henry Brandt, Klaus Gottschalk, Markus Hilger, Florin Manaila, Dino Quintero
IBM
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our papers to be as helpful as possible. Send us your comments about this paper or
other IBM Redbooks® publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an email to:
[email protected]
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
This chapter describes the conventional design requirements for any HPC cluster. With a
clear overview of these topics, you have a better understanding of why the interconnect
network is important in the cluster and how each part fits in the solution.
A conventional design requires many adjustments to its components until the technical
requirements are met. Integration with other teams is essential to reach such criteria and
integration. You must have a common understanding of the design and the viability of the
solution.
However, analytics and cognitive computing require balanced, data-centric HPC systems.
The emphasis on the capabilities of memory, network, and I/O performance relative to
processor performance increases from descriptive to predictive to prescriptive to learning.
For more information, see the case study at HPC and HPDA for the Cognitive Journey with
OpenPOWER.
The performance of the fabric affects the performance of the entire solution because the fabric forms the backbone of the cluster.
Scale out: You add more nodes and switches to the infrastructure, which is limited by the number of ports that are available in the network topology. Scaling out leads to an increase in latency and a requirement for more ports.
Figure 1-3 shows a scale-out solution.
1.3 Simplicity
A solution should be simple to deploy, which shows a clear understanding of the technical
requirements, and simple to operate and integrate with the tools that are used to manage it.
Consider the simplicity of the toaster. It has not changed much in the past 100 years, is easy
to deploy, easy to operate, and integrates well into most kitchens. Over time, the user
demands have not changed the basic requirements.
Imagine a cluster with 648 nodes on a fully non-blocking topology that uses a basic
description. There are two general approaches to creating a system:
Approach 1 Use 54 one-rack unit (RU) fixed switches with 36 ports each in a two-layer fat-tree topology. You need a total of 648 cables between the switches for a non-blocking solution.
Approach 2 Use one 648-port director switch (for example, a Mellanox InfiniBand
SX6536).
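The switch and cable counts in approach 1 can be reproduced with a short calculation. The following sketch (a hypothetical helper, not from this paper) sizes a fully non-blocking two-layer fat tree built from fixed-port switches, where half of each leaf switch's ports face the nodes and the other half are uplinks:

```python
def two_layer_fat_tree(nodes, radix):
    """Size a fully non-blocking two-layer fat tree of fixed switches."""
    down = radix // 2            # node-facing ports per leaf switch
    leaves = -(-nodes // down)   # ceiling division: leaf switches needed
    uplinks = leaves * down      # one uplink per node-facing port (non-blocking)
    spines = uplinks // radix    # spine switches needed to terminate all uplinks
    return {
        "leaf_switches": leaves,
        "spine_switches": spines,
        "total_switches": leaves + spines,
        "inter_switch_cables": uplinks,
    }

# The 648-node example with 36-port switches:
print(two_layer_fat_tree(648, 36))
# 36 leaves + 18 spines = 54 switches, with 648 inter-switch cables
```

This reproduces the numbers in the text: 54 one-RU switches and 648 cables between the two layers.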
High-performance applications can run for hours, days, weeks, or months. If there is a failure,
such as a transmission failure (bit error), the software retransmits the missing information. If
there is a hardware failure (for example, bad node, broken cable, or broken switch), the
process stops, no retransmission is possible, and the entire data set might be lost.
There are many factors that can affect the availability of the network, such as the mean time between failures (MTBF), which is a measure of how reliable the hardware is. An HPC network interconnects many nodes through many switches, and the MTBF of the whole system decreases as the amount of equipment increases: every added component is another potential point of failure. Again, redundancy is needed, and in this case, failover.
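The inverse relationship can be made concrete with a simple series-failure model (an assumption for illustration, not a formula from this paper): if every component must work, the system failure rate is the sum of the component failure rates.

```python
def system_mtbf_hours(component_mtbfs):
    """MTBF of a system whose components all must work (series model).

    Assumes independent components with constant failure rates, so the
    system failure rate is the sum of the component failure rates.
    """
    return 1.0 / sum(1.0 / m for m in component_mtbfs)

# Hypothetical numbers: 648 nodes at 100,000 h MTBF, 54 switches at 200,000 h.
mtbf = system_mtbf_hours([100_000] * 648 + [200_000] * 54)
print(round(mtbf, 1))  # roughly 148 hours for the cluster as a whole
```

Even with very reliable individual components, the aggregate MTBF of a large cluster drops to days, which is why failover and redundancy matter.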
Defining a good redundancy strategy when you design the solution impacts the design and
purchase costs. Redundancy has benefits such as avoiding a single point of failure and an
increase in bandwidth. A balance between redundancy and cost efficiency can be achieved if
important switches are redundant and not the entire cluster, or by connecting nodes by using
two cables instead of one.
When you design a solution, it should be simple, cost-effective, and redundant where possible
(for example, on the top of the fat-tree topology). Consider factors such as MTBF for reliability,
and be proactive by using tools for monitoring and troubleshooting. Tools, such as the Open
Fabrics Enterprise Distribution (OFED) package, are described later in this paper.
To calculate the energy efficiency of a system, the primary metric is performance per watt, which for this discussion is FLOPS per watt: the rate of processing that a node delivers for every watt of power that is consumed. Systems have lower consumption when idle and higher consumption when they reach peak performance. Power consumption directly affects the cooling system because lower power consumption means that less cooling is required.
Cables affect power consumption because the increase in distance requires more power.
There are two types of cables:
Passive cables Consume less power because they do not have electronic
transceivers.
Active cables Consume more power because they require different types of
transceivers to operate.
On the out-of-band (OOB) management network (the dedicated infrastructure for managing
all the devices that are connected to the HPC system), you can use Gigabit Ethernet
Switches that use the Energy-Efficient Ethernet (EEE) standard (IEEE 802.3az), which
enables less power consumption during periods of low data activity. When the node decides
that no data needs to be sent, it sends low-power idle (LPI) messages to the switch. Then, the
node and switch periodically exchange messages to maintain the circuit as active even when
the transmit path is in sleep mode.
Your design focus is always on performance, even if there are some savings in power consumption.
1.6 Congestion
The performance of the HPC cluster is limited by the interconnect fabric when the fabric becomes saturated, which significantly degrades performance across the HPC cluster. In larger fabrics, the degradation is reflected in the workload not reaching its expected performance. This situation might be an indication of a congestion problem.
This chapter then provides a description of the interconnect that acts as the backbone of the
system. There are two basic technologies that can be used to provide the backbone
interconnect: InfiniBand and Ethernet with RDMA over Converged Ethernet (RoCE). For
management and out-of-band (OOB) management, the Ethernet network is important too.
The last topic is an overview of the applications and libraries that are part of the solution.
Power Systems servers are deployed in many of the largest HPC clusters in the world.
Configured into highly scalable Linux clusters, IBM Power Systems servers offer extreme
performance for demanding workloads, such as genomics, finance, computational chemistry,
oil and gas exploration, and high-performance data analytics. An HPC cluster is a
combination of high-performance compute nodes, a low-latency interconnect fabric with high
bandwidth, high-performance parallel storage, and system software, which addresses the
most challenging requirements for HPC and high-performance data analytics.
Although HPC and big data analytics are converging, traditional HPC clusters were built for another era, meaning that they are designed for data throughput or computation throughput, but not both. These clusters cannot deliver adequate performance and scalability, and I/O bottlenecks and network latency when moving large data sets slow down real-time insights. IBM HPC clusters deliver more when they are built with Power Systems because they can easily handle demanding workloads and high-performance data analytics.
The IBM and NVIDIA partnership was announced in November 2013 for integrating IBM
POWER® systems with NVIDIA GPUs and the enablement of GPU-accelerated applications
and workloads. The goal of this partnership is to deliver higher performance and better
energy efficiency to companies and data centers.
The computational capability that is provided by the combination of NVIDIA Tesla GPUs and IBM Power Systems servers enables scientific, technical, and HPC workloads to run on data center hardware. (In most cases, these workloads run on supercomputing hardware.)
This computational capability is built on top of massively parallel and multithreaded cores with
NVIDIA Tesla GPUs and IBM POWER architecture processors, where processor-intensive
operations are offloaded to GPUs and coupled with the system’s high memory-hierarchy
bandwidth and I/O throughput.
In summary, IBM Power Systems servers with NVIDIA GPUs provide a computational
powerhouse for running applications and workloads from several scientific domains, and for
processing massive amounts of data. This data is sent across the network and shared with
the HPC cluster.
This section describes the different parts and interconnects from the NVIDIA Tesla GPU
through IBM Power Systems servers.
2.1.2 NVLINK
NVLINK is the NVIDIA advanced interconnect technology for GPU-accelerated computing. It
increases performance for both GPU-to-GPU communications and GPU access to system
memory.
PCIe uses a serial interface and enables point-to-point interconnections between devices by
using a directly wired interface between these connection points. A single PCIe serial link is a
dual-simplex connection that uses two pairs of wires, one pair for transmit and one pair for
receive, and can transmit only 1 bit per cycle. These two pairs of wires are called a lane. A
PCIe link uses multiple lanes.
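The way lanes aggregate into link bandwidth can be illustrated with a short calculation. The generation figures below (8 GT/s per lane and 128b/130b encoding for PCIe Gen3) are standard published values, not numbers taken from this paper:

```python
def pcie_bandwidth_gbps(lanes, gts_per_lane, payload_bits, encoded_bits):
    """Usable one-direction bandwidth of a PCIe link in Gbps.

    Each lane carries gts_per_lane gigatransfers per second; the encoding
    ratio removes the framing overhead to give the payload rate.
    """
    return lanes * gts_per_lane * payload_bits / encoded_bits

# PCIe Gen3 x16: 8 GT/s per lane with 128b/130b encoding.
print(round(pcie_bandwidth_gbps(16, 8.0, 128, 130), 1))  # ~126.0 Gbps
```

Because each lane is a dual-simplex pair of pairs, the same bandwidth is available simultaneously in the other direction.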
The benefits of using CAPI include the ability to access shared memory blocks directly from
the adapter, perform memory transfers directly between the adapter and processor cache,
and reduce the code path length between the adapter and the processors. This reduction in
the path length might occur because the adapter is not operating as a traditional I/O device,
and there is no device driver layer to perform processing. CAPI also presents a simpler
programming model.
Although CAPI is part of IBM POWER processors and is IBM intellectual property (the
Processor Service Layer (PSL)), several industry solutions benefit from the mechanism of
connecting different devices to the processor with low latency, including memory attachment.
The PCIe standard is pervasive in processor technology, but its design characteristics and
latency do not allow the attachment of memory for load/store operations.
Therefore, the IBM OpenCAPI™ Consortium was created, with the goal of defining a device
attachment interface to open the CAPI interface to other hardware developers and extending
its capabilities. OpenCAPI aims to allow memory, accelerators, network, storage, and other
devices to be connected to the processor through a high-bandwidth, low-latency interface to
become the interface of choice for connecting high-performance devices.
The design of OpenCAPI enables low latency when accessing attached devices (nearly in the
same range of system memory access), which enables memory to be connected through
OpenCAPI and serve as main memory for load/store operations. In contrast, PCIe latency is roughly 10 times higher. Therefore, OpenCAPI is a significant enhancement compared to traditional PCIe interconnects.
2.2 InfiniBand
InfiniBand is an open set of interconnect standards and specifications. The main InfiniBand
specification is published by the InfiniBand Trade Association (IBTA).
The InfiniBand Architecture (IBA) is an industry-standard architecture for server I/O and
inter-server communication. It was developed by the IBTA to provide the level of reliability,
availability, performance, and scalability that is necessary for present and future server
systems with levels better than can be achieved by using bus-oriented I/O structures.
InfiniBand is a solution that ranges from the hardware to the application layer. Its software stack is developed by the OpenFabrics Alliance. It is an open industry-standard specification and is independent of the host operating system and hardware platform. Here are some of its benefits:
Low latency
Simplified management
High bandwidth
Quality of service (QoS) enabled
Scalable
Supports CPU offloading
Remote Direct Memory Access (RDMA)
Lossless link level flow control
2.2.1 Technology
Performance is one of the important metrics of the InfiniBand technology, and it depends on
bandwidth and latency. The numbers that are used in this section are examples.
The bandwidth of an InfiniBand connection is measured in bits per second and is known as the link rate, which is calculated by the following equation:

LinkRate = LinkSpeed × LinkWidth × EncodingFactor
The link speed of InfiniBand has increased over the years with each new generation of the
technology that is released. A summary of this progress is shown in Table 2-1.
The link width of an InfiniBand connection is the number of lanes inside a cable, where each lane uses two wires for send and two wires for receive. The number of possible lanes varies (1, 4, or 12), but currently a width of 4 is used in production.
The encoding factor describes the relationship between the size of the whole InfiniBand package and the actual size of the payload data. It is calculated by the following equation:

EncodingFactor = PayloadSize / PackageSize

For example, if you use the InfiniBand technology generation FDR10 (2011), the encoding factor is calculated as follows:

EncodingFactor = 64 bits / 66 bits = 0.9697
The result is close to 0.97, and in some scenarios the encoding factor can be disregarded for performance calculations. In newer technology generations, such as EDR, the link speed is increased so that the link rate remains the same.
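The link-rate calculation (link speed × link width × encoding factor) can be checked numerically. The per-lane signaling rates below are the standard published values for these generations and are included only to illustrate the calculation:

```python
def link_rate_gbps(lane_speed_gbps, link_width, payload_bits, package_bits):
    """Effective InfiniBand link rate: speed x width x encoding factor."""
    return lane_speed_gbps * link_width * payload_bits / package_bits

# FDR10: 10.3125 Gbps per lane, width 4, 64b/66b encoding.
print(link_rate_gbps(10.3125, 4, 64, 66))   # 40.0 Gbps
# EDR: 25.78125 Gbps per lane, width 4, 64b/66b encoding.
print(link_rate_gbps(25.78125, 4, 64, 66))  # 100.0 Gbps
```

Note how the per-lane signaling rate is chosen so that, after the 64/66 encoding overhead, the effective link rate lands on a round number.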
Table 2-2 provides the available and future link rates for InfiniBand connections.
Starting with FDR10, there is an increase in the link rate in relation to the link speed. Another thing to mention is HDR100: it is an HDR link with a width of 4 that is divided, by a Y-cable, into two links with a width of 2 each.
Latency is the time that it takes for a packet to get from source to destination. It is also
improved from each generation to the next. Table 2-3 shows the latency decreasing over the
years with each new technology.
Eventually, latency cannot be reduced much further because signals cannot travel faster than light, and the length of the cable is at least one of the limiting factors.
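The speed-of-light floor is easy to quantify. Assuming signals propagate at roughly two-thirds of c (a typical figure for copper and optical media, not a value stated in this paper), every meter of cable adds about 5 ns of one-way delay:

```python
C_VACUUM = 299_792_458  # speed of light in vacuum, m/s

def propagation_delay_ns(cable_length_m, velocity_factor=0.66):
    """One-way propagation delay of a cable in nanoseconds."""
    return cable_length_m / (C_VACUUM * velocity_factor) * 1e9

# A 100 m active optical cable adds about half a microsecond one way:
print(round(propagation_delay_ns(100)))  # ~505 ns
```

For comparison, that cable-length delay alone is comparable to the switch and adapter latencies of modern InfiniBand generations, which is why cable length matters at scale.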
2.2.2 Topology
There are several different types of topologies that you can use as part of a network, but there
are only a few that are useful. For common workloads, it is important that the topology is
homogeneous and balanced. In this section, the following five topologies are mentioned:
Fat-tree
Dragonfly
Mesh (two, three, and so on dimensions)
Torus (two, three, and so on dimensions)
Hypercube
In a two-level fat-tree topology, each switch in level L-1 is connected to each switch in level L-2, which makes the network homogeneous. This topology provides a fully non-blocking network if the uplinks have the same throughput as the links that are going to the nodes.
Figure 2-3 shows a three-level fat-tree topology that is a hierarchical architecture. The nodes
on the L-1 are divided into two groups. A group is defined as an island.
Up/Down (UpDn) and fat-tree are the most commonly used InfiniBand routing algorithms. They provide routing that adapts the load to the needs of the fabric. UpDn uses the shortest path that is available between access switches. All switches in the fabric participate, and loop-free routing is ensured.
With the UpDn routing algorithm, the latency does not increase in the L-2 subtree; if L-3 is
needed to reach the destination, an extra hop increases the latency.
Sometimes, this configuration is not possible, and the blocking factor changes, which leads to over-subscription. Over-subscription is acceptable when the applications on the cluster tolerate it or when the technical requirements can still be met. Normally, inside an island, which is a grouping of nodes that serve a purpose or are in the same rack or line of racks, a fat-tree topology is always non-blocking. When you interconnect the islands, over-subscription comes in different ratios, for example, 1:2, 1:3, or 1:4. Using the previous example of the 40-port 1U access switch, this means that for every 30 ports coming from the nodes to the access switches, only 10 go to the spine switches.
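The blocking factor in that example can be expressed as a reduced ratio of node-facing ports to uplinks (a small hypothetical helper for illustration):

```python
from math import gcd

def oversubscription_ratio(node_ports, uplink_ports):
    """Reduce node-facing ports : uplink ports to the blocking factor."""
    g = gcd(node_ports, uplink_ports)
    return node_ports // g, uplink_ports // g

# 40-port access switch with 30 node-facing ports and 10 uplinks:
print(oversubscription_ratio(30, 10))  # (3, 1), that is, 3:1 blocking
```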
Dragonfly
The Dragonfly topology is almost the same as the fat-tree topology, but it differs at the higher
level. Instead of adding another row of switches, there are only cable connections.
If you have more than three top-level switches in the Dragonfly topology, it is not fully non-blocking because the number of uplinks to a specific subtree is smaller than the number of links that are going down.
A multidimensional mesh is possible, but each added dimension requires two more links per switch. This topology makes sense only for certain appropriate applications, and it is not recommended for larger scales because the number of hops increases.
Switches
Switches are connected inside a network to forward packets from one node to another node. InfiniBand and Ethernet switches are based on application-specific integrated circuit (ASIC) technology. Depending on the generation of the technology, each ASIC has a specific number of connection ports with the same link rate. There are no special uplink ports with a different link speed, as there are on Ethernet. There are two different types of switches that are available for InfiniBand:
Edge switches Edge switches are small devices, for example, one unit of a standard rack. The internal topology of a switch is non-blocking with full bidirectional bandwidth. There are unmanaged and managed switches.
Director switches Director switches are non-blocking modular chassis that consist of several of the same ASICs that are used in edge switches. They offer high port density with high bandwidth.
Cables
There are two different types of cables for each generation of the technology.
Passive copper cables Passive copper cables use Quad Small Form-factor Pluggable (QSFP)-like connectors and support lengths of 2 - 7 meters, depending on the link rate.
Active optical cables Active optical cables use QSFP connectors and support lengths of 100 - 300 meters, depending on the link rate.
2.2.4 Tools
Several versions of the InfiniBand hardware and software stack have been made available.
The need for a standardized stack led to the creation of the OpenIB Alliance, now known as
the OpenFabrics Alliance, which created the Open Fabrics Enterprise Distribution (OFED). It
is the most common software distribution for InfiniBand. The OFED stack also includes
several higher-level protocols that are not part of the initial IBTA specification.
The following list describes the most common command-line tools of the OFED.
ibstat Shows information about the local InfiniBand interface.
ibstatus Almost the same as ibstat, but with other metrics, for example, the
LID is shown in hexadecimal.
ibping Like the ping command, but must be run on both the server and client.
On the server, run ibping -S, and on the client, run ibping -L <LID>.
ibtracert Traces the path between 2 ports that are identified by their LIDs.
The administration network is not limited to management if it has the correct level of security.
If so, you can use it for other operations, such as wide area network (WAN) access, bridging
to the production network, and other functions.
Most of the tools operate in this network and can give you information about problems in
nodes. Through this network, you access the cluster for troubleshooting. If one of the switches fails, you can recover it through the management network, unless the failure is caused by a hardware problem.
Regarding the speeds that are supported for the fabric, many vendors offer 100 Gbps switches at a lower price than other proprietary solutions, which makes Ethernet an attractive option because it works in almost any environment. For example, on storage solutions, there is a transition from Fibre Channel-only networks to Fibre Channel over Ethernet (FCoE) solutions that converge the infrastructure.
Inside an HPC system that uses Ethernet for the fabric, the most common protocol for
communication is RDMA, which provides direct memory access from the memory of one
node to the memory of another node without going through either the CPU or the kernel layer
stack. The CPU does not become as involved as it is with common network workloads, and can be used for other tasks. Latency in the communication is low, and performance is improved.
RoCE was developed so that nodes can communicate on an Ethernet fabric. There are two
versions:
RoCEv1 is a link layer protocol (L2) that has the limitation that the protocol works only on
the same broadcast domain.
RoCEv2 has more focus on Layer 3 (L3) because it runs on UDP port 4791 for IPv4 and
IPv6, which means that it can work on any low-latency, high-bandwidth switch and be
routable across the network. RoCEv2 is not recommended if there are no low-latency
routers.
2.4 Storage
Shared storage architectures and their performance rely on the interconnect network. The
following storage solutions are commonly used for HPC clusters that use IBM Power Systems
servers.
It provides high availability through advanced clustering technologies, dynamic file system
management, and data replication. If there are server or cluster malfunctions, it can continue
to provide data access. Its scalability and performance are designed for data-intensive
applications, such as data mining, data analytics, seismic data processing, scientific
research, and scalable technical computing.
IBM Elastic Storage Server supports high-speed Ethernet and InfiniBand I/O adapters to
interconnect to the HPC fabric.
2.5 Applications
HPC and AI applications rely on the complete infrastructure. The interconnect network is a critical performance factor because of the massively parallel input/output transactions that these applications generate. Although bandwidth is the most commonly cited metric, latency is in most use cases the critical part. The simplicity of the programming model is also worth describing. This section points out two different application basics for HPC and AI.
A non-optimized environment can hinder competitiveness and slow time to results. Rather
than have the application handle architectural differences in your infrastructure, IBM
Spectrum MPI manages them. It eliminates the need to write multiple versions of the
application to account for different interconnects. It also optimizes application performance by
improving collective algorithms.
The software features an Open MPI implementation for HPC parallel applications with
improved performance and scalability. It is supported on IBM Power Systems and brings a
collective MPI library and point-to-point communications protocol (Parallel Active Messaging
Interface (PAMI)) back end that provides improved network connectivity and enhancements to
application developer usability. It maximizes network efficiency by dynamically selecting the
optimal network connection between each node at run time.
It also delivers an improved, RDMA-capable PAMI by using InfiniBand with OFED on IBM
POWER hardware. It also offers a superior collective library that supports the seamless use
of GPU memory buffers for the application developer. The library features advanced logic to
determine the fastest algorithm for any given collective operation.
The IBM PowerAI solution is optimized for performance by using the NVLink-based IBM
POWER servers for HPC. The stack also comes with supporting libraries, such as DIGITS,
OpenBLAS, Bazel, and NCCL.
In the fourth quarter of 2016, IBM announced a revamp of IBM PowerAI, seeking to address
some of the bigger challenges facing developers and data scientists. The goals were to
reduce the time that is required for AI system training, to make installing and running an
enterprise-ready software distribution for deep learning and AI a snap, and to simplify the
development experience. The idea behind IBM PowerAI is to embrace and extend the open
source community, embrace and extend the capability and creativity that is happening there,
and add IBM unique capabilities. This revamp manifests itself in a number of value
differentiators for AI applications that IBM Power Systems brings to the table.
The inherent challenge with distributed deep learning (DDL) systems is that as the number of
learners increases, the amount of computation per learner decreases while the amount of
communication remains constant, which results in an unfavorable communication-to-computation ratio.
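This scaling effect can be sketched with a few lines of Python. The numbers are illustrative assumptions (a fixed 350 MB gradient exchange per learner and an arbitrary fixed global workload), not PowerAI measurements:

```python
# Illustrative sketch: splitting a fixed workload across more learners
# shrinks the computation per learner, while each learner still exchanges
# the same gradient volume per step, so the ratio worsens.
GRADIENT_MB = 350            # per-learner communication per step (constant)
TOTAL_COMPUTE_UNITS = 1024   # arbitrary units of per-step computation

for learners in (1, 4, 16, 64, 256):
    compute_per_learner = TOTAL_COMPUTE_UNITS / learners
    ratio = GRADIENT_MB / compute_per_learner   # communication per unit of compute
    print(f"{learners:4d} learners: "
          f"compute/learner = {compute_per_learner:7.1f}, "
          f"comm/compute ratio = {ratio:6.2f}")
```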
To mitigate the impact of this scaling problem, IBM PowerAI DDL uses an innovative multi-ring
communication algorithm that balances the communication latency and communication
impact. This algorithm adapts to the hierarchy of communication bandwidths, including
intranode, internode, and interrack, within any system. This implementation enables IBM
PowerAI DDL to deliver the optimal DDL solution for a given environment.
The current implementation of IBM PowerAI DDL is based on IBM Spectrum MPI because
IBM Spectrum MPI itself provides many of the required functions, such as scheduling
processes and communication primitives in a portable, efficient, and mature software
infrastructure. IBM Spectrum MPI specifically provides functions and optimizations to IBM
Power Systems and InfiniBand network technology.
In a 256 GPU environment, ~90 GB of data must be transmitted to perform a simple reduction
operation, and the same amount of data must be transmitted to copy the result to all the
GPUs. Even with a fast network connection that has a 10 GBps transfer rate, bringing this data to a
single parameter server can take 9 seconds. IBM PowerAI DDL performs the entire reduction
and distribution of 350 MB in less than 100 ms by using a communication scheme that is
optimized for the specific network topology.
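The arithmetic behind these figures can be checked quickly. A hedged assumption here is that the ~90 GB total corresponds to 256 GPUs contributing roughly 350 MB of gradient data each:

```python
# Back-of-the-envelope check of the numbers quoted in the text.
gpus = 256
gradient_gb = 0.350      # assumed ~350 MB of gradient data per GPU
link_gbps = 10           # 10 GBps link to a single parameter server

total_gb = gpus * gradient_gb    # ~89.6 GB, the "~90 GB" in the text
seconds = total_gb / link_gbps   # ~9 s for the naive reduction alone
print(f"total = {total_gb:.1f} GB, naive transfer = {seconds:.1f} s")
```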
A key differentiator for IBM PowerAI DDL is its ability to optimize the communication scheme
based on the bandwidth of each available link, topology, and latency for each segment. In a
heterogeneous environment, with multiple link speeds, IBM PowerAI DDL adjusts its
dimensions to use the fastest available link, thus preventing the slowest link from dominating
the run time.
Using IBM PowerAI DDL on a cluster of 64 IBM S822LC for High Performance Computing
(HPC) systems, we demonstrate a validation accuracy of 33.8% for Resnet-101 on Imagenet
22k in ~7 hours. For comparison, Microsoft ADAM and Google DistBelief were not able to
reach 30% validation accuracy for this data set.
For more information about IBM PowerAI, see IBM PowerAI: Deep Learning Unleashed on
IBM Power Systems Servers, SG24-8409.
This chapter shows examples of HPC systems in three different sizes to illustrate the process
of their design and scaling. The examples describe solutions for small, medium, and large
systems.
Technical details about the nodes are not part of the description, although the necessary
calculations were done to confirm that this is a viable solution.
Initially, a single InfiniBand switch is used to interconnect the nodes and create a small
design. This first stage includes 10 nodes that are interconnected through a 40-port 1U
InfiniBand switch.
Architecture overview
The design includes the following hardware components:
Two management nodes
Six compute nodes
One login node
One storage node
One InfiniBand 40-port 1U switch
One Ethernet switch for management of the components
One Ethernet switch for the storage network
The overview shows that all compute nodes of the HPC/AI cluster are connected to the
InfiniBand switch, that the storage connection uses Ethernet for data access, and that the
management network is used for system administration and out-of-band (OOB) access.
[Figure: architecture overview. The login, compute, and management nodes connect to the 40-port InfiniBand switch and the 1 Gb Ethernet management network; the storage node attaches through a 10 Gb Ethernet switch.]
In this example, a 10 Gb Ethernet switch connects the storage node. An advantage is that an
external storage system can be attached over that network to the cluster. This example
demonstrates an option of how an existing enterprise storage solution can be combined with
an InfiniBand-based architecture.
[Figure: network diagram. The login, management, and storage nodes attach to the 1 & 10 Gb Ethernet switch.]
The architectural overview and the network diagram illustrate how the different system
connections are established.
[Figure: rack layout. The 40-port InfiniBand switch and the Ethernet switches sit in the middle of the rack, with the login node and compute nodes around them.]
In this example, only 21 RU are used in the rack. The spaces between the nodes and
switches are included in the figure for better visualization of the components.
More servers can be placed into the available rack space if there are no environmental
limitations in terms of weight, power, and cooling. The network and InfiniBand switches are
placed in the middle of the rack to minimize the cable length between the components within
the rack. If one of these constraints cannot be met, an extra rack is required, and the cable
routing between the racks must be planned carefully to optimize the cable length and
corresponding costs. Depending on the characteristics of the data center, cables can be
managed above the rack in a Top Of Rack (ToR) tray, below the rack through a raised floor,
or straight through the middle of the rack side wall if the mechanical design supports such a
cable passage.
The next threshold in terms of growth is the number of switch ports. Typically, an InfiniBand
switch has 36 or 40 (High Data Rate (HDR) and beyond) ports. After these ports are all used,
a new topology must be established. In our example, a non-blocking fat-tree topology is used,
which means that a leaf switch must provide as many uplink ports as compute node ports.
So, only up to 20 ports per leaf switch are available for compute nodes. At least three leaf
switches are required to increase the number of ports compared to a single switch. In
addition, two spine switches are necessary to interconnect the leaf switches.
Figure 3-4 illustrates a fat-tree fabric of five switches. That fabric provides 60 free ports on the
lowermost leaf switches. To extend that fabric, a fourth leaf switch can be added to the 20
unused ports of the uppermost spine switches, which increases the number of compute ports
to 80. Further extensions require more spine and leaf switches.
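The port counts in this example follow from a simple rule: in a non-blocking fat tree, half of each leaf switch's ports face the compute nodes. A minimal sketch, assuming 40-port switches:

```python
# Non-blocking fat-tree sizing: each leaf dedicates half of its ports to
# compute nodes and the other half to spine uplinks.
def compute_ports(leaf_switches, ports_per_switch=40):
    downlinks_per_leaf = ports_per_switch // 2   # 20 node-facing ports per leaf
    return leaf_switches * downlinks_per_leaf

print(compute_ports(3))   # three leaves: 60 free compute ports
print(compute_ports(4))   # adding a fourth leaf raises that to 80
```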
[Figure 3-4: a fat-tree fabric of five 40-port InfiniBand switches (two spines and three leaves).]
This example shows that growing an HPC solution is not a linear effort, so you must be aware
of the thresholds where extra costs appear.
In stage two of our small solution, we reengineered the network to support the new
requirements. After we resized the new solution, we implemented the architecture that is
described in “Architecture overview”.
Architecture overview
The enhanced design includes the following hardware components:
Two management nodes for HA
Fifty compute nodes
Two login nodes
Two storage nodes
[Figure: architecture overview of the enhanced design. The 50 compute nodes, login nodes, storage nodes, and management nodes connect to the InfiniBand network and the 1 Gb Ethernet management network.]
The overview illustrates that all compute nodes are connected to the InfiniBand switches for
the HPC/AI systems cluster and that the storage nodes are now using InfiniBand for internal
cluster communication. To remove the costs of an extra 10-Gigabit Ethernet network, external
storage is connected to the storage nodes only through Ethernet. InfiniBand now covers the
application and storage traffic. With the increased number of compute nodes, the demand on
storage performance increases. To support storage bandwidth and latency needs, a second
storage node is added.
The management network is still in place for compute administration and OOB administration.
[Figure: rack layout of the enhanced design. The InfiniBand switches and the 1 Gb Ethernet switch sit between the compute, management, and storage nodes.]
With the highly increased number of nodes, it becomes feasible to create a structure of
multiple node groups, which are called islands. The islands concept enables subdivided
management and maintenance because each island can be handled as its own entity. From an
application perspective, the islands behave as one large cluster, but each island can support
special needs, such as more memory or acceleration hardware. If multiple software
environments are needed, the solution can support different operating systems, scientific
libraries, or software levels. A job scheduler should be able to take island boundaries into
account for optimized job placement, and use the islands to offer environments for long- and
short-running jobs, large memory, accelerators, or other special purposes.
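As a minimal sketch of island-aware placement (the island names and free-node counts are hypothetical, and a real workload manager does far more), a job is kept inside one island whenever it fits, so its traffic stays on the fast intra-island links:

```python
# Hypothetical island inventory: free compute nodes per island.
islands = {"island-a": 600, "island-b": 600, "island-c": 600}

def place(job_nodes):
    """Return the first island that can hold the whole job, else None
    (None means the job would have to span island boundaries)."""
    for name, free in islands.items():
        if free >= job_nodes:
            return name
    return None

print(place(512))    # fits inside a single island
print(place(1000))   # larger than any island, must span boundaries
```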
To plan the InfiniBand topology and determine the optimal size of the islands, you must review
the following application requirements:
Number of nodes running one parallel application
Interconnect bandwidth requirements of the applications
Variation of hardware requirements (memory size, CPU/core counts, and accelerators)
In addition, even full-cluster jobs often do not require the full interconnect bandwidth because
they are constrained by other factors, such as the CPU/GPU floating point operations per
second (FLOPS) or memory bandwidth.
A fully non-blocking fat-tree topology greater than a specific size can be replaced by a director
switch. Figure 3-8 shows a fat-tree non-blocking InfiniBand Architecture (IBA) that is
equivalent to a director switch.
[Figure 3-8: a two-level non-blocking fat tree of 40-port switches (20 spine and 40 leaf switches) that is equivalent to a director switch.]
In this example, there is an InfiniBand network with 800 access ports. With a classic topology,
60 ToR switches and 800 switch-to-switch cables in between are required. In a director
switch, these switch-to-switch cables are replaced by a hardwired midplane in the middle of
the chassis. Furthermore, a switch is composed of four different module types:
Leaf-switching module This module is in the front of the chassis and contains the
access ports. Half of the ports are externally connected, and
the other half are internally connected.
Spine-switching module This module is in the back of the chassis and contains only
internally connected ports. There are only half as many spine
modules as leaf modules in the chassis.
Management module This module is responsible for the management of the
switching modules in the chassis. A high availability capability
creates redundancy, so there are usually two of them.
Power supply unit This module provides electrical power to the chassis. For
redundancy, there are several of them in the
chassis.
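The claim of 60 ToR switches and 800 switch-to-switch cables for 800 access ports can be reproduced with the same half-down, half-up rule as before, assuming 40-port switches in a two-level non-blocking fat tree:

```python
# Two-level non-blocking fat tree with 800 access ports, built from
# 40-port switches: derive the switch and cable counts quoted in the text.
ports_per_switch = 40
access_ports = 800

downlinks_per_leaf = ports_per_switch // 2           # 20 access ports per leaf
leaf_switches = access_ports // downlinks_per_leaf   # 40 leaf switches
uplink_cables = leaf_switches * downlinks_per_leaf   # 800 switch-to-switch cables
spine_switches = uplink_cables // ports_per_switch   # 20 spine switches

print(leaf_switches + spine_switches, uplink_cables)  # 60 switches, 800 cables
```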
As a supplement to the nodes and the interconnect, there is an Ethernet network for OOB
hardware management and administration purposes.
Designing an interconnect network for such an infrastructure is challenging. In this paper, the
design is a simplified representation, and further customer-specific characteristics are not
considered.
[Figure: large solution overview. Ten login nodes, 10 service nodes, and 10 storage nodes attach through InfiniBand switches to an InfiniBand network of 20 switches.]
So, each island has 600 nodes that are managed by two service nodes. The management
infrastructure is hierarchically organized, with two service nodes acting as the master pair
and two subordinate service nodes responsible for each island. The nodes are set up in pairs
for high availability. The hierarchical concept is necessary to manage the cluster efficiently
and scale the service infrastructure in the same way as the compute resources.
Network diagram
There are two approaches to building a fat-tree topology:
Use three hierarchy levels and group the nodes into islands. ToR switches are placed at
the upper level of the fat-tree topology and director-size switches at the lower level. This
approach limits the node count of a single island because the size of the director switch
and the blocking factor determine the maximum number of free ports and nodes.
Place director switches at the upper level and ToR switches at the lower level to create
larger island sizes, as described in 3.1.4, “Large solution upgrade” on page 43.
[Figure: network diagram of the large solution. The storage nodes and the login and service nodes attach to the 20 InfiniBand switches of the fabric.]
Architecture overview
The HPC/AI system is extended by 1200 more nodes in one island. To support the
increased number of compute resources, the service and storage infrastructure must be
expanded as well. Therefore, the showcase doubles the number of login, service, and storage
nodes. In a real scenario, these numbers must be validated by performance measurements.
[Figure: architecture overview of the upgraded large solution. Ten login nodes, 10 service nodes, and 10 storage nodes attach to the fabric; the InfiniBand network now comprises 28 switches and two InfiniBand director switches.]
11 11 11 1 1 11 2 1 11 11 1 1 11 2 11 11 11 1 1 11 2 111 11 1 1 11 2 1111 11 1 1 11 2 1 11 11 1 1 11 2 01 23 456 7 89 0 01 23 45 6 7 89 0
1 23 45 67 89 1 23 45 67 8 91 123 45 67 89 1 23 4567 8 91 123 45 67 89 1 23 4567 8 91
01 23 45 6 7 89 0 01 23 45 6 7 89 0 01 23 45 6 7 89 0 0123 45 6 7 89 0 0123 45 6 7 89 0 01 23 45 6 7 89 0 1 11 11 1 1 11 2 1 11 11 1 1 11 2
1 23 45 67 8 91 1 23 45 67 8 91 1 11 111 1 11 2 1 11 11 1 1 11 2
01 23 45 6 7 89 0 01 23 45 6 7 89 0 1 23 45 67 8 91 1 2345 67 8 91
11 11 11 1 1 11 2 1 11 11 1 1 11 2 11 11 11 1 1 11 2 111 11 1 1 11 2 1111 11 1 1 11 2 1 11 11 1 1 11 2 01 23 456 7 89 0 01 23 45 6 7 89 0
1 23 45 67 89 1 23 45 67 8 91 123 45 67 89 1 23 4567 8 91 123 45 67 89 1 23 4567 8 91
01 23 45 6 7 89 0 01 23 45 6 7 89 0 01 23 45 6 7 89 0 0123 45 6 7 89 0 0123 45 6 7 89 0 01 23 45 6 7 89 0
1 11 11 1 1 11 2 1 11 11 1 1 11 2 1 11 111 1 11 2 1 11 11 1 1 11 2
1 23 45 67 8 91 1 23 45 67 8 91 1 23 45 67 8 91 1 2345 67 8 91
11 11 11 1 1 11 2 1 11 11 1 1 11 2 11 11 11 1 1 11 2 111 11 1 1 11 2 1111 11 1 1 11 2 1 11 11 1 1 11 2 01 23 45 6 7 89 0 01 23 45 6 7 89 0 01 23 456 7 89 0 01 23 45 6 7 89 0
1 23 45 67 89 1 23 45 67 8 91 123 45 67 89 1 23 4567 8 91 123 45 67 89 1 23 4567 8 91
01 23 45 6 7 89 0 01 23 45 6 7 89 0 01 23 45 6 7 89 0 0123 45 6 7 89 0 0123 45 6 7 89 0 01 23 45 6 7 89 0 1 11 11 1 1 11 2 1 11 11 1 1 11 2
1 23 45 67 8 91 1 23 45 67 8 91 1 11 111 1 11 2 1 11 11 1 1 11 2
01 23 45 6 7 89 0 01 23 45 6 7 89 0 1 23 45 67 8 91 1 2345 67 8 91
11 11 11 1 1 11 2 1 11 11 1 1 11 2 11 11 11 1 1 11 2 111 11 1 1 11 2 1111 11 1 1 11 2 1 11 11 1 1 11 2 01 23 456 7 89 0 01 23 45 6 7 89 0
1 23 45 67 89 1 23 45 67 8 91 123 45 67 89 1 23 4567 8 91 123 45 67 89 1 23 4567 8 91
01 23 45 6 7 89 0 01 23 45 6 7 89 0 01 23 45 6 7 89 0 0123 45 6 7 89 0 0123 45 6 7 89 0 01 23 45 6 7 89 0
1 11 11 1 1 11 2 1 11 11 1 1 11 2 1 11 111 1 11 2 1 11 11 1 1 11 2
1 23 45 67 8 91 1 23 45 67 8 91 1 23 45 67 8 91 1 2345 67 8 91
01 23 45 6 7 89 0 01 23 45 6 7 89 0 01 23 456 7 89 0 01 23 45 6 7 89 0
IB
NET
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
60 x IB
NET
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
PSU PSU
FAN FAN
FAIL FAIL
The proposed design is based on an upgrade scenario where the new island must be integrated into the existing setup. Other variations are possible; for example, the new island can be used in a stand-alone approach.
An HPC/AI solution must fulfill the application's needs. CPU performance and core counts are steadily increasing, and accelerators further boost compute performance. However, parallel HPC/AI applications must exchange locally computed results with other nodes, so the ratio between computation and interconnect performance becomes increasingly important. There is no single solution for all needs; you must carefully validate the bandwidth and latency requirements to find the correct one. This chapter demonstrated some examples to start with, but modifications are necessary to meet your specific requirements.
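As an illustration of this ratio, a rough bytes-per-FLOP estimate can be computed for a node. The figures below are hypothetical placeholders, not measurements of any specific system:

```python
# Rough compute-to-interconnect balance for a hypothetical node.
# All performance figures are illustrative placeholders.
peak_flops = 500e12            # 500 TFLOPS peak compute per node (example value)
link_bandwidth = 2 * 12.5e9    # two ~100 Gbps (EDR-class) links, ~12.5 GB/s each

# Interconnect bytes available per floating-point operation: a lower value
# means the node is more compute-heavy relative to its network links.
bytes_per_flop = link_bandwidth / peak_flops
print(f"{bytes_per_flop:.6f} bytes of interconnect bandwidth per FLOP")
```

Comparing this figure against the application's actual communication volume per unit of computation indicates whether the interconnect or the compute side is the likely bottleneck.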
Furthermore, the blocking factor is an important value that should be chosen based on the application's communication characteristics.
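The blocking factor of a fat-tree edge switch is the ratio of node-facing (downlink) ports to uplink ports. A quick sketch of the arithmetic, assuming a hypothetical 40-port edge switch:

```python
from math import gcd

def blocking_factor(downlinks: int, uplinks: int) -> str:
    """Return the blocking factor as a reduced ratio string, e.g. '2:1'."""
    g = gcd(downlinks, uplinks)
    return f"{downlinks // g}:{uplinks // g}"

# 20 node ports and 20 uplinks yield a non-blocking (1:1) fabric;
# trading uplinks for node ports raises the blocking factor.
print(blocking_factor(20, 20))  # 1:1
print(blocking_factor(30, 10))  # 3:1
```

A higher blocking factor lowers cost per node port but reduces the bandwidth available when many nodes communicate across switches at the same time.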
As described in 2.3, “Ethernet” on page 22, LAN switches are used for this network because
they provide an affordable, stable, and secure infrastructure for the management of the
cluster.
Considerations
The amount of equipment that is involved in an HPC cluster ranges from low to high. Larger clusters can include a large amount of hardware that requires management. Providing a network that meets all your requirements is challenging because most of the requirements depend on the switches that you select, the topology that you use, and how the cluster is managed.
This network is the path to every device in your cluster, so correctly sizing the hardware and choosing the proper topology helps you avoid such problems on the management network.
Approaches
The following three approaches describe the design of an administration and OOB network.
Each approach accounts for all of the points that are described in Chapter 1, “Understanding
the requirements” on page 1, and uses different topologies and technologies. Other options
are available, although the following ones are based on our expertise:
Approach 1 Some LAN switch providers include fabric extenders in their portfolios, which enables individual switches to act as extensions of larger, more capable switches and provides many benefits to the infrastructure.
For example, a network that uses two InfiniBand director switches with
500 nodes each for a single HPC solution is a good scenario to
illustrate this approach.
Switches with large MAC address tables, multiple CPUs, and a large amount of memory must remain manageable. You can manage this network by using various tools, such as a single management dashboard that is provided by proprietary software, which enables you to correctly monitor and configure the equipment.
In this example, you use two InfiniBand director switches with 500 nodes each in a single solution, with three stacks of eight 48-port switches, and only two or four ports going up to the core from each stack. This solution provides fewer cables to the core, good performance within each stack, redundancy, less administration effort, and enough room to grow.
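The port arithmetic behind this example can be sketched as follows. The switch and port counts come from the text; the assumption that uplinks consume access ports is a simplification:

```python
# Port budget for the example above: 2 x 500 nodes,
# three stacks of eight 48-port switches, up to four uplinks per stack.
nodes = 1000
stacks = 3
switches_per_stack = 8
ports_per_switch = 48
uplinks_per_stack = 4          # "two or four" uplinks; upper bound used here

access_ports = stacks * switches_per_stack * ports_per_switch
core_cables = stacks * uplinks_per_stack
# Simplifying assumption: uplinks occupy regular switch ports.
spare_ports = access_ports - nodes - core_cables

print(access_ports, core_cables, spare_ports)  # 1152 12 140
```

With 140 spare ports, the design leaves room for growth while keeping only a dozen cables running to the core.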
All three approaches provide a common OOB management network that is accessible, has
proper performance and room for growth, and can include other services, such as bridging to
the production network or access to the wide area network (WAN).
Recommendations
Designing the correct administration or OOB network is the first step. The second step is to
provide the following minimum requirements:
Servers that host the support and management tools should be accessible from any location inside and outside the OOB network.
Include failover for connectivity.
Ensure that all tools have what they need to access the nodes, including separation
through VLANs.
Increase business continuity by creating the correct recovery plan for the fabric and the
OOB.
Keep inventory and equipment locations current.
These minimum requirements are only part of a broad range of options to consider.
Conclusion
The OOB and HPC networks are completely and physically separated in terms of traffic and functions. Ensure that the administration or OOB network is a powerful management tool because it remains available even if the entire HPC cluster fails.
The publications that are listed in this section are considered suitable for a more detailed
description of the topics that are covered in this paper.
IBM Redbooks
The following IBM Redbooks publications provide more information about the topics in this
document. Some publications that are referenced in this list might be available in softcopy
only.
IBM PowerAI: Deep Learning Unleashed on IBM Power Systems Servers, SG24-8409
IBM Power System AC922 Introduction and Technical Overview, REDP-5472
HPC Clusters Using InfiniBand on IBM Power Systems Servers, SG24-7767
Implementing InfiniBand on IBM System p, SG24-7351
You can search for, view, download, or order these documents and other Redbooks,
Redpapers, web docs, drafts, and additional materials, at the following website:
ibm.com/redbooks
Online resources
These websites are also relevant as further information sources:
IBM Power Systems, Storage, and Applications:
– https://www-03.ibm.com/systems/power/solutions/bigdata-analytics/smartpaper/
– https://www.ibm.com/us-en/marketplace/ibm-elastic-storage-server
– https://www.ibm.com/us-en/marketplace/scale-out-file-and-object-storage
– https://www.ibm.com/us-en/marketplace/spectrum-mpi
– https://www.ibm.com/us-en/marketplace/deep-learning-platform
InfiniBand, RDMA over Converged Ethernet (RoCE), and Ethernet:
– https://en.wikipedia.org/wiki/InfiniBand
– https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
– https://en.wikipedia.org/wiki/Remote_direct_memory_access
– http://www.mellanox.com/page/products_dyn?product_family=79
– http://www.mellanox.com/related-docs/solutions/deploying-hpc-cluster-with-mellanox-infiniband-interconnect-solutions-archive.pdf
– https://community.mellanox.com/docs/DOC-2402
– http://www.roceinitiative.org/
– https://www.cisco.com/c/dam/en_us/solutions/industries/docs/education/ethernet-solutions-high-performance-computing-education.pdf
REDP-5478-00
ISBN 0738456837
Printed in U.S.A.