HABANA® GAUDI®2 WHITE PAPER
JUNE 2022
I. INTRODUCTION
The main benefit that current customers of our first-generation Gaudi see is the price-performance advantage relative to GPU solutions for popular vision and language models. It enables customers to train more and pay less, and thereby accelerate time-to-market for their model training. With Gaudi2, we are pleased to extend these benefits beyond better price-performance to performance leadership over the leading 7nm GPU shipping today, the Nvidia A100.
Before we address the architecture details, here are key benchmarks for Gaudi2 at the time of publication of this white paper, reflecting the SynapseAI® Software Suite Release 1.5, effective June 16, 2022.
A100-80GB: Measured by Habana on Azure instance Standard_ND96amsr_A100_v4 using a single A100-80GB with TF docker 22.03-tf2-py3 from NGC (Phase-1: seq len=128, BS=312, accu steps=256; Phase-2: seq len=512, BS=40, accu steps=768)
A100-40GB: Measured by Habana on DGX-A100 using a single A100-40GB with TF docker 22.03-tf2-py3 from NGC (Phase-1: seq len=128, BS=64, accu steps=1024; Phase-2: seq len=512, BS=16, accu steps=2048)
V100-32GB: Measured by Habana on p3dn.24xlarge using a single V100-32GB with TF docker 21.12-tf2-py3 from NGC (Phase-1: seq len=128, BS=64, accu steps=1024; Phase-2: seq len=512, BS=8, accu steps=4096)
Gaudi2: Measured by Habana on an HLS-Gaudi2 system using a single Gaudi2 with SynapseAI TF docker 1.5.0 (Phase-1: seq len=128, BS=64, accu steps=1024; Phase-2: seq len=512, BS=16, accu steps=2048)
Habana reported these results on models ported from first-generation Gaudi to Gaudi2, based on the SynapseAI® Software Suite release 1.5. While our OEM partners are building servers for general consumption, Habana's engineers are porting and developing additional deep-learning models, with software releases on a 6-8 week cadence. You can track our progress through our public GitHub and the developer.habana.ai site.
The diagram below shows what is usually observed on GPUs, where GEMM compute and general-purpose core execution time do not overlap:
[Figure: GPU execution timeline showing GEMM and non-GEMM operations executing one after the other, without overlap]
On the Gaudi and Gaudi2 architectures, the MME and TPC compute time overlaps. The GEMM and non-GEMM operations are mostly overlapped, dramatically accelerating the workload.
[Figure: Gaudi execution timeline showing GEMM (MME) and non-GEMM (TPC) operations overlapping in time]
Another big difference between the GPU and Gaudi architectures is the size of the matrix-multiplication accelerator. This may seem minor, but it has a big effect on the overall ability to utilize those accelerators, especially as matrix sizes become smaller. The diagram below compares a single 256x256 matrix accelerator (on the left) to 256 small 16x16 matrix accelerators (on the right); the depth dimension is omitted to simplify the explanation. From a compute perspective the two are equivalent, but from a bandwidth perspective they differ sharply: the left one requires 512 input elements per cycle to keep its compute utilized, whereas the right side requires 8K input elements per cycle, a 16x difference in read-bandwidth requirements toward the first-level memory.
[Figure: a single 256x256 accelerator fed at 256 elements per cycle per operand (left) versus 256 independent 16x16 accelerators fed at 256x16 elements per cycle per operand (right)]
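The read-bandwidth gap quoted above follows directly from the tile geometry; a few lines of Python reproduce the 512 vs. 8K figures (illustrative arithmetic only, the function name is ours and not any Habana API):

def input_elems_per_cycle(tile, count):
    # Each square tile engine consumes one row of `tile` elements from each
    # of its two operands per cycle.
    return count * 2 * tile

print(input_elems_per_cycle(256, 1))   # 512:  one large 256x256 engine
print(input_elems_per_cycle(16, 256))  # 8192: 256 small 16x16 engines
print(input_elems_per_cycle(16, 256) // input_elems_per_cycle(256, 1))  # 16x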
GPUs, which implement the right-hand approach, compensate for this phenomenon by mandating a large reuse factor from their hierarchical caching memory subsystem; as a result, they need a very large matrix-multiplication problem to fully utilize their multipliers. Gaudi2 (and some other dedicated tensor processors), which implement the left-hand approach, can utilize their multipliers easily while leaving plenty of free bandwidth and capacity in their flat memory subsystem for tasks other than matrix multiplication.
Such high utilization on small tensors significantly eases the overlapping of MME and TPC computation: to allow tight overlap, an operation is sliced as described below, which creates smaller tensors for the MME and TPC to operate upon.
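To make the slicing idea concrete, here is a conceptual Python sketch (ours, not Habana code). It splits one GEMM-plus-activation into row slices so that, once slice i has passed the matrix engine, its activation can be processed while slice i+1 is being multiplied. In this sequential simulation the two steps simply alternate; on Gaudi, the MME and TPC would run them concurrently.

import numpy as np

def sliced_gemm_relu(A, B, num_slices=4):
    # Split the output rows into slices small enough for the engines to overlap.
    slices = np.array_split(np.arange(A.shape[0]), num_slices)
    out = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype)
    pending = []  # slices whose "MME" (GEMM) step has finished
    for idx in slices:
        pending.append((idx, A[idx] @ B))      # "MME" step on slice i
        if len(pending) > 1:                   # "TPC" step on slice i-1
            pidx, pres = pending.pop(0)        # (concurrent on real hardware)
            out[pidx] = np.maximum(pres, 0.0)  # ReLU
    pidx, pres = pending.pop(0)                # drain the last slice
    out[pidx] = np.maximum(pres, 0.0)
    return out

A = np.random.randn(256, 128).astype(np.float32)
B = np.random.randn(128, 64).astype(np.float32)
assert np.allclose(sliced_gemm_relu(A, B), np.maximum(A @ B, 0.0), atol=1e-4)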
Gaudi2 integrates Habana's fourth-generation Tensor Processor Core. The TPC is a general-purpose VLIW processor, 256B SIMD wide, that supports FP32, BF16, FP16 and FP8 (both E4M3 and E5M2), in addition to the INT32, INT16 and INT8 data types. As opposed to common DSPs, which require a DMA to fetch operands in and out of a local SRAM, the TPC uses advanced micro-architectural techniques to expose a DMA-free programming model, which significantly eases software development. In addition, the same advanced microarchitecture allows bubble-free execution between kernels, which keeps the TPC effectively 100% utilized on tensor processing even for very short kernels, regardless of where its inputs and outputs reside (SRAM or HBM). Just like the MME, the TPC is also very efficient when working on small tensors.

As deep learning training is usually spread across multiple devices, Gaudi2's Network Interface Controllers (NICs) are an essential component of the overall Habana second-generation training solution. Gaudi's NIC is customized to fit the distribution of a DNN graph between the chips in the network (AKA scale-out). The NIC provides the compute engine with remote direct memory access (RDMA) featuring high bandwidth and low latency over a reliable connection, without any software intervention. To fit common cloud infrastructure, the NIC ports use Ethernet connectivity with an aggregated bandwidth of 2.4 Tb/s, supporting multiple port configurations. The NIC implements the RoCE v2 specification, benefiting from the commonly used Ethernet infrastructure and the reliable, low-latency RDMA of the InfiniBand protocol, while extending RoCE scalability with a flexible time-based congestion-control algorithm that enables linear scalability over thousands of Gaudi systems.

DNN topologies tend to use collective operations extensively, and posting collective operations on multiple ports usually requires significant CPU horsepower. To reduce CPU utilization, a scalable collective offload was introduced on Gaudi2, which helps make Gaudi2's message rate more than an order of magnitude better than the competition's. Gaudi's NICs are also aligned with all the other engines on the chip and can access both local and remote memory in tensor semantics.

To summarize, the Gaudi heterogeneous architecture is unique in that it is highly efficient on small-tensor operations, which is an enabler for overlapping computation and networking communication between the heterogeneous agents, in addition to freeing up significant memory capacity and bandwidth in its memory subsystem.
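As an illustration of how these NIC capabilities surface to software, the sketch below initializes PyTorch distributed communication over the Gaudi NICs using the HCCL backend. The module path and backend name follow Habana's public documentation, but exact names should be verified against your SynapseAI release.

import torch
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # registers the "hccl" backend

# Rank and world size are taken from the launcher's environment variables.
dist.init_process_group(backend="hccl")
t = torch.ones(1024, device="hpu")
dist.all_reduce(t)  # collective executed over the Gaudi RoCE NICs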
Subgraphs with blue nodes are identified and encapsulated. The original graph is modified to replace the subgraphs with their corresponding encapsulated nodes. The framework runtime then executes the modified graph. Per node, a corresponding SynapseAI graph is created and compiled. For performance optimization, the compilation recipe is cached for future use. After allocating memory, the recipe is enqueued for execution on a SynapseAI stream.

SynapseAI supports distributed training with TensorFlow using Horovod and the tf.distribute API with HPUStrategy. Mixed precision execution is available via the tf.keras.mixed_precision API or using Habana's automated mixed-precision conversion. These enable you to run mixed precision training without extensive modifications to existing FP32 model scripts. More details are available in the TensorFlow section on docs.habana.ai.

The SynapseAI PyTorch bridge interfaces between the framework and the SynapseAI software stack to train PyTorch-based deep learning models on Gaudi. We support two modes of execution: (1) Eager mode, which performs operator-by-operator execution as defined in standard PyTorch eager-mode scripts, and (2) Lazy mode, which performs deferred execution of graphs comprising a collection of operators. Lazy mode provides a user experience like Eager mode while enabling high performance on Gaudi. By default, we enable lazy-mode execution. Instead of executing one operator at a time, the SynapseAI bridge internally accumulates the operators in a graph. Execution of the accumulated operators is triggered in a "lazy" manner, only when a tensor value is required by the user. This allows the bridge to construct a graph that gives the SynapseAI graph compiler the opportunity to optimize device execution for the operators.

Mixed precision execution is available via the Habana Mixed Precision (HMP) package. The HMP package automatically modifies the Python operators to add the appropriate cast operations, enabling you to run mixed precision training without extensive modifications to existing FP32 model scripts. The SynapseAI PyTorch bridge supports distributed training using the torch.distributed and torch.nn.parallel.DistributedDataParallel APIs for both data and model parallelism, with distributed communication enabled through the HCCL backend. For more details, check out the PyTorch section on docs.habana.ai.

SynapseAI is also integrated with TensorBoard to enable debugging and profiling of your TensorFlow or PyTorch models. Users interested in low-level focused profiling can refer to the SynapseAI Profiler User Guide on docs.habana.ai.
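To make the lazy-mode description concrete, here is a minimal training-step sketch, assuming the habana_frameworks.torch bridge described above (verify exact module names against your SynapseAI release):

import torch
import habana_frameworks.torch.core as htcore  # SynapseAI PyTorch bridge

device = torch.device("hpu")  # Gaudi is exposed to PyTorch as the "hpu" device
model = torch.nn.Linear(784, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(128, 784, device=device)
y = torch.randint(0, 10, (128,), device=device)

optimizer.zero_grad()
loss = criterion(model(x), y)  # operators are accumulated, not executed yet
loss.backward()
optimizer.step()
htcore.mark_step()  # flushes the accumulated lazy-mode graph to the device
print(loss.item())  # reading a tensor value also triggers execution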
At Habana, designing and developing the hardware for high-performance, efficient DL processors accounts for a relatively small portion of our effort; the majority is dedicated to leveraging that hardware with the right software, tools and support you need to make your workloads and models run efficiently, with accuracy and speed.

In addition to the SynapseAI software suite, which is designed for performance and usability, we have also published a wealth of information and resources to make it easy for you to get started with training on Gaudi processors. The Habana Developer Site is the hub for Habana developers, where you will find the content, guidance, tools, and support needed to easily and flexibly build new AI models, migrate existing ones, and optimize their performance on our AI processors. The Resources section contains a collection of documents, short videos and hands-on Jupyter notebook tutorials to help you get started with running models on Gaudi. And for IT and systems administrators building Gaudi-based systems on premises, we provide guidance on set-up and management of Gaudi servers and computing infrastructure.

The Habana GitHub contains repositories open to the general public, which include setup and install instructions for Habana binaries and docker creation, Jupyter notebook-based tutorials, reference models, a custom TPC kernel example, and more. Our Model-References repository contains 30+ popular TensorFlow and PyTorch models that have been ported to Gaudi, and the Model Performance page provides the latest performance results for these models. The Habana Developer Site also has a searchable catalog of SynapseAI container images and TensorFlow and PyTorch reference models. For more information on future model support, please refer to our SynapseAI model roadmap page. Each model comes with model scripts and instructions on how to run it on Gaudi. We are committed to continuously expanding our model coverage and providing a wide variety of examples for users.
V. MODEL MIGRATION
Below we show the minimum set of changes required to port a TensorFlow Keras
model that does not contain any custom kernels.
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module  # Habana-specific
load_habana_module()  # registers the Gaudi (HPU) device with TensorFlow

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128)
model.evaluate(x_test, y_test)
The minimal changes to enable training on the Habana Gaudi device are the first two Habana-specific lines. All you need is to import load_habana_module and then invoke the load_habana_module() function to enable training on Gaudi. With this change, the Gaudi device, referred to as HPU in the framework, is registered in TensorFlow and prioritized for execution over the CPU. When an operator is available for both CPU and HPU, the operator is assigned to the HPU; when it is not supported on Gaudi, it runs on the CPU. For more details on porting your TensorFlow model to Gaudi processors, check out the TensorFlow Migration Guide.
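If you want to confirm where operators land, stock TensorFlow APIs are sufficient; the short sketch below uses only standard tf.debugging and tf.config calls plus the Habana module load shown above:

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()
tf.debugging.set_log_device_placement(True)  # log the device chosen for each op
print(tf.config.list_logical_devices())      # the list should now include an HPU device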
At Habana, our aim is to meet developers where they are. We have been collaborating with AI software ecosystem partners to enable a seamless user experience with Habana AI processors.
There are two main classes one needs to know: (1) the GaudiTrainer class, which takes care of compiling (in lazy or eager mode) and distributing the model to run on HPUs, as well as performing training and evaluation, and (2) the GaudiConfig class, which configures Habana Mixed Precision and decides whether optimized operators and optimizers should be used. The GaudiTrainer is very similar to the Transformers Trainer, and adapting a script that uses the Trainer to work with Gaudi mostly consists of simply swapping the Trainer class for the GaudiTrainer one. The example below shows how simple it is to get started with training Transformer models on Gaudi. Several popular reference models are available on the Hugging Face Habana page, including bert-base, bert-large, roberta-base, roberta-large, distilbert-base, albert-large and albert-xxlarge.
...
# Loading the GaudiConfig needed by the GaudiTrainer to fine-tune the model on HPUs
gaudi_config = GaudiConfig.from_pretrained(
    training_args.gaudi_config_name,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
)

trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    # The training arguments differ a bit from the original ones, which is
    # why we use GaudiTrainingArguments
    args=training_args,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
Getting started with training on Gaudi using PyTorch Lightning is just as easy. All you need is to provide the accelerator="hpu" parameter to the Trainer class and select the number of Gaudi processors by setting the devices parameter. For mixed precision training, import the HPUPrecisionPlugin and set precision=16, as sketched below.
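A minimal sketch, assuming the PyTorch Lightning 1.6-era HPU integration (class and parameter names per the Lightning documentation of that time; verify against your installed version):

import pytorch_lightning as pl
from pytorch_lightning.plugins import HPUPrecisionPlugin

trainer = pl.Trainer(
    accelerator="hpu",                            # run on Habana Gaudi
    devices=8,                                    # number of Gaudi processors
    plugins=[HPUPrecisionPlugin(precision=16)],   # mixed precision training
)
# trainer.fit(model, datamodule=dm)  # model/dm are your own LightningModule/DataModule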
With cnvrg.io, data scientists can deploy more models with drag-and-drop machine learning pipelines. You can easily run and track experiments, and automate your machine learning from research to production using reusable components and a drag-and-drop interface. Getting started with Habana Gaudi on cnvrg.io first requires setting up a Kubernetes cluster for your on-premises Gaudi servers or an Amazon EKS cluster using DL1 EC2 instances; cnvrg.io seamlessly integrates both on-premises and cloud compute resources. The Habana Vault, which hosts the SynapseAI TensorFlow and PyTorch Docker container images, is integrated and available in cnvrg.io Registries. You can bring up a new Jupyter workspace, selecting the appropriate Gaudi compute and Docker image from the cnvrg.io Habana container registry. You can then get started with the Habana reference models by simply adding the repo location in the cnvrg Project Settings Git Integration page. Now you can start a new Experiment in cnvrg.io and begin training your model on Gaudi.
Habana has worked with server, switch, and storage system partners to make it easy for end customers to build AI racks and clusters.
The figure below shows a rack-scale configuration with four Gaudi servers
connected to a single Ethernet switch at the top of the rack. This switch can be
further connected to other racks to form a much larger training pod that can hold
hundreds or thousands of Gaudi processors.
The DDN A3I scalable architecture integrates X12 Gaudi AI servers with DDN AI shared parallel file storage appliances and delivers fully optimized end-to-end AI acceleration on Habana Gaudi AI processors. DDN A3I solutions greatly simplify the deployment of X12 Gaudi AI servers in single-server and multi-server configurations, while also delivering the performance and efficiency needed for maximum saturation of Habana Gaudi AI processors, and high levels of scalability.
This section describes the components integrated in DDN A3I solutions with Supermicro X12 Gaudi AI servers.
FIGURE 14. DDN A3I REFERENCE ARCHITECTURE WITH FOUR X12 GAUDI AI SERVERS
[Figure: four Supermicro X12 Gaudi AI servers; each server has 2 links to the storage & cluster management network switch and 6 links to the Gaudi network switch, and the management switch connects to the DDN AI400X2 appliance over 8 links]
Additionally, the X12 Gaudi AI servers are connected through a network switch
for Gaudi communication. Every X12 Gaudi AI server connects to the Gaudi
network switch via six 400 GbE links.
[Figure fragment: two HL-225H OAMs (OAM6 and OAM7), each with 21 x 100G RoCE links for on-board connectivity and 3 x 100G RoCE links routed to QSFP-DD ports for scale-out]
X. HLS-GAUDI®2 Server
The HLS-Gaudi®2 system is a high-performance deep-learning server incorporating a dual-socket Xeon host subsystem and eight Gaudi2 accelerators, and it supports scale-out through 24 x 100GbE RDMA ports.
[Table fragment: HLS-Gaudi2 features include 2x PCIe switch, HIB, BMC + peripherals, and HLBA-225]